Build RAG with LangChain: The Standard Production Approach
You ask an LLM about your company's refund policy and it confidently invents one. You paste a 200-page PDF into the prompt and blow past the context window. You split the PDF into chunks and the model ignores the relevant ones. These are the three problems that Retrieval-Augmented Generation solves, and LangChain is the framework most production teams reach for first.
What Is RAG and Why Does Every Production Team Need It?
RAG stands for Retrieval-Augmented Generation. Instead of asking an LLM to answer from its training data alone, you first retrieve the relevant documents from your own data, then augment the LLM's prompt with those documents, and finally the LLM generates an answer grounded in your actual content.
I think of it as giving the model an open-book exam. Without RAG, the model answers from memory — which means hallucinations about your specific data. With RAG, you hand it the exact pages it needs, and it synthesises an answer from those pages.
The RAG pipeline has five stages, and every production system follows this same structure:
1. Load — read raw documents from their sources
2. Split — break documents into retrievable chunks
3. Embed — convert each chunk into a vector
4. Store — index the vectors in a vector store
5. Retrieve and generate — find the chunks most relevant to a question and pass them to the LLM
LangChain is the most widely adopted framework for building these pipelines. It provides standardised interfaces for each stage — document loaders, text splitters, embedding models, vector stores, and retrieval chains — so you can swap components without rewriting your pipeline. Whether you use OpenAI or Anthropic, ChromaDB or Pinecone, the code structure stays the same.
Stage 1: Loading Documents into LangChain
Before the LLM can answer questions about your data, LangChain needs to read that data. LangChain provides over 100 document loaders — one for each source type. The loader reads the raw content and returns a list of Document objects, each containing the text and metadata (source file, page number, etc.).
Let's start with the simplest case: loading a text file. Then we'll add PDFs and web pages.
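A minimal sketch of the loading step, assuming the langchain-community package is installed and a local file named policies.txt exists (both the package layout and the filename are assumptions, not from the original):

```python
from langchain_community.document_loaders import TextLoader

# Load a plain-text file; returns a list containing one Document
loader = TextLoader("policies.txt", encoding="utf-8")
docs = loader.load()

print(docs[0].page_content[:200])  # first 200 characters of the text
print(docs[0].metadata)            # source information, e.g. the filename
```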
Each Document object has two fields: page_content (the actual text as a string) and metadata (a dictionary with source information). The metadata becomes important later when you want to show the user where the answer came from.
For PDFs, LangChain splits by page automatically. Each page becomes its own Document:
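A sketch of the PDF case, assuming the pypdf package is installed and a file named handbook.pdf exists (hypothetical name):

```python
from langchain_community.document_loaders import PyPDFLoader

# Each page of the PDF becomes its own Document,
# with the page number recorded in metadata
loader = PyPDFLoader("handbook.pdf")
pages = loader.load()

print(len(pages))         # number of pages loaded
print(pages[0].metadata)  # includes "source" and "page"
```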
In production, I almost always load from multiple sources. LangChain's DirectoryLoader handles an entire folder at once. You specify a glob pattern and the loader class for that file type:
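A sketch of bulk loading, assuming a docs/ folder of .txt files (the folder name and glob pattern are illustrative):

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file under ./docs, recursively
loader = DirectoryLoader(
    "docs/",
    glob="**/*.txt",
    loader_cls=TextLoader,
)
docs = loader.load()
print(f"Loaded {len(docs)} documents")
```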
Stage 2: Splitting Documents into Chunks
This is where most RAG pipelines silently fail. A 50-page document is too large to fit in a single embedding or a single prompt. You have to split it into smaller chunks. But split badly — in the middle of a sentence, or with chunks so small they lose context — and your retrieval will return irrelevant fragments.
LangChain's RecursiveCharacterTextSplitter is the workhorse here. It tries to split on paragraph boundaries first, then sentences, then words — falling back to smaller separators only when necessary. Two parameters control everything: chunk_size (maximum characters per chunk) and chunk_overlap (how many characters overlap between adjacent chunks).
The separators list defines the split hierarchy. The splitter tries \n\n (paragraph breaks) first. If a paragraph is still larger than chunk_size, it falls back to \n (line breaks), then . (sentences), then spaces, and finally individual characters as a last resort.
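A sketch of the splitter configuration described above, assuming the langchain-text-splitters package and a `docs` list from the loading stage:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    # Split hierarchy: paragraphs, then lines, then sentences,
    # then words, then individual characters as a last resort
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(docs)  # docs from the loading stage
```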
The chunk_overlap parameter is subtle but critical. When you split a document at position 1000, the next chunk starts at position 800 (with 200-character overlap). This means a sentence that was cut at the boundary still appears complete in at least one chunk. Without overlap, you lose context at every split point.
Let me show you what happens with different chunk sizes on the same document — this is the kind of experiment I run on every new RAG project before committing to a configuration:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0  # chunks are isolated islands
)
# Result: a sentence cut at position 500 is
# incomplete in BOTH chunks

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100  # 100 chars shared
)
# Result: a boundary sentence appears
# complete in at least one chunk

Stage 3: Embeddings — Converting Text to Vectors
Here is the core idea behind all of RAG retrieval: you convert text into a list of numbers (a vector) such that texts with similar meaning end up with similar vectors. "How do I return a product?" and "What is your refund policy?" have almost no words in common, but their embedding vectors will be very close — because they mean roughly the same thing.
This is fundamentally different from keyword search. A keyword search for "refund" would miss the first question entirely. An embedding-based search finds it instantly because the meaning is similar.
The text-embedding-3-small model converts any text into a 1536-dimensional vector. Each number in that vector captures some aspect of the text's meaning. You never interpret these numbers directly — you compare them using cosine similarity.
To understand why this works, let's embed three sentences and measure how similar they are to each other. Two are about refunds, one is about the weather — the embeddings should reflect that:
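A sketch of that experiment, assuming the langchain-openai package and an OPENAI_API_KEY in the environment (this calls the OpenAI API, so it will not run offline):

```python
import math
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

sentences = [
    "How do I return a product?",
    "What is your refund policy?",
    "Will it rain in Paris tomorrow?",
]
vectors = embeddings.embed_documents(sentences)

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print("return vs refund: ", cosine(vectors[0], vectors[1]))  # expect high
print("return vs weather:", cosine(vectors[0], vectors[2]))  # expect lower
```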
The return/refund pair will score much higher (typically 0.85+) than either compared to the weather question (typically 0.3-0.5). This is the foundation of RAG: when the user asks a question, embed the question, then find the document chunks whose vectors are closest to the question vector.
Stage 4: Storing Vectors in ChromaDB
You have chunks and you can embed them. Now you need somewhere to store those vectors so you can search them later. That is what a vector store does — it is a database optimised for similarity search over high-dimensional vectors.
ChromaDB is the most common choice for development and small-to-medium production systems. It runs locally (no server needed), stores data on disk, and handles the embedding step for you if you pass it an embedding model. For larger production deployments, teams typically move to Pinecone, Weaviate, or pgvector — but the LangChain interface stays the same.
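A sketch of the ingestion call, assuming the langchain-chroma and langchain-openai packages, a `chunks` list from the splitting stage, and an OPENAI_API_KEY in the environment:

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embeds every chunk, stores vectors + text + metadata, persists to disk
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
```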
That single call did a lot: it embedded every chunk using the OpenAI embedding model, stored the vectors alongside the original text and metadata in ChromaDB, and saved everything to disk. Next time you restart your program, you can reload the store without re-embedding:
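Reloading looks like this, assuming the same `embeddings` object and persist directory as at ingestion time:

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Reload the persisted store; nothing is re-embedded
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,  # must be the same model used at ingestion
)
```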
Now the critical part — searching. The similarity_search method embeds your query, finds the most similar chunk vectors, and returns the corresponding documents:
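A sketch, assuming a `vectorstore` built as above (the query string is illustrative):

```python
results = vectorstore.similarity_search(
    "How many remote days are allowed?",
    k=3,  # return the 3 most similar chunks
)
for doc in results:
    print(doc.metadata.get("source"), "->", doc.page_content[:80])
```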
The k=3 parameter means "return the 3 most similar chunks." In practice, I usually retrieve 3-5 chunks. Too few and you might miss relevant context. Too many and you flood the prompt with marginally relevant text, which dilutes the answer quality.
For more control, use similarity_search_with_score to see the actual scores. This helps you set a relevance threshold. Note that ChromaDB returns a distance by default, where lower means more similar, so chunks beyond a certain distance are probably not relevant:
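A sketch, again assuming a `vectorstore` built as in Stage 4:

```python
results = vectorstore.similarity_search_with_score("refund policy", k=5)
for doc, score in results:
    # With Chroma the score is a distance: lower means more similar,
    # so filtering means keeping results under a distance threshold
    print(round(score, 3), doc.page_content[:60])
```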
Stage 5: The Retrieval Chain — Asking Questions
We have documents in a vector store and we can search them. The final step connects retrieval to generation: search for relevant chunks, inject them into a prompt, and send that prompt to an LLM. LangChain calls this a retrieval chain.
I want to build this up in two steps. First, the manual way — so you see exactly what the chain is doing under the hood. Then the LangChain shorthand that does the same thing in three lines.
The Manual Approach
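A sketch of the retrieve-stuff-generate pattern, assuming a `vectorstore` from Stage 4 and the langchain-openai package (the model name gpt-4o-mini is just an example):

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

question = "How many days per week can employees work remotely?"

# 1. Retrieve: find the chunks most similar to the question
docs = vectorstore.similarity_search(question, k=3)
context = "\n\n".join(doc.page_content for doc in docs)

# 2. Augment: stuff the retrieved chunks into a prompt template
prompt = f"""Answer based ONLY on the following context.
If the context does not contain the answer, say so.

Context:
{context}

Question: {question}"""

# 3. Generate: the LLM answers from the provided context
answer = llm.invoke(prompt)
print(answer.content)
```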
That is the entire RAG pattern. Retrieve relevant context, stuff it into a prompt template, and let the LLM generate from that context. The instruction "Answer based ONLY on the following context" is what prevents hallucination — the model is told to work from the provided documents, not from its training data.
The LangChain Way — RetrievalQA Chain
LangChain wraps this retrieve-stuff-generate pattern into a reusable chain. The RetrievalQA chain does exactly what we did manually, but it is composable and configurable:
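A sketch of the chain version, under the same assumptions as the manual approach:

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    chain_type="stuff",  # put all retrieved chunks in one prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,  # return the chunks behind the answer
)

result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
for doc in result["source_documents"]:
    print("Source:", doc.metadata.get("source"))
```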
The chain_type="stuff" parameter means "stuff all retrieved chunks into a single prompt." This works for most cases. For very large retrievals where the combined chunks exceed the context window, LangChain also offers map_reduce (summarise each chunk separately, then combine) and refine (iteratively improve the answer with each chunk).
Setting return_source_documents=True is something I do on every RAG chain. It returns the actual chunks the answer was based on, so you can show citations to the user. This builds trust — the user can verify the answer against the original text.
Real-World Example: Q&A Over Technical Documentation
Let's put every stage together in one complete script. We'll build a Q&A system over a collection of technical documents — the kind of thing I've built at least a dozen times for internal company knowledge bases.
Since not everyone has PDF files ready, we will create sample documents from strings. The pipeline is identical whether the text comes from files or from strings — the Document objects are the same:
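A condensed sketch of the full pipeline, assuming the langchain, langchain-chroma, and langchain-openai packages and an OPENAI_API_KEY; the document contents and model names are illustrative:

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

# 1. Load: sample documents built from strings
raw_docs = [
    Document(
        page_content="Remote Work Policy. Employees may work remotely "
                     "up to 3 days per week with manager approval.",
        metadata={"source": "hr_policies.txt"},
    ),
    Document(
        page_content="Refund Policy. Customers may request a full refund "
                     "within 30 days of purchase.",
        metadata={"source": "support_policies.txt"},
    ),
]

# 2. Split into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
chunks = splitter.split_documents(raw_docs)

# 3 & 4. Embed and store
vectorstore = Chroma.from_documents(
    chunks, OpenAIEmbeddings(model="text-embedding-3-small")
)

# 5. Retrieve and generate
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
)

for question in [
    "How many days per week can employees work remotely?",
    "What is the refund window?",
    "What is the company's stock option plan?",  # not in the documents
]:
    result = qa.invoke({"query": question})
    print(question, "->", result["result"])
```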
Notice the last question: "What is the company's stock option plan?" Our documents say nothing about stock options. Because the prompt says "Answer using ONLY the provided context," the LLM should respond that it does not have that information. This is the hallucination guard in action — the prompt constrains the model to the retrieved documents.
Exercise 1: Implement Chunking with Overlap
Write a function chunk_document(text, chunk_size, overlap) that splits a text string into overlapping chunks. The function should:
1. Split the text into chunks of at most chunk_size characters
2. Each chunk (except the first) should overlap with the previous chunk by overlap characters
3. Return a list of chunk strings
For example, with chunk_size=10 and overlap=3, the text "abcdefghijklmno" should produce ["abcdefghij", "hijklmno"] (the second chunk starts 3 characters before the end of the first).
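One possible reference solution (try it yourself first; this is a sketch, not the only valid answer):

```python
def chunk_document(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters.
    Each chunk after the first starts `overlap` characters before
    the end of the previous chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # previous chunk already reached the end of the text
        start = end - overlap
    return chunks

print(chunk_document("abcdefghijklmno", 10, 3))
# ['abcdefghij', 'hijklmno']
```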
Tuning Retrieval — Getting the Right Chunks
The retriever is the most important component to get right. A perfect LLM fed irrelevant chunks will produce a useless answer. A mediocre LLM fed the right chunks will produce a great answer. This is why experienced RAG engineers spend 80% of their time on retrieval quality, not prompt engineering.
Maximum Marginal Relevance (MMR)
Standard similarity search has a problem: the top-k results often all say roughly the same thing. If your top 3 chunks are all from the same paragraph (just different overlapping windows), you are wasting 2 of your 3 slots. MMR fixes this by balancing relevance and diversity — each selected chunk must be relevant to the query but also different from chunks already selected.
The lambda_mult parameter controls the relevance-diversity tradeoff. At 1.0, MMR behaves like standard similarity search. At 0.0, it maximises diversity (which means some results may be less relevant). I find 0.5-0.7 works well for most use cases.
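A sketch of an MMR retriever, assuming a `vectorstore` from Stage 4:

```python
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,            # chunks to return
        "fetch_k": 20,     # candidates to consider before diversifying
        "lambda_mult": 0.6 # 1.0 = pure relevance, 0.0 = pure diversity
    },
)
docs = retriever.invoke("What is the remote work policy?")
```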
Metadata Filtering
When your vector store contains documents from multiple sources — say, HR policies and engineering docs — you often want to restrict search to a specific subset. ChromaDB supports metadata filtering so you can search only within relevant documents:
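A sketch, assuming chunks were ingested with a "source" key in their metadata (the filter value is illustrative):

```python
# Restrict the search to chunks whose metadata matches the filter
results = vectorstore.similarity_search(
    "What is the parental leave policy?",
    k=3,
    filter={"source": "hr_policies.txt"},  # Chroma where-style filter
)
```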
Exercise 2: Cosine Similarity Retrieval
Implement a find_most_similar(query_vec, doc_vectors, doc_names, top_k) function that finds the most similar documents to a query using cosine similarity.
Cosine similarity between vectors A and B is: dot(A, B) / (norm(A) * norm(B))
The function should return a list of (name, score) tuples sorted by score (highest first), limited to top_k results.
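One possible reference solution, using only the standard library:

```python
import math

def cosine_similarity(a, b):
    # dot(A, B) / (norm(A) * norm(B))
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def find_most_similar(query_vec, doc_vectors, doc_names, top_k):
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in zip(doc_names, doc_vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 2-D example: doc_a points the same way as the query
print(find_most_similar(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    ["doc_a", "doc_b", "doc_c"],
    top_k=2,
))
```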
Common Mistakes and How to Fix Them
Mistake 1: Chunks Too Small or Too Large
Chunks that are too small (50-100 characters) lose context. The embedding captures the meaning of a sentence fragment, but when that fragment is retrieved, it does not provide enough information for the LLM to generate a useful answer. Chunks that are too large (5000+ characters) create noisy embeddings that match too many queries.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # fragments lose meaning
    chunk_overlap=20
)
# Retrieved chunk: "up to 3 days per week"
# LLM: "3 days per week... of what?"

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,  # captures full paragraphs
    chunk_overlap=150
)
# Retrieved chunk: "Remote Work Policy. Employees
# may work remotely up to 3 days per week..."

Mistake 2: No Overlap Between Chunks
Without overlap, a sentence that falls on a chunk boundary gets split across two chunks. Neither chunk contains the complete sentence, so neither chunk will match a query about that sentence. Use 10-20% overlap as a baseline.
Mistake 3: Not Constraining the LLM to the Context
The single most common RAG failure is a prompt that does not explicitly tell the LLM to answer only from the provided context. Without this constraint, the model blends retrieved context with its training data, which can produce hallucinated details that sound authoritative.
prompt = """Answer the question.

Context: {context}
Question: {question}"""

prompt = """Answer ONLY using the context below.
If the context does not contain the answer,
say "I don't have that information."

Context: {context}
Question: {question}"""

Mistake 4: Using the Wrong Embedding Model for Queries vs Documents
The query and the documents must be embedded with the same model. If you embed your documents with text-embedding-3-small but your queries with text-embedding-ada-002, the vectors live in different spaces and similarity scores will be meaningless. LangChain prevents this by attaching the embedding model to the vector store, but if you build a custom pipeline, this is an easy mistake to make.
Mistake 5: Not Storing Metadata
Skipping metadata during ingestion seems harmless until you need to filter by document type, show source citations, or debug why a chunk was retrieved. Always attach at minimum the source filename and page/section number to every chunk.
What Would Be Different in Production?
The pipeline we built works for learning and prototyping. Production systems add several layers that are beyond this tutorial's scope but important to know about: retrieval evaluation on a test question set, reranking of retrieved chunks, caching, incremental re-indexing when source documents change, and access control over who can query which documents.
Exercise 3: A Mini Ingestion Pipeline
Write a function process_documents(docs, chunk_size, overlap) that takes a list of document dictionaries (each with "text" and "source" keys) and returns a list of chunk dictionaries.
Each chunk dictionary should have:
"text": the chunk text
"source": the original document source
"chunk_index": the chunk number within that document (starting at 0)
Use your chunking logic from Exercise 1: split each document's text into overlapping chunks of the specified size.
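One possible reference solution, with the chunking helper from Exercise 1 included so the sketch is self-contained:

```python
def chunk_document(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters,
    overlapping by `overlap` characters (Exercise 1 logic)."""
    if not text:
        return []
    chunks, start = [], 0
    while True:
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap
    return chunks

def process_documents(docs, chunk_size, overlap):
    result = []
    for doc in docs:
        for i, chunk in enumerate(chunk_document(doc["text"], chunk_size, overlap)):
            result.append({
                "text": chunk,
                "source": doc["source"],
                "chunk_index": i,  # position within this document
            })
    return result
```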
Frequently Asked Questions
How many chunks should I retrieve (what value of k)?
Start with k=3 or k=4. Retrieve too few and you might miss context. Retrieve too many and irrelevant chunks dilute the answer. The optimal value depends on your chunk size and the breadth of questions. A practical approach: try k=3, k=5, and k=10 on a set of test questions and compare answer quality.
Can I use open-source embeddings instead of OpenAI?
Yes. LangChain supports HuggingFace embeddings (HuggingFaceEmbeddings), Sentence Transformers, Cohere, and others. The all-MiniLM-L6-v2 model from Sentence Transformers is a popular free alternative that runs locally. Swap the embedding class in LangChain and everything else in the pipeline stays the same:
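A sketch of the swap, assuming the langchain-huggingface and sentence-transformers packages are installed (the model downloads locally on first use):

```python
from langchain_huggingface import HuggingFaceEmbeddings

# Free, local alternative to the OpenAI embedding model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Pass this to Chroma.from_documents exactly as before;
# the rest of the pipeline is unchanged
```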
How do I update documents after the vector store is created?
ChromaDB supports adding new documents with vectorstore.add_documents(new_chunks). For updates, delete the old chunks by their IDs and add the new versions. There is no built-in "update in place" — you delete and re-add. For production systems, track document hashes so you know which chunks to replace when a source document changes.
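A sketch of the update pattern, assuming a `vectorstore` from Stage 4; `new_chunks`, `stale_ids`, and `fresh_chunks` are hypothetical names for your own data:

```python
# Add brand-new chunks to the existing store
vectorstore.add_documents(new_chunks)

# "Update" = delete the old chunks by ID, then add the new versions
vectorstore.delete(ids=stale_ids)
vectorstore.add_documents(fresh_chunks)
```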
RAG vs fine-tuning — when should I use each?
Use RAG when your data changes frequently or you need to cite sources. Use fine-tuning when you need to change the model's behaviour, tone, or output format. Many teams use both: fine-tune the model for style and domain language, then use RAG for grounding it in current documents. As a starting point, RAG is almost always the right first step — it is cheaper, faster to implement, and easier to debug than fine-tuning.