Build RAG with LangChain: The Standard Production Approach
You ask an LLM about your company's refund policy and it confidently invents one. You paste a 200-page PDF into the prompt and blow past the context window. You split the PDF into chunks and the model ignores the relevant ones. These are the three problems that Retrieval-Augmented Generation solves, and LangChain is the framework most production teams reach for first.
What Is RAG and Why Does Every Production Team Need It?
RAG stands for Retrieval-Augmented Generation. Instead of asking an LLM to answer from its training data alone, you first retrieve the relevant documents from your own data, then augment the LLM's prompt with those documents, and finally the LLM generates an answer grounded in your actual content.
I think of it as giving the model an open-book exam. Without RAG, the model answers from memory — which means hallucinations about your specific data. With RAG, you hand it the exact pages it needs, and it synthesises an answer from those pages.
The RAG pipeline has five stages, and every production system follows this same structure:
1. Load — read raw documents from their sources
2. Split — break documents into retrievable chunks
3. Embed — convert each chunk into a vector
4. Store — index the vectors in a vector store
5. Retrieve and generate — find the chunks most relevant to a question and pass them to the LLM
LangChain is the most widely adopted framework for building these pipelines. It provides standardised interfaces for each stage — document loaders, text splitters, embedding models, vector stores, and retrieval chains — so you can swap components without rewriting your pipeline. Whether you use OpenAI or Anthropic, ChromaDB or Pinecone, the code structure stays the same.
Stage 1: Loading Documents into LangChain
Before the LLM can answer questions about your data, LangChain needs to read that data. LangChain provides over 100 document loaders — one for each source type. The loader reads the raw content and returns a list of Document objects, each containing the text and metadata (source file, page number, etc.).
Let's start with the simplest case: loading a text file. Then we'll add PDFs and web pages.
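A minimal sketch of the loading step, assuming the langchain-community package is installed and a local file named policies.txt exists (both the package layout and the filename are assumptions, not from the original):

```python
from langchain_community.document_loaders import TextLoader

# Load a plain-text file; returns a list containing one Document
loader = TextLoader("policies.txt", encoding="utf-8")
docs = loader.load()

print(docs[0].page_content[:200])  # first 200 characters of the text
print(docs[0].metadata)            # source information, e.g. the filename
```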
Each Document object has two fields: page_content (the actual text as a string) and metadata (a dictionary with source information). The metadata becomes important later when you want to show the user where the answer came from.
For PDFs, LangChain splits by page automatically. Each page becomes its own Document:
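A sketch of the PDF case, assuming the pypdf package is installed and a file named handbook.pdf exists (hypothetical name):

```python
from langchain_community.document_loaders import PyPDFLoader

# Each page of the PDF becomes its own Document,
# with the page number recorded in metadata
loader = PyPDFLoader("handbook.pdf")
pages = loader.load()

print(len(pages))         # number of pages loaded
print(pages[0].metadata)  # includes "source" and "page"
```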
In production, I almost always load from multiple sources. LangChain's DirectoryLoader handles an entire folder at once. You specify a glob pattern and the loader class for that file type:
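A sketch of bulk loading, assuming a docs/ folder of .txt files (the folder name and glob pattern are illustrative):

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file under ./docs, recursively
loader = DirectoryLoader(
    "docs/",
    glob="**/*.txt",
    loader_cls=TextLoader,
)
docs = loader.load()
print(f"Loaded {len(docs)} documents")
```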
Stage 2: Splitting Documents into Chunks
This is where most RAG pipelines silently fail. A 50-page document is too large to fit in a single embedding or a single prompt. You have to split it into smaller chunks. But split badly — in the middle of a sentence, or with chunks so small they lose context — and your retrieval will return irrelevant fragments.
LangChain's RecursiveCharacterTextSplitter is the workhorse here. It tries to split on paragraph boundaries first, then sentences, then words — falling back to smaller separators only when necessary. Two parameters control everything: chunk_size (maximum characters per chunk) and chunk_overlap (how many characters overlap between adjacent chunks).
The separators list defines the split hierarchy. The splitter tries \n\n (paragraph breaks) first. If a paragraph is still larger than chunk_size, it falls back to \n (line breaks), then . (sentences), then spaces, and finally individual characters as a last resort.
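A sketch of the splitter configuration described above, assuming the langchain-text-splitters package and a `docs` list from the loading stage:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    # Split hierarchy: paragraphs, then lines, then sentences,
    # then words, then individual characters as a last resort
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(docs)  # docs from the loading stage
```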
The chunk_overlap parameter is subtle but critical. When you split a document at position 1000, the next chunk starts at position 800 (with 200-character overlap). This means a sentence that was cut at the boundary still appears complete in at least one chunk. Without overlap, you lose context at every split point.
Let me show you what happens with different chunk sizes on the same document — this is the kind of experiment I run on every new RAG project before committing to a configuration:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0  # chunks are isolated islands
)
# Result: a sentence cut at position 500 is
# incomplete in BOTH chunks

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100  # 100 chars shared
)
# Result: a boundary sentence appears
# complete in at least one chunk

Stage 3: Embeddings — Converting Text to Vectors
Here is the core idea behind all of RAG retrieval: you convert text into a list of numbers (a vector) such that texts with similar meaning end up with similar vectors. "How do I return a product?" and "What is your refund policy?" have almost no words in common, but their embedding vectors will be very close — because they mean roughly the same thing.
This is fundamentally different from keyword search. A keyword search for "refund" would miss the first question entirely. An embedding-based search finds it instantly because the meaning is similar.
The text-embedding-3-small model converts any text into a 1536-dimensional vector. Each number in that vector captures some aspect of the text's meaning. You never interpret these numbers directly — you compare them using cosine similarity.
To understand why this works, let's embed three sentences and measure how similar they are to each other. Two are about refunds, one is about the weather — the embeddings should reflect that:
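A sketch of that experiment, assuming the langchain-openai package and an OPENAI_API_KEY in the environment (this calls the OpenAI API, so it will not run offline):

```python
import math
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

sentences = [
    "How do I return a product?",
    "What is your refund policy?",
    "Will it rain in Paris tomorrow?",
]
vectors = embeddings.embed_documents(sentences)

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print("return vs refund: ", cosine(vectors[0], vectors[1]))  # expect high
print("return vs weather:", cosine(vectors[0], vectors[2]))  # expect lower
```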
The return/refund pair will score much higher (typically 0.85+) than either compared to the weather question (typically 0.3-0.5). This is the foundation of RAG: when the user asks a question, embed the question, then find the document chunks whose vectors are closest to the question vector.
Stage 4: Storing Vectors in ChromaDB
You have chunks and you can embed them. Now you need somewhere to store those vectors so you can search them later. That is what a vector store does — it is a database optimised for similarity search over high-dimensional vectors.
ChromaDB is the most common choice for development and small-to-medium production systems. It runs locally (no server needed), stores data on disk, and handles the embedding step for you if you pass it an embedding model. For larger production deployments, teams typically move to Pinecone, Weaviate, or pgvector — but the LangChain interface stays the same.
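A sketch of the ingestion call, assuming the langchain-chroma and langchain-openai packages, a `chunks` list from the splitting stage, and an OPENAI_API_KEY in the environment:

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embeds every chunk, stores vectors + text + metadata, persists to disk
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
```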
That single call did a lot: it embedded every chunk using the OpenAI embedding model, stored the vectors alongside the original text and metadata in ChromaDB, and saved everything to disk. Next time you restart your program, you can reload the store without re-embedding:
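Reloading looks like this, assuming the same `embeddings` object and persist directory as at ingestion time:

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Reload the persisted store; nothing is re-embedded
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,  # must be the same model used at ingestion
)
```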
Now the critical part — searching. The similarity_search method embeds your query, finds the most similar chunk vectors, and returns the corresponding documents:
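A sketch, assuming a `vectorstore` built as above (the query string is illustrative):

```python
results = vectorstore.similarity_search(
    "How many remote days are allowed?",
    k=3,  # return the 3 most similar chunks
)
for doc in results:
    print(doc.metadata.get("source"), "->", doc.page_content[:80])
```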
The k=3 parameter means "return the 3 most similar chunks." In practice, I usually retrieve 3-5 chunks. Too few and you might miss relevant context. Too many and you flood the prompt with marginally relevant text, which dilutes the answer quality.
For more control, use similarity_search_with_score to see the actual scores. This helps you set a relevance threshold. Note that ChromaDB returns a distance by default, where lower means more similar, so chunks beyond a certain distance are probably not relevant:
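A sketch, again assuming a `vectorstore` built as in Stage 4:

```python
results = vectorstore.similarity_search_with_score("refund policy", k=5)
for doc, score in results:
    # With Chroma the score is a distance: lower means more similar,
    # so filtering means keeping results under a distance threshold
    print(round(score, 3), doc.page_content[:60])
```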
Stage 5: The Retrieval Chain — Asking Questions
We have documents in a vector store and we can search them. The final step connects retrieval to generation: search for relevant chunks, inject them into a prompt, and send that prompt to an LLM. LangChain calls this a retrieval chain.
I want to build this up in two steps. First, the manual way — so you see exactly what the chain is doing under the hood. Then the LangChain shorthand that does the same thing in three lines.
The Manual Approach
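A sketch of the retrieve-stuff-generate pattern, assuming a `vectorstore` from Stage 4 and the langchain-openai package (the model name gpt-4o-mini is just an example):

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

question = "How many days per week can employees work remotely?"

# 1. Retrieve: find the chunks most similar to the question
docs = vectorstore.similarity_search(question, k=3)
context = "\n\n".join(doc.page_content for doc in docs)

# 2. Augment: stuff the retrieved chunks into a prompt template
prompt = f"""Answer based ONLY on the following context.
If the context does not contain the answer, say so.

Context:
{context}

Question: {question}"""

# 3. Generate: the LLM answers from the provided context
answer = llm.invoke(prompt)
print(answer.content)
```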
That is the entire RAG pattern. Retrieve relevant context, stuff it into a prompt template, and let the LLM generate from that context. The instruction "Answer based ONLY on the following context" is what prevents hallucination — the model is told to work from the provided documents, not from its training data.
The LangChain Way — RetrievalQA Chain
LangChain wraps this retrieve-stuff-generate pattern into a reusable chain. The RetrievalQA chain does exactly what we did manually, but it is composable and configurable:
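A sketch of the chain version, under the same assumptions as the manual approach:

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    chain_type="stuff",  # put all retrieved chunks in one prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,  # return the chunks behind the answer
)

result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
for doc in result["source_documents"]:
    print("Source:", doc.metadata.get("source"))
```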
The chain_type="stuff" parameter means "stuff all retrieved chunks into a single prompt." This works for most cases. For very large retrievals where the combined chunks exceed the context window, LangChain also offers map_reduce (summarise each chunk separately, then combine) and refine (iteratively improve the answer with each chunk).
Setting return_source_documents=True is something I do on every RAG chain. It returns the actual chunks the answer was based on, so you can show citations to the user. This builds trust — the user can verify the answer against the original text.
Real-World Example: Q&A Over Technical Documentation
Let's put every stage together in one complete script. We'll build a Q&A system over a collection of technical documents — the kind of thing I've built at least a dozen times for internal company knowledge bases.
Since not everyone has PDF files ready, we will create sample documents from strings. The pipeline is identical whether the text comes from files or from strings — the Document objects are the same:
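A condensed sketch of the full pipeline, assuming the langchain, langchain-chroma, and langchain-openai packages and an OPENAI_API_KEY; the document contents and model names are illustrative:

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

# 1. Load: sample documents built from strings
raw_docs = [
    Document(
        page_content="Remote Work Policy. Employees may work remotely "
                     "up to 3 days per week with manager approval.",
        metadata={"source": "hr_policies.txt"},
    ),
    Document(
        page_content="Refund Policy. Customers may request a full refund "
                     "within 30 days of purchase.",
        metadata={"source": "support_policies.txt"},
    ),
]

# 2. Split into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
chunks = splitter.split_documents(raw_docs)

# 3 & 4. Embed and store
vectorstore = Chroma.from_documents(
    chunks, OpenAIEmbeddings(model="text-embedding-3-small")
)

# 5. Retrieve and generate
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
)

for question in [
    "How many days per week can employees work remotely?",
    "What is the refund window?",
    "What is the company's stock option plan?",  # not in the documents
]:
    result = qa.invoke({"query": question})
    print(question, "->", result["result"])
```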
Notice the last question: "What is the company's stock option plan?" Our documents say nothing about stock options. Because the prompt says "Answer using ONLY the provided context," the LLM should respond that it does not have that information. This is the hallucination guard in action — the prompt constrains the model to the retrieved documents.
Exercise 1: Implement Chunking with Overlap
Write a function chunk_document(text, chunk_size, overlap) that splits a text string into overlapping chunks. The function should:
1. Split the text into chunks of at most chunk_size characters
2. Each chunk (except the first) should overlap with the previous chunk by overlap characters
3. Return a list of chunk strings
For example, with chunk_size=10 and overlap=3, the text "abcdefghijklmno" should produce ["abcdefghij", "hijklmno"] (the second chunk starts 3 characters before the end of the first).
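One possible reference solution (try it yourself first; this is a sketch, not the only valid answer):

```python
def chunk_document(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters.
    Each chunk after the first starts `overlap` characters before
    the end of the previous chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # previous chunk already reached the end of the text
        start = end - overlap
    return chunks

print(chunk_document("abcdefghijklmno", 10, 3))
# ['abcdefghij', 'hijklmno']
```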
Tuning Retrieval — Getting the Right Chunks
The retriever is the most important component to get right. A perfect LLM fed irrelevant chunks will produce a useless answer. A mediocre LLM fed the right chunks will produce a great answer. This is why experienced RAG engineers spend 80% of their time on retrieval quality, not prompt engineering.
Maximum Marginal Relevance (MMR)
Standard similarity search has a problem: the top-k results often all say roughly the same thing. If your top 3 chunks are all from the same paragraph (just different overlapping windows), you are wasting 2 of your 3 slots. MMR fixes this by balancing relevance and diversity — each selected chunk must be relevant to the query but also different from chunks already selected.
The lambda_mult parameter controls the relevance-diversity tradeoff. At 1.0, MMR behaves like standard similarity search. At 0.0, it maximises diversity (which means some results may be less relevant). I find 0.5-0.7 works well for most use cases.
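A sketch of an MMR retriever, assuming a `vectorstore` from Stage 4:

```python
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,            # chunks to return
        "fetch_k": 20,     # candidates to consider before diversifying
        "lambda_mult": 0.6 # 1.0 = pure relevance, 0.0 = pure diversity
    },
)
docs = retriever.invoke("What is the remote work policy?")
```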
Metadata Filtering
When your vector store contains documents from multiple sources — say, HR policies and engineering docs — you often want to restrict search to a specific subset. ChromaDB supports metadata filtering so you can search only within relevant documents:
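A sketch, assuming chunks were ingested with a "source" key in their metadata (the filter value is illustrative):

```python
# Restrict the search to chunks whose metadata matches the filter
results = vectorstore.similarity_search(
    "What is the parental leave policy?",
    k=3,
    filter={"source": "hr_policies.txt"},  # Chroma where-style filter
)
```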
Exercise 2: Cosine Similarity Retrieval
Implement a find_most_similar(query_vec, doc_vectors, doc_names, top_k) function that finds the most similar documents to a query using cosine similarity.
Cosine similarity between vectors A and B is: dot(A, B) / (norm(A) * norm(B))
The function should return a list of (name, score) tuples sorted by score (highest first), limited to top_k results.
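One possible reference solution, using only the standard library:

```python
import math

def cosine_similarity(a, b):
    # dot(A, B) / (norm(A) * norm(B))
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def find_most_similar(query_vec, doc_vectors, doc_names, top_k):
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in zip(doc_names, doc_vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 2-D example: doc_a points the same way as the query
print(find_most_similar(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    ["doc_a", "doc_b", "doc_c"],
    top_k=2,
))
```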
Common Mistakes and How to Fix Them
Mistake 1: Chunks Too Small or Too Large
Chunks that are too small (50-100 characters) lose context. The embedding captures the meaning of a sentence fragment, but when that fragment is retrieved, it does not provide enough information for the LLM to generate a useful answer. Chunks that are too large (5000+ characters) create noisy embeddings that match too many queries.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # fragments lose meaning
    chunk_overlap=20
)
# Retrieved chunk: "up to 3 days per week"
# LLM: "3 days per week... of what?"

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,  # captures full paragraphs
    chunk_overlap=150
)
# Retrieved chunk: "Remote Work Policy. Employees
# may work remotely up to 3 days per week..."

Mistake 2: No Overlap Between Chunks
Without overlap, a sentence that falls on a chunk boundary gets split across two chunks. Neither chunk contains the complete sentence, so neither chunk will match a query about that sentence. Use 10-20% overlap as a baseline.
Mistake 3: Not Constraining the LLM to the Context
The single most common RAG failure is a prompt that does not explicitly tell the LLM to answer only from the provided context. Without this constraint, the model blends retrieved context with its training data, which can produce hallucinated details that sound authoritative.
prompt = """Answer the question.

Context: {context}
Question: {question}"""

prompt = """Answer ONLY using the context below.
If the context does not contain the answer,
say "I don't have that information."

Context: {context}
Question: {question}"""

Mistake 4: Using the Wrong Embedding Model for Queries vs Documents
The query and the documents must be embedded with the same model. If you embed your documents with text-embedding-3-small but your queries with text-embedding-ada-002, the vectors live in different spaces and similarity scores will be meaningless. LangChain prevents this by attaching the embedding model to the vector store, but if you build a custom pipeline, this is an easy mistake to make.
Mistake 5: Not Storing Metadata
Skipping metadata during ingestion seems harmless until you need to filter by document type, show source citations, or debug why a chunk was retrieved. Always attach at minimum the source filename and page/section number to every chunk.
What Would Be Different in Production?
The pipeline we built works for learning and prototyping. Production systems add several layers that are beyond this tutorial's scope but important to know about: retrieval evaluation on a test question set, reranking of retrieved chunks, caching, incremental re-indexing when source documents change, and access control over who can query which documents.
Exercise 3: A Mini Ingestion Pipeline
Write a function process_documents(docs, chunk_size, overlap) that takes a list of document dictionaries (each with "text" and "source" keys) and returns a list of chunk dictionaries.
Each chunk dictionary should have:
"text": the chunk text
"source": the original document source
"chunk_index": the chunk number within that document (starting at 0)
Use your chunking logic from Exercise 1: split each document's text into overlapping chunks of the specified size.
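One possible reference solution, with the chunking helper from Exercise 1 included so the sketch is self-contained:

```python
def chunk_document(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters,
    overlapping by `overlap` characters (Exercise 1 logic)."""
    if not text:
        return []
    chunks, start = [], 0
    while True:
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap
    return chunks

def process_documents(docs, chunk_size, overlap):
    result = []
    for doc in docs:
        for i, chunk in enumerate(chunk_document(doc["text"], chunk_size, overlap)):
            result.append({
                "text": chunk,
                "source": doc["source"],
                "chunk_index": i,  # position within this document
            })
    return result
```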
Frequently Asked Questions
How many chunks should I retrieve (what value of k)?
Start with k=3 or k=4. Retrieve too few and you might miss context. Retrieve too many and irrelevant chunks dilute the answer. The optimal value depends on your chunk size and the breadth of questions. A practical approach: try k=3, k=5, and k=10 on a set of test questions and compare answer quality.
Can I use open-source embeddings instead of OpenAI?
Yes. LangChain supports HuggingFace embeddings (HuggingFaceEmbeddings), Sentence Transformers, Cohere, and others. The all-MiniLM-L6-v2 model from Sentence Transformers is a popular free alternative that runs locally. Swap the embedding class in LangChain and everything else in the pipeline stays the same:
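A sketch of the swap, assuming the langchain-huggingface and sentence-transformers packages are installed (the model downloads locally on first use):

```python
from langchain_huggingface import HuggingFaceEmbeddings

# Free, local alternative to the OpenAI embedding model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Pass this to Chroma.from_documents exactly as before;
# the rest of the pipeline is unchanged
```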
How do I update documents after the vector store is created?
ChromaDB supports adding new documents with vectorstore.add_documents(new_chunks). For updates, delete the old chunks by their IDs and add the new versions. There is no built-in "update in place" — you delete and re-add. For production systems, track document hashes so you know which chunks to replace when a source document changes.
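A sketch of the update pattern, assuming a `vectorstore` from Stage 4; `new_chunks`, `stale_ids`, and `fresh_chunks` are hypothetical names for your own data:

```python
# Add brand-new chunks to the existing store
vectorstore.add_documents(new_chunks)

# "Update" = delete the old chunks by ID, then add the new versions
vectorstore.delete(ids=stale_ids)
vectorstore.add_documents(fresh_chunks)
```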
RAG vs fine-tuning — when should I use each?
Use RAG when your data changes frequently or you need to cite sources. Use fine-tuning when you need to change the model's behaviour, tone, or output format. Many teams use both: fine-tune the model for style and domain language, then use RAG for grounding it in current documents. As a starting point, RAG is almost always the right first step — it is cheaper, faster to implement, and easier to debug than fine-tuning.