
RAG Explained: Why LLMs Need External Knowledge and How Retrieval Fixes It

Beginner · 45 min · 3 exercises · 60 XP

Ask an LLM about your company's Q3 revenue and you'll get a confident answer that is completely made up. The model has never seen your internal documents, yet it won't say "I don't know." Retrieval-Augmented Generation (RAG) is the standard fix: find the right documents before the model answers, so it cites facts instead of fabricating them.

The Knowledge Problem — Why LLMs Make Things Up

Here's a question that trips up every LLM: "What were Acme Corp's sales in Q3 2025?" The model has never seen Acme Corp's financials. But instead of admitting that, it generates a paragraph with plausible-sounding numbers and trend narratives — all fabricated.

This behaviour is called hallucination, and it isn't a bug that will get patched. An LLM predicts the most probable next token given everything before it. It has no internal fact database and no way to flag "I wasn't trained on this." When the prompt expects facts that aren't in the model's weights, it generates statistically plausible text anyway.

Three situations make this especially dangerous:

  • Private data — company docs, internal wikis, customer records the model was never trained on.
  • Recent information — anything after the model's training cutoff.
  • Niche domain knowledge — specialised technical or regulatory content covered thinly in training data.

So we have a powerful text generator that can reason, summarise, and write code. But it can't reliably answer questions about your data. We need to feed the relevant facts into the model's context at query time — and that is exactly what RAG does.

    What Is RAG? The Core Idea in 60 Seconds

    The idea is embarrassingly simple. Before asking the LLM a question, search your own documents for the most relevant passages and paste them into the prompt. The model now has the facts it needs and can answer grounded in real data instead of guessing.

    A RAG pipeline has five stages:

  • Chunk — split documents into small pieces (paragraphs, sections, or fixed-size windows).
  • Embed — convert each chunk into a numerical vector that captures its meaning.
  • Store — save those vectors in a searchable index.
  • Retrieve — embed the user's question, then find the chunks whose vectors are closest.
  • Generate — paste the retrieved chunks into the prompt and let the LLM answer using those facts.

    I think of it as an open-book exam. Without RAG, the student answers from memory — unreliable. With RAG, you hand them the textbook pages relevant to each question. They still need to read and reason, but now they're working from real sources.

    Embeddings — Turning Text into Searchable Vectors

    Before we can search documents by meaning, we need a way to measure how similar two pieces of text are. "River bank" and "bank account" share the word "bank" but should land far apart; "dog" and "puppy" share no letters but should be close together.

    An embedding is a list of numbers (a vector) that represents the meaning of text. Embedding models are trained so that texts with similar meanings produce similar vectors. The key insight: comparing meanings becomes comparing vectors mathematically.

    What embeddings look like
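A sketch of the idea with hand-made 4-dimensional vectors. The numbers are invented for illustration; a real embedding model produces vectors with hundreds of dimensions.

```python
# Hand-made 4-dim "embeddings" — invented numbers, purely illustrative.
# A real embedding model maps text to vectors with hundreds of dimensions.
embeddings = {
    "The cat sat on the mat":    [0.8, 0.6, 0.1, 0.0],
    "A kitten dozed on the rug": [0.7, 0.7, 0.2, 0.1],
    "Stock prices fell sharply": [0.1, 0.0, 0.9, 0.8],
}
for text, vec in embeddings.items():
    print(f"{text:<28} -> {vec}")
```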

    The cat sentences have vectors pointing in roughly the same direction. The stock sentence points somewhere completely different. Similar meanings produce similar directions — that is how embedding models encode semantics.

    Cosine Similarity — Measuring How Close Two Vectors Are

    Cosine similarity measures whether two vectors point in the same direction. It ranges from -1 (opposite) through 0 (unrelated) to 1 (identical direction). For text embeddings, values between 0 and 1 are typical.

    The formula: take the dot product of two vectors and divide by the product of their lengths. Here it is from scratch:

    Cosine similarity from scratch
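A from-scratch implementation, applied to the same hand-made vectors as above (the vectors themselves are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat    = [0.8, 0.6, 0.1, 0.0]
kitten = [0.7, 0.7, 0.2, 0.1]
stock  = [0.1, 0.0, 0.9, 0.8]

print(cosine_similarity(cat, kitten))  # close to 1.0
print(cosine_similarity(cat, stock))   # much lower
```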

    The cat-kitten pair scores close to 1.0, and each of them scores much lower against the stock sentence. High similarity means "these texts discuss the same topic" — exactly the behaviour we need for document search.


    Build a Cosine Similarity Retriever
    Write Code

    Write a function called find_most_similar that takes three arguments:

    1. query_vec — a list of floats (the query embedding)

    2. doc_vecs — a list of lists of floats (document embeddings)

    3. top_k — an integer for how many results to return (default 1)

    The function should compute cosine similarity between query_vec and each vector in doc_vecs, then return a list of tuples (index, score) sorted by score descending, limited to top_k results.

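Try it yourself first. Here is one possible solution (the `cosine_similarity` helper is included so the snippet runs on its own):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def find_most_similar(query_vec, doc_vecs, top_k=1):
    # Score every document vector, sort descending, keep the top_k.
    scores = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

print(find_most_similar([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [0.9, 0.9]], top_k=2))
```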

    Building a Tiny RAG Pipeline from Scratch

    Enough theory — let's build a working RAG system in pure Python. No external libraries, no API calls, no vector database. The goal is to see every moving part clearly, so you understand what frameworks do under the hood.

    We'll work with a small knowledge base of five solar system facts. When the user asks a question, our pipeline finds the most relevant fact and presents it.

    Step 1 — The Knowledge Base

    Real embedding models produce vectors with hundreds of dimensions. For clarity, we'll use hand-crafted 5-dimensional vectors that loosely capture each fact's topic.

    Step 1: Knowledge base with hand-crafted embeddings
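For instance — the facts are real, but the vectors and the dimension meanings are invented for this toy example:

```python
# Hand-crafted 5-dim embeddings. Dimensions, loosely:
# [temperature, size, moons, distance, surface]. Illustrative only —
# a real embedding model would produce these vectors for you.
DOCS = [
    "The Sun's surface temperature is about 5,500 degrees Celsius.",
    "Jupiter is the largest planet in the solar system.",
    "Saturn has more known moons than any other planet.",
    "Mars looks red because of iron oxide on its surface.",
    "Neptune is the planet farthest from the Sun.",
]
DOC_VECS = [
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.2, 0.0, 0.1],
    [0.0, 0.3, 0.9, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.2, 0.9],
    [0.0, 0.2, 0.1, 0.9, 0.0],
]
print(len(DOCS), "documents in the knowledge base")
```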

    Step 2 — The Retriever

    The retriever compares a query vector against every document and returns the top matches. Production systems use optimised indices (FAISS, Pinecone) for millions of documents. For five, a simple loop works perfectly.

    Step 2: Retriever function
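A minimal retriever over the hand-crafted vectors from Step 1 (repeated here so the snippet runs on its own; the dimension meanings are the invented ones from that step):

```python
import math

# Hand-crafted 5-dim embeddings; dims, loosely:
# [temperature, size, moons, distance, surface].
DOCS = [
    "The Sun's surface temperature is about 5,500 degrees Celsius.",
    "Jupiter is the largest planet in the solar system.",
    "Saturn has more known moons than any other planet.",
    "Mars looks red because of iron oxide on its surface.",
    "Neptune is the planet farthest from the Sun.",
]
DOC_VECS = [
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.2, 0.0, 0.1],
    [0.0, 0.3, 0.9, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.2, 0.9],
    [0.0, 0.2, 0.1, 0.9, 0.0],
]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, doc_vecs, top_k=2):
    # Score every document, then keep the top_k best matches.
    scores = [(i, cosine_similarity(query_vec, v))
              for i, v in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# "Which planet is the biggest?" — a query vector activating the size dimension
query_vec = [0.0, 0.9, 0.1, 0.0, 0.0]
for i, score in retrieve(query_vec, DOC_VECS):
    print(f"{score:.2f}  {DOCS[i]}")
```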

    Jupiter surfaces as the top result because its embedding strongly activates the "size" dimension. In a real system, the embedding model handles this mapping automatically — you pass in query text and get a vector back.

    Step 3 — Augmenting the Prompt

    This is the step that makes RAG work. We inject the retrieved documents into the prompt as context, so the LLM answers using those facts instead of its training data.

    Step 3: Building the augmented prompt
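A minimal version of this step, with a fallback instruction baked in. The exact wording is a design choice, not a standard:

```python
def build_prompt(question, retrieved_docs):
    # The fallback instruction stops the model from answering out of its
    # training data when the context doesn't cover the question.
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "Answer using ONLY the context below. "
        'If the context does not contain the answer, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(build_prompt(
    "Which planet is the biggest?",
    ["Jupiter is the largest planet in the solar system."],
))
```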

    Step 4 — The Complete Pipeline

    Let's wire everything together and test with different queries:

    The complete pipeline end-to-end
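A self-contained sketch of the whole loop. Queries are hand-embedded here (in a real system an embedding model does this), and we stop at the retrieval result — the augmented prompt would then go to an LLM:

```python
import math

DOCS = [
    "The Sun's surface temperature is about 5,500 degrees Celsius.",
    "Jupiter is the largest planet in the solar system.",
    "Saturn has more known moons than any other planet.",
    "Mars looks red because of iron oxide on its surface.",
    "Neptune is the planet farthest from the Sun.",
]
DOC_VECS = [
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.2, 0.0, 0.1],
    [0.0, 0.3, 0.9, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.2, 0.9],
    [0.0, 0.2, 0.1, 0.9, 0.0],
]

# Hand-embedded queries; a real system would call an embedding model here.
QUERY_VECS = {
    "How hot is the Sun?":              [0.9, 0.0, 0.0, 0.0, 0.1],
    "Which planet has the most moons?": [0.0, 0.1, 0.9, 0.0, 0.0],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def answer(question):
    qvec = QUERY_VECS[question]
    best = max(range(len(DOCS)),
               key=lambda i: cosine_similarity(qvec, DOC_VECS[i]))
    # Retrieval done — an augmented prompt built from DOCS[best]
    # would now be sent to the LLM.
    return DOCS[best]

for q in QUERY_VECS:
    print(q, "->", answer(q))
```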

    Each query pulls different documents. The temperature question surfaces the Sun fact; the moons question surfaces Saturn. The right context for the right question, every time — that is the essence of RAG.


    Build a RAG Prompt with Source Citations
    Write Code

    Write a function called build_cited_prompt that takes two arguments:

    1. question — a string with the user's question

    2. sources — a list of dictionaries, each with keys "text" and "title"

    The function should return a string with:

  • Line 1: "Answer the question using ONLY the sources below. Cite sources by number."
  • Line 2: empty
  • Then each source numbered like "[1] Title: text"
  • An empty line
  • "Question: " + the question
  • "Answer:"
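Attempt it first. One possible solution:

```python
def build_cited_prompt(question, sources):
    lines = [
        "Answer the question using ONLY the sources below. Cite sources by number.",
        "",
    ]
    # Number each source starting at [1], formatted as "[n] Title: text".
    for i, src in enumerate(sources, start=1):
        lines.append(f"[{i}] {src['title']}: {src['text']}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)

print(build_cited_prompt(
    "What is the largest planet?",
    [{"title": "Planets", "text": "Jupiter is the largest planet."}],
))
```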

    Scaling Up — Vector Search with NumPy

    Hand-crafted embeddings helped us see the mechanics. Real systems use vectors with hundreds of dimensions and search across thousands of documents. NumPy handles this efficiently. Let's simulate realistic embeddings to demonstrate search at a larger scale.

    Simulating realistic embeddings with NumPy
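One way to fake such embeddings. The 128-dimension size, the random seed, and the 3.0 signal strength are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(42)
n_docs, dim = 10, 128

# Random base vectors stand in for real embeddings.
doc_vecs = rng.normal(size=(n_docs, dim))

# Give the "ML-related" documents (indices 1, 3, 5, 7, 9) a shared topic
# direction, mimicking how an embedding model clusters related text.
ml_topic = rng.normal(size=dim)
ml_ids = [1, 3, 5, 7, 9]
doc_vecs[ml_ids] += 3.0 * ml_topic

# Normalise each row so cosine similarity reduces to a dot product.
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
print(doc_vecs.shape)
```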

    The ML-related documents (indices 1, 3, 5, 7, 9) share a common signal in their embeddings. This mimics how a real embedding model would place topically related documents near each other in vector space.

    Vector search across all documents
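Searching those simulated vectors with a single matrix-vector product (the data setup repeats the previous snippet so this one runs on its own):

```python
import numpy as np

# Same simulated embeddings as before: 10 docs, ML docs at 1, 3, 5, 7, 9.
rng = np.random.default_rng(42)
n_docs, dim = 10, 128
doc_vecs = rng.normal(size=(n_docs, dim))
ml_topic = rng.normal(size=dim)
ml_ids = [1, 3, 5, 7, 9]
doc_vecs[ml_ids] += 3.0 * ml_topic
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

# A query that is "about ML": the shared topic direction plus noise.
query = ml_topic + rng.normal(size=dim)
query /= np.linalg.norm(query)

# Cosine similarity against every document at once.
scores = doc_vecs @ query
ranking = np.argsort(scores)[::-1]
print(ranking[:5])  # the ML docs should dominate the top of the ranking
```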

    ML-related documents cluster near the top because they share embedding signal. Real embedding models learn these relationships from training data. The key point: vector search finds semantically related content even when exact words don't match.

    RAG vs Fine-Tuning vs Long Context

    RAG isn't the only way to get custom knowledge into an LLM. Two alternatives come up constantly: fine-tuning the model and stuffing everything into a long context window. Each solves a different problem, and I've seen teams waste months picking the wrong one.

    Choosing between RAG, fine-tuning, and long context
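A condensed comparison. The entries are rough characterisations of the trade-offs discussed in this section, not benchmark numbers:

```python
# Rough trade-off summary — characterisations, not measurements.
rows = [
    ("",           "RAG",              "Fine-tuning",       "Long context"),
    ("setup",      "cheap, minutes",   "GPU compute, days", "none"),
    ("updates",    "instant",          "full retrain",      "re-paste the docs"),
    ("best for",   "factual Q&A",      "tone and style",    "one-off analysis"),
    ("fails when", "retriever misses", "facts change",      "docs exceed the window"),
]
for row in rows:
    print("{:<11}{:<19}{:<20}{}".format(*row))
```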

    My rule of thumb: start with RAG. It's cheapest to set up, easiest to update, and handles the most common use case well. Move to fine-tuning only when you need the model to adopt a specific tone that prompt engineering can't achieve. Use long context for one-off analysis of a small document set.

    Fine-tuning (wrong tool for factual Q&A)
    # Train a custom model on company docs
    # Cost: $500+ in GPU compute
    # Time: 2-3 days
    # Result: model "memorises" facts poorly
    #   Cannot update without retraining

    # fine_tune(base_model, company_docs)
    # >> Still hallucinates on recent data

    RAG (right tool for factual Q&A)
    # Index company docs as embeddings
    # Cost: ~$0.01 for 1000 pages
    # Time: minutes
    # Result: always retrieves current facts
    #   Updates instantly when docs change

    # rag_query("Q3 revenue?", doc_index)
    # >> Cites the actual Q3 report

    What a Production RAG System Looks Like

    Our toy pipeline captured the core logic. Production systems add a few more layers to handle real-world complexity:

    The production RAG stack
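As an outline — the stages are typical of production stacks, and the named tools are common examples rather than requirements:

```python
# Illustrative outline of a production RAG stack.
PRODUCTION_STACK = [
    ("Ingestion",  "load PDFs, HTML, and wikis; clean and normalise the text"),
    ("Chunking",   "split into 200-500 token chunks with overlap"),
    ("Embedding",  "batch-embed every chunk with an embedding model"),
    ("Indexing",   "store vectors in an index (e.g. HNSW in pgvector or Pinecone)"),
    ("Retrieval",  "embed the query and search, often hybrid with keywords"),
    ("Reranking",  "re-score the top candidates for better precision"),
    ("Generation", "augmented prompt with a fallback instruction, then the LLM"),
    ("Monitoring", "log retrievals and answers to catch failures"),
]
for stage, role in PRODUCTION_STACK:
    print(f"{stage:<11} {role}")
```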

    Chunking — The Most Underestimated Step

    How you split documents matters more than most people realise. I've seen RAG quality jump 20-30% from chunking changes alone, with zero model changes. Chunk too large and you waste context on irrelevant text. Chunk too small and you lose surrounding context.

    Text chunking with overlap
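A sketch that uses words as a rough stand-in for tokens; the 300/50 defaults are common starting points, not magic numbers:

```python
def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into chunks of ~chunk_size words, with the last
    `overlap` words of each chunk repeated at the start of the next."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already reaches the end of the text
    return chunks

# Tiny demo: 12 "words", chunks of 5 with an overlap of 2
demo = " ".join(f"w{i}" for i in range(12))
for c in chunk_text(demo, chunk_size=5, overlap=2):
    print(c)
```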

    The overlap parameter ensures sentences at chunk boundaries appear in at least one chunk. Without overlap, a key fact sitting right at a boundary might be split in half and lost to retrieval.


    Implement a Simple Text Chunker
    Write Code

    Write a function called simple_chunker that takes:

    1. text — a string to split

    2. max_chars — maximum characters per chunk (default 100)

    The function should split text into chunks of at most max_chars characters, breaking only at space boundaries (never in the middle of a word). Return a list of strings.

    Rules:

  • If a word would push a chunk past max_chars, start a new chunk.
  • Strip leading/trailing whitespace from each chunk.
  • Do not return empty chunks.
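Give it a go before reading on. One possible solution:

```python
def simple_chunker(text, max_chars=100):
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Word would push the chunk past max_chars: close the current
            # chunk and start a new one. (A single word longer than
            # max_chars ends up as its own oversized chunk.)
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

print(simple_chunker("aaa bbb ccc", max_chars=7))
```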

    Common RAG Mistakes and How to Avoid Them

    After reviewing dozens of RAG implementations, the same mistakes keep coming back. Here are the ones that cause real damage:

    Mistake 1: Chunks that are too large. When chunks exceed 2000 tokens, retrieval becomes imprecise. The embedding captures the average topic of the chunk, not any specific fact. The model then has to sift through noise to find the actual answer.

    Chunks too large (imprecise)
    # Entire page as one chunk
    chunk = full_page  # 5000 tokens
    # Embedding captures average topic
    # Retrieval returns vaguely related text
    # Model drowns in irrelevant context

    Right-sized chunks (precise)
    # 200-500 token chunks with overlap
    chunks = chunk_text(page, 300, 50)
    # Each embedding captures one topic
    # Retrieval is precise
    # Model gets focused context

    Mistake 2: No fallback instruction in the prompt. Without telling the model to say "I don't know," it generates from training data when retrieved context doesn't cover the question. This is the exact problem RAG was supposed to solve.

    Mistake 3: Blaming the LLM when retrieval failed. In my experience, 80% of bad RAG answers trace back to the retriever. The right document never made it into the prompt. Always check retrieval quality before debugging generation.

    Frequently Asked Questions

    How much data can a RAG system handle?

    Vector databases like Pinecone and Weaviate scale to billions of vectors. The bottleneck is usually the initial embedding step, not search. A million documents might take a few hours to embed; searching takes milliseconds.

    Do I need a vector database?

    For under 100,000 documents, NumPy or FAISS in memory works fine. For larger datasets, a dedicated vector database adds indexing structures (HNSW, IVF) that make search sublinear. PostgreSQL with pgvector is a solid middle ground.

    Can I combine keyword search and vector search?

    Yes — this is called hybrid search. Vector search finds semantically similar content ("automobile" matches "car"). Keyword search catches exact terms vector search might rank lower: product IDs, acronyms, proper nouns. Most production systems use both.

    Does RAG completely eliminate hallucination?

    No. RAG reduces it significantly, but the model can still hallucinate in two ways: (1) the retriever misses the right chunk, so the model falls back to training data, and (2) the model extends beyond what the context says. The fallback instruction and careful prompt engineering mitigate both.
