RAG Explained: Why LLMs Need External Knowledge and How Retrieval Fixes It
Ask an LLM about your company's Q3 revenue and you'll get a confident answer that is completely made up. The model has never seen your internal documents, yet it won't say "I don't know." Retrieval-Augmented Generation (RAG) is the standard fix: find the right documents before the model answers, so it cites facts instead of fabricating them.
The Knowledge Problem — Why LLMs Make Things Up
Here's a question that trips up every LLM: "What were Acme Corp's sales in Q3 2025?" The model has never seen Acme Corp's financials. But instead of admitting that, it generates a paragraph with plausible-sounding numbers and trend narratives — all fabricated.
This behaviour is called hallucination, and it isn't a bug that will get patched. An LLM predicts the most probable next token given everything before it. It has no internal fact database and no way to flag "I wasn't trained on this." When the prompt expects facts that aren't in the model's weights, it generates statistically plausible text anyway.
Three situations make this especially dangerous: questions about private data the model never saw (internal documents, customer records), questions about events after the model's training cutoff, and questions where the fabricated answer sounds specific enough that readers don't think to verify it.
So we have a powerful text generator that can reason, summarise, and write code. But it can't reliably answer questions about your data. We need to feed the relevant facts into the model's context at query time — and that is exactly what RAG does.
What Is RAG? The Core Idea in 60 Seconds
The idea is embarrassingly simple. Before asking the LLM a question, search your own documents for the most relevant passages and paste them into the prompt. The model now has the facts it needs and can answer grounded in real data instead of guessing.
A RAG pipeline has five stages: chunk the documents into passages, embed each chunk into a vector, index the vectors for search, retrieve the chunks most relevant to the query, and generate an answer from a prompt augmented with those chunks.
I think of it as an open-book exam. Without RAG, the student answers from memory — unreliable. With RAG, you hand them the textbook pages relevant to each question. They still need to read and reason, but now they're working from real sources.
Embeddings — Turning Text into Searchable Vectors
Before we can search documents by meaning, we need a way to measure how similar two pieces of text are. The word "bank" in "river bank" and "bank account" should be far apart. "Dog" and "puppy" should be close together.
An embedding is a list of numbers (a vector) that represents the meaning of text. Embedding models are trained so that texts with similar meanings produce similar vectors. The key insight: comparing meanings becomes comparing vectors mathematically.
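As a toy illustration, here are three sentences with hand-made 4-dimensional "embeddings" (a real model such as a sentence-transformer would produce hundreds of dimensions, and the numbers below are assumptions chosen purely to make the geometry visible):

```python
# Hand-made toy vectors — NOT real model output.
# Two cat sentences get similar vectors; the finance sentence does not.
sentences = {
    "The cat sat on the mat":     [0.8, 0.6, 0.1, 0.0],
    "A kitten rested on the rug": [0.7, 0.7, 0.2, 0.1],
    "Stock prices fell sharply":  [0.0, 0.1, 0.9, 0.8],
}

for text, vec in sentences.items():
    print(f"{text!r:35} -> {vec}")
```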
The cat sentences have vectors pointing in roughly the same direction. The stock sentence points somewhere completely different. Similar meanings produce similar directions — that is how embedding models encode semantics.
Cosine Similarity — Measuring How Close Two Vectors Are
Cosine similarity measures whether two vectors point in the same direction. It ranges from -1 (opposite) through 0 (unrelated) to 1 (identical direction). For text embeddings, values between 0 and 1 are typical.
The formula: take the dot product of two vectors and divide by the product of their lengths. Here it is from scratch:
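A minimal from-scratch implementation, tested against the toy cat/kitten/stock vectors (hand-made for this example, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors, hand-made for illustration.
cat    = [0.8, 0.6, 0.1, 0.0]
kitten = [0.7, 0.7, 0.2, 0.1]
stock  = [0.0, 0.1, 0.9, 0.8]

print(round(cosine_similarity(cat, kitten), 3))  # ≈ 0.98
print(round(cosine_similarity(cat, stock), 3))   # ≈ 0.12
```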
The cat-kitten pair scores close to 1.0. Both cat/kitten vs. the stock sentence score much lower. High similarity means "these texts discuss the same topic" — exactly the behaviour we need for document search.
Write a function called find_most_similar that takes three arguments:
1. query_vec — a list of floats (the query embedding)
2. doc_vecs — a list of lists of floats (document embeddings)
3. top_k — an integer for how many results to return (default 1)
The function should compute cosine similarity between query_vec and each vector in doc_vecs, then return a list of tuples (index, score) sorted by score descending, limited to top_k results.
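One possible solution, sketched here as a self-contained snippet (the two-dimensional test vectors are illustrative):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def find_most_similar(query_vec, doc_vecs, top_k=1):
    # Score every document, then sort by similarity, highest first.
    scores = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(find_most_similar([0.9, 0.1], docs, top_k=2))
# Document 0 points almost the same way as the query, so it ranks first.
```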
Building a Tiny RAG Pipeline from Scratch
Enough theory — let's build a working RAG system in pure Python. No external libraries, no API calls, no vector database. The goal is to see every moving part clearly, so you understand what frameworks do under the hood.
We'll work with a small knowledge base of five solar system facts. When the user asks a question, our pipeline finds the most relevant fact and presents it.
Step 1 — The Knowledge Base
Real embedding models produce vectors with hundreds of dimensions. For clarity, we'll use hand-crafted 5-dimensional vectors that loosely capture each fact's topic.
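A sketch of such a knowledge base — the facts are real, but the vectors and the five topic dimensions are hand-made assumptions for this example:

```python
# Hand-crafted 5-dimensional "embeddings". The dimensions loosely mean:
# [size, temperature, moons, orbital period, distance].
# A real embedding model would produce these vectors automatically.
KNOWLEDGE_BASE = [
    ("Jupiter is the largest planet in the solar system.",
     [0.9, 0.1, 0.2, 0.1, 0.1]),
    ("The Sun's surface temperature is about 5,500 degrees Celsius.",
     [0.1, 0.9, 0.0, 0.0, 0.1]),
    ("Saturn has the most confirmed moons of any planet.",
     [0.3, 0.0, 0.9, 0.1, 0.2]),
    ("Mars takes about 687 Earth days to orbit the Sun.",
     [0.2, 0.1, 0.0, 0.9, 0.2]),
    ("Neptune is the farthest planet from the Sun.",
     [0.2, 0.0, 0.1, 0.2, 0.9]),
]
```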
Step 2 — The Retriever
The retriever compares a query vector against every document and returns the top matches. Production systems use optimised indices (FAISS, Pinecone) for millions of documents. For five, a simple loop works perfectly.
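A sketch of the retriever, assuming the five-fact knowledge base from Step 1 (repeated here in condensed form so the snippet runs on its own):

```python
import math

DOCS = [
    ("Jupiter is the largest planet in the solar system.",            [0.9, 0.1, 0.2, 0.1, 0.1]),
    ("The Sun's surface temperature is about 5,500 degrees Celsius.", [0.1, 0.9, 0.0, 0.0, 0.1]),
    ("Saturn has the most confirmed moons of any planet.",            [0.3, 0.0, 0.9, 0.1, 0.2]),
    ("Mars takes about 687 Earth days to orbit the Sun.",             [0.2, 0.1, 0.0, 0.9, 0.2]),
    ("Neptune is the farthest planet from the Sun.",                  [0.2, 0.0, 0.1, 0.2, 0.9]),
]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, docs, top_k=2):
    # Score every document against the query, best matches first.
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    scored.sort(reverse=True)
    return scored[:top_k]

# "Which planet is the biggest?" — hand-encoded to activate the size dimension.
query_vec = [0.9, 0.0, 0.0, 0.1, 0.1]
for score, text in retrieve(query_vec, DOCS):
    print(f"{score:.3f}  {text}")
```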
Jupiter surfaces as the top result because its embedding strongly activates the "size" dimension. In a real system, the embedding model handles this mapping automatically — you pass in query text and get a vector back.
Step 3 — Augmenting the Prompt
This is the step that makes RAG work. We inject the retrieved documents into the prompt as context, so the LLM answers using those facts instead of its training data.
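A minimal sketch of the augmentation step (the exact prompt wording is illustrative, not canonical):

```python
def build_prompt(question, retrieved_texts):
    # Inject the retrieved facts as context and tell the model to rely on them.
    context = "\n".join(f"- {text}" for text in retrieved_texts)
    return (
        "Answer the question using only the context below. "
        "If the context does not cover the question, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(build_prompt(
    "Which planet is the biggest?",
    ["Jupiter is the largest planet in the solar system."],
))
```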
Step 4 — The Complete Pipeline
Let's wire everything together and test with different queries:
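A self-contained sketch of the whole pipeline. Since we promised no API calls, it stops at the assembled prompt; the query vectors are hand-encoded assumptions standing in for a real embedding model:

```python
import math

# Five-fact knowledge base; dimensions loosely mean
# [size, temperature, moons, orbital period, distance].
DOCS = [
    ("Jupiter is the largest planet in the solar system.",            [0.9, 0.1, 0.2, 0.1, 0.1]),
    ("The Sun's surface temperature is about 5,500 degrees Celsius.", [0.1, 0.9, 0.0, 0.0, 0.1]),
    ("Saturn has the most confirmed moons of any planet.",            [0.3, 0.0, 0.9, 0.1, 0.2]),
    ("Mars takes about 687 Earth days to orbit the Sun.",             [0.2, 0.1, 0.0, 0.9, 0.2]),
    ("Neptune is the farthest planet from the Sun.",                  [0.2, 0.0, 0.1, 0.2, 0.9]),
]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, top_k=1):
    scored = sorted(((cosine_similarity(query_vec, v), t) for t, v in DOCS),
                    reverse=True)
    return [text for _, text in scored[:top_k]]

def build_prompt(question, facts):
    context = "\n".join(f"- {f}" for f in facts)
    return ("Answer using only the context below. If the context does not "
            "cover the question, say \"I don't know.\"\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# Hand-encoded query vectors; a real system would embed the query text.
queries = {
    "How hot is the Sun's surface?":    [0.0, 0.9, 0.0, 0.0, 0.1],
    "Which planet has the most moons?": [0.2, 0.0, 0.9, 0.0, 0.0],
}
for question, qvec in queries.items():
    print(build_prompt(question, retrieve(qvec)), end="\n\n")
```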
Each query pulls different documents. The temperature question surfaces the Sun fact; the moons question surfaces Saturn. The right context for the right question, every time — that is the essence of RAG.
Write a function called build_cited_prompt that takes two arguments:
1. question — a string with the user's question
2. sources — a list of dictionaries, each with keys "text" and "title"
The function should return a string with:
1. A header line: "Answer the question using ONLY the sources below. Cite sources by number."
2. One line per source, numbered: "[1] Title: text"
3. "Question: " followed by the question
4. "Answer:" as the final line

Scaling Up — Vector Search with NumPy
Hand-crafted embeddings helped us see the mechanics. Real systems use vectors with hundreds of dimensions and search across thousands of documents. NumPy handles this efficiently. Let's simulate realistic embeddings to demonstrate search at a larger scale.
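A sketch of that simulation. The documents, dimensions, and shared-topic trick below are assumptions of this example: we give the odd-indexed "ML" documents a common signal vector, then rank all ten documents against an ML-flavoured query with one matrix-vector product:

```python
import numpy as np

rng = np.random.default_rng(42)
n_docs, dim = 10, 128

# Random base embeddings for 10 documents.
doc_vecs = rng.normal(size=(n_docs, dim))

# Give the ML-related documents (odd indices 1, 3, 5, 7, 9) a shared
# topic signal, mimicking how a real model clusters related content.
ml_signal = rng.normal(size=dim)
doc_vecs[1::2] += 2.0 * ml_signal

# The query is about ML too, so it carries the same signal.
query = rng.normal(size=dim) + 2.0 * ml_signal

# Normalise everything; cosine similarity is then a matrix-vector product.
doc_norms = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
q_norm = query / np.linalg.norm(query)
scores = doc_norms @ q_norm

top = np.argsort(scores)[::-1]
print("Ranking (best first):", top.tolist())
```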
The ML-related documents (indices 1, 3, 5, 7, 9) share a common signal in their embeddings. This mimics how a real embedding model would place topically related documents near each other in vector space.
ML-related documents cluster near the top because they share embedding signal. Real embedding models learn these relationships from training data. The key point: vector search finds semantically related content even when exact words don't match.
RAG vs Fine-Tuning vs Long Context
RAG isn't the only way to get custom knowledge into an LLM. Two alternatives come up constantly: fine-tuning the model and stuffing everything into a long context window. Each solves a different problem, and I've seen teams waste months picking the wrong one.
My rule of thumb: start with RAG. It's cheapest to set up, easiest to update, and handles the most common use case well. Move to fine-tuning only when you need the model to adopt a specific tone that prompt engineering can't achieve. Use long context for one-off analysis of a small document set.
# Train a custom model on company docs
# Cost: $500+ in GPU compute
# Time: 2-3 days
# Result: model "memorises" facts poorly
# Cannot update without retraining
# fine_tune(base_model, company_docs)
# >> Still hallucinates on recent data

# Index company docs as embeddings
# Cost: ~$0.01 for 1000 pages
# Time: minutes
# Result: always retrieves current facts
# Updates instantly when docs change
# rag_query("Q3 revenue?", doc_index)
# >> Cites the actual Q3 report

What a Production RAG System Looks Like
Our toy pipeline captured the core logic. Production systems add a few more layers to handle real-world complexity: careful document chunking, a vector database for scale, hybrid keyword-plus-vector search, and prompt-level safeguards such as fallback instructions.
Chunking — The Most Underestimated Step
How you split documents matters more than most people realise. I've seen RAG quality jump 20-30% from chunking changes alone, with zero model changes. Chunk too large and you waste context on irrelevant text. Chunk too small and you lose surrounding context.
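A minimal character-based chunker with overlap (a sketch; production systems usually chunk by tokens or sentences, and the sizes below are illustrative):

```python
def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into chunks; consecutive chunks share `overlap` characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        # Step forward by less than a full chunk so boundaries overlap.
        start += chunk_size - overlap
    return chunks

doc = "word " * 200  # 1,000 characters of filler text
pieces = chunk_text(doc, chunk_size=300, overlap=50)
print(len(pieces), [len(p) for p in pieces])  # 4 chunks: 300, 300, 300, 250
```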
The overlap parameter ensures sentences at chunk boundaries appear in at least one chunk. Without overlap, a key fact sitting right at a boundary might be split in half and lost to retrieval.
Write a function called simple_chunker that takes:
1. text — a string to split
2. max_chars — maximum characters per chunk (default 100)
The function should split text into chunks of at most max_chars characters, breaking only at space boundaries (never in the middle of a word). Return a list of strings.
Rules:
If adding the next word would push the current chunk past max_chars, start a new chunk.

Common RAG Mistakes and How to Avoid Them
After reviewing dozens of RAG implementations, the same mistakes keep coming back. Here are the ones that cause real damage:
Mistake 1: Chunks that are too large. When chunks exceed 2000 tokens, retrieval becomes imprecise. The embedding captures the average topic of the chunk, not any specific fact. The model then has to sift through noise to find the actual answer.
# Entire page as one chunk
chunk = full_page # 5000 tokens
# Embedding captures average topic
# Retrieval returns vaguely related text
# Model drowns in irrelevant context

# 200-500 token chunks with overlap
chunks = chunk_text(page, 300, 50)
# Each embedding captures one topic
# Retrieval is precise
# Model gets focused context

Mistake 2: No fallback instruction in the prompt. Without telling the model to say "I don't know," it generates from training data when retrieved context doesn't cover the question. This is the exact problem RAG was supposed to solve.
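A minimal sketch of such a fallback instruction (the wording is illustrative, not a canonical template):

```python
def build_grounded_prompt(question, context):
    # The second line is the fallback instruction: without it, the model
    # answers from training data whenever the context falls short.
    return (
        "Answer using ONLY the context below.\n"
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(build_grounded_prompt(
    "What was Q3 revenue?",
    "[retrieved chunks would go here]",
))
```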
Mistake 3: Blaming the LLM when retrieval failed. In my experience, 80% of bad RAG answers trace back to the retriever. The right document never made it into the prompt. Always check retrieval quality before debugging generation.
Frequently Asked Questions
How much data can a RAG system handle?
Vector databases like Pinecone and Weaviate scale to billions of vectors. The bottleneck is usually the initial embedding step, not search. A million documents might take a few hours to embed; searching takes milliseconds.
Do I need a vector database?
For under 100,000 documents, NumPy or FAISS in memory works fine. For larger datasets, a dedicated vector database adds indexing structures (HNSW, IVF) that make search sublinear. PostgreSQL with pgvector is a solid middle ground.
Can I combine keyword search and vector search?
Yes — this is called hybrid search. Vector search finds semantically similar content ("automobile" matches "car"). Keyword search catches exact terms vector search might rank lower: product IDs, acronyms, proper nouns. Most production systems use both.
Does RAG completely eliminate hallucination?
No. RAG reduces it significantly, but the model can still hallucinate in two ways: (1) the retriever misses the right chunk, so the model falls back to training data, and (2) the model extends beyond what the context says. The fallback instruction and careful prompt engineering mitigate both.