
Build RAG from Scratch: No Frameworks, Just Python + OpenAI + FAISS

Intermediate · 120 min · 3 exercises · 55 XP

Ask ChatGPT about your company's internal docs and it will confidently make things up. It has no access to your data. RAG — Retrieval-Augmented Generation — fixes this by fetching relevant documents before the LLM generates an answer. Most tutorials hand you LangChain and say "trust the abstractions." We're going to build every piece ourselves so you actually understand what those abstractions hide.

What Is RAG and Why Build It from Scratch?

RAG is a two-step pattern: retrieve context from your own data, then generate an answer using that context. The LLM never sees your entire knowledge base — just the few chunks most relevant to the question.

I spent months using LangChain before I actually understood what was happening under the hood. When retrieval quality dropped, I had no idea which layer to debug — the chunking? The embeddings? The prompt template? Building RAG from raw components taught me more in a weekend than six months of framework-driven development.

Here's the full pipeline we'll build, piece by piece:

[Diagram: The RAG pipeline at a glance — documents → chunk → embed → index; at query time: question → embed → retrieve top chunks → grounded generation → answer]

By the end of this tutorial, each of those boxes will be a function you wrote and understand completely. No magic, no hidden prompts, no framework lock-in.

Preparing a Knowledge Base

Every RAG system starts with documents. Ours will be a small collection of paragraphs about Python — short enough to read in full, long enough to make retrieval meaningful. In production you'd load PDFs, web pages, or database records. The pattern is identical.

Building a sample knowledge base
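A minimal version of such a knowledge base might look like this. The document ids and texts are illustrative choices, not part of any fixed schema:

```python
# A tiny hand-written knowledge base: five short paragraphs about Python.
# In production you would load PDFs, web pages, or database records into
# the same {"id": ..., "text": ...} structure.
documents = [
    {"id": "doc_variables",
     "text": ("Python variables are names bound to objects. Assignment does "
              "not copy data; it binds a name to an object. Because Python "
              "is dynamically typed, the same name can be rebound to objects "
              "of different types during a program's execution.")},
    {"id": "doc_functions",
     "text": ("Functions in Python are defined with the def keyword and are "
              "first-class objects: they can be passed as arguments, "
              "returned from other functions, and stored in data structures. "
              "Default argument values are evaluated once, at definition "
              "time.")},
    {"id": "doc_generators",
     "text": ("Generators are functions that use the yield keyword to "
              "produce a sequence of values lazily. Each call to next() "
              "resumes the function where it left off. Generators are "
              "memory-efficient because they produce one value at a time "
              "instead of building a full list.")},
    {"id": "doc_decorators",
     "text": ("A decorator is a callable that takes a function and returns "
              "a new function, usually adding behavior around the original. "
              "The @decorator syntax is shorthand for reassigning the name "
              "to the wrapped function.")},
    {"id": "doc_exceptions",
     "text": ("Python handles errors with exceptions. Code that might fail "
              "is wrapped in a try block, and except clauses handle "
              "specific exception types. The finally clause always runs, "
              "making it the right place for cleanup such as closing "
              "files.")},
]
```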

Five documents, each a paragraph long. Small enough to inspect manually, which is exactly what you want when building a pipeline — you can verify every step by eye.

Text Chunking — Breaking Documents into Bite-Sized Pieces

Why not embed entire documents? Two reasons. First, embedding models have token limits (8,191 tokens for OpenAI's text-embedding-3-small). Second — and more importantly — a long document about five different topics will produce a single vector that's the average of all five topics. A question about just one of those topics won't match well. Smaller chunks give you sharper retrieval.

The simplest chunking strategy splits text by character count with overlap. The overlap ensures you don't cut a sentence in half and lose meaning at the boundary.

A simple character-based chunker
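A sketch of such a chunker; the function name and default sizes are our own choices:

```python
def chunk_by_chars(text, chunk_size=300, overlap=50):
    """Split text into fixed-size character chunks, with `overlap`
    characters repeated between consecutive chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Step forward by less than a full chunk so chunks share a boundary.
        start += chunk_size - overlap
    return chunks
```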

Character-based splitting works but it's blunt — it can cut words in half. A better approach splits on sentence boundaries first, then groups sentences into chunks that fit the size limit.

Sentence-aware chunking with overlap
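One possible implementation, using a naive regex sentence splitter. Real sentence-boundary detection is harder (abbreviations, decimals), so treat this as a sketch:

```python
import re

def chunk_by_sentences(text, max_chunk_size=300, overlap_sentences=1):
    """Group sentences into chunks of at most max_chunk_size characters,
    carrying the last overlap_sentences sentences into the next chunk."""
    # Naive splitter: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chunk_size:
            chunks.append(" ".join(current))
            # Start the next chunk with the trailing overlap sentences.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```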

Turning Text into Vectors — Embeddings

An embedding is a list of numbers (a vector) that captures the meaning of a piece of text. Two texts about similar topics will have vectors that point in similar directions. Two texts about unrelated topics will point in different directions. This is the core mechanic that makes RAG work — we find relevant chunks by measuring which chunk vectors are closest to the question vector.

Getting your first embedding from OpenAI
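Assuming the openai Python package is installed and OPENAI_API_KEY is set in the environment, a first embedding call looks roughly like this:

```python
# pip install openai  (requires OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Generators produce values lazily, one at a time.",
)
vector = response.data[0].embedding
print(len(vector))   # 1536 dimensions for text-embedding-3-small
print(vector[:5])    # first five components of the vector
```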

Each text becomes a 1,536-dimensional vector (for text-embedding-3-small). You can't visualize 1,536 dimensions, but the math works the same as in 2D or 3D — vectors that point in similar directions represent similar meanings.

Time to embed all our chunks. We can batch them into a single API call since OpenAI's embedding endpoint accepts a list.

Embedding all chunks at once
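A batched version might look like this; embed_texts and chunk_texts are our own names, not library identifiers:

```python
from openai import OpenAI

client = OpenAI()

def embed_texts(texts, model="text-embedding-3-small"):
    """Embed a list of strings in one API call, one vector per string.
    The API preserves input order, so vectors[i] belongs to texts[i]."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

# chunk_texts is assumed to be the list of chunk strings built earlier:
# chunk_vectors = embed_texts(chunk_texts)
```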

We have vectors for every chunk. Now we need to find which chunks are closest to a given question. The standard measure is cosine similarity — it compares the angle between two vectors, ignoring their magnitude. A cosine similarity of 1.0 means identical direction (same meaning), 0.0 means unrelated, and -1.0 means opposite.

Cosine similarity in NumPy
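In NumPy this is a few lines:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, in [-1.0, 1.0]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```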

Chunks from the same document (same topic) score higher than chunks from different documents. That's exactly the signal we need for retrieval.

Let's build a proper search function that ranks all chunks against a query and returns the top matches.

Searching the vector store
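A sketch of that search function, assuming chunk_vectors and chunks are parallel lists built in the previous steps:

```python
import numpy as np

def search(query_vector, chunk_vectors, chunks, top_k=3):
    """Rank every chunk against the query by cosine similarity;
    return the top_k (score, chunk_text) pairs, best first."""
    q = np.asarray(query_vector, dtype=float)
    q = q / np.linalg.norm(q)  # normalize once; dot product is then cosine
    scored = []
    for vector, chunk in zip(chunk_vectors, chunks):
        v = np.asarray(vector, dtype=float)
        score = float(np.dot(q, v / np.linalg.norm(v)))
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```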

The generator chunks should rank at the top — they are semantically closest to the question. The retrieval step is done. We found the relevant context without keyword matching, without full-text search, without any framework.

Grounded Generation — Answering with Retrieved Context

This is where the pieces come together. We take the retrieved chunks, inject them into a prompt, and ask the LLM to answer based only on that context. The system prompt is critical: it tells the model to stay grounded in the provided context and admit when the answer isn't there.

Building the RAG prompt
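One way to assemble the grounded prompt; the exact system wording below is a suggestion, not the only correct phrasing:

```python
def build_rag_messages(query, retrieved_chunks):
    """Assemble chat messages that ground the model in retrieved context."""
    # Number each chunk so the model can cite sources by [number].
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    system = (
        "You are a helpful assistant. Answer ONLY using the provided "
        "context. If the context does not contain the answer, say you "
        "don't know. Cite sources by their [number]."
    )
    user = f"Context:\n{context}\n\nQuestion: {query}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```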

That system message is doing heavy lifting. Without it, the model might ignore the context and answer from its training data — defeating the entire purpose of RAG.

The complete RAG pipeline
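Wiring the earlier sketches together, a complete round trip could look like this. It assumes the search() and build_rag_messages() helpers defined above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rag_answer(query, chunks, chunk_vectors, top_k=3):
    """Full round trip: embed the query, retrieve, generate a grounded
    answer. Relies on search() and build_rag_messages() defined earlier."""
    query_vector = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    top = search(query_vector, chunk_vectors, chunks, top_k=top_k)
    messages = build_rag_messages(query, [chunk for _, chunk in top])
    completion = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, temperature=0
    )
    return completion.choices[0].message.content
```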

That's a working RAG system in about 60 lines of actual logic. The answer is grounded in your documents, cites its sources, and won't hallucinate about topics not in the knowledge base.

Build a Word-Count Chunker
Write Code

Write a function chunk_by_words(text, max_words=50, overlap_words=10) that splits text into chunks of at most max_words words, with overlap_words words of overlap between consecutive chunks. Return a list of strings.

Example: if max_words=4 and overlap_words=1 and the text is "the quick brown fox jumps over", the chunks should be ["the quick brown fox", "fox jumps over"].


Testing Retrieval Quality — Does It Actually Find the Right Chunks?

A RAG system is only as good as its retrieval step. If the wrong chunks land in the prompt, the LLM will produce confidently wrong answers — and you'll blame the model when the real culprit is your search. I always test retrieval separately before touching generation. Here's a simple evaluation function that checks whether the expected document appears in the top results.

Evaluating retrieval accuracy
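A minimal version of such an evaluation loop. The retrieve callback and the (question, expected_doc_id) test-case shape are our own conventions:

```python
def evaluate_retrieval(test_cases, retrieve):
    """test_cases: list of (question, expected_doc_id) pairs.
    retrieve: function mapping a question to a list of doc ids, best first.
    Returns the fraction of questions whose expected doc was retrieved."""
    hits = 0
    for question, expected_id in test_cases:
        if expected_id in retrieve(question):
            hits += 1
    return hits / len(test_cases)
```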

On this small knowledge base you should see 100% accuracy — each question retrieves chunks from the right document. In production with thousands of documents, this evaluation loop is how you catch problems early. When accuracy drops, you know to adjust chunk size, overlap, or embedding model before touching the generation step.

When the Answer Isn't in Your Documents

One of the most important behaviors to test: what happens when someone asks a question your knowledge base can't answer? A well-designed RAG system should say "I don't know" rather than hallucinate.

Testing with an out-of-scope question
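A sketch of that probe, reusing the names assumed in the earlier snippets (chunk_texts, chunk_vectors, search, rag_answer):

```python
from openai import OpenAI

client = OpenAI()

# A question the Python knowledge base cannot possibly answer.
question = "Who won the 2018 World Cup?"

query_vector = client.embeddings.create(
    model="text-embedding-3-small", input=question
).data[0].embedding

# Print retrieval scores so you can compare them to an on-topic question.
for score, chunk in search(query_vector, chunk_vectors, chunk_texts, top_k=3):
    print(f"{score:.3f}  {chunk[:60]}...")

# With the grounding system prompt, the model should decline to answer.
print(rag_answer(question, chunk_texts, chunk_vectors))
```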

The similarity scores for the retrieved chunks should be noticeably lower than for on-topic questions. The system prompt tells the LLM to acknowledge when the context is insufficient. This is a feature, not a limitation — you want your RAG to be honest about its knowledge boundaries.

Implement Cosine Similarity from Scratch
Write Code

Write a function cosine_sim(a, b) that computes the cosine similarity between two lists of numbers. Do NOT use NumPy — use only built-in Python. Cosine similarity = dot product / (magnitude_a * magnitude_b). Return the result as a float rounded to 4 decimal places.

Formula: cos(a, b) = sum(a_i * b_i) / (sqrt(sum(a_i^2)) * sqrt(sum(b_i^2)))


Scaling Up — FAISS for Production Vector Search

Our NumPy cosine similarity search works perfectly for a few hundred chunks. But it compares the query against every chunk — that's O(n) per query. With 100,000 documents and 500,000 chunks, each search takes noticeable time. This is where FAISS (Facebook AI Similarity Search) comes in.

FAISS builds an index structure that enables approximate nearest-neighbor search in sub-linear time. For most use cases, the results are identical to brute-force search but orders of magnitude faster. Here's what the code looks like — this won't run in the browser but it's what you'd use in a Python script or server.

FAISS vector search (local Python only)
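A sketch of the FAISS variant. It requires the faiss-cpu package; we normalize vectors so that inner product equals cosine similarity, matching the NumPy search:

```python
# pip install faiss-cpu
import faiss
import numpy as np

# chunk_vectors: the chunk embeddings computed earlier (list of lists)
matrix = np.asarray(chunk_vectors, dtype="float32")
faiss.normalize_L2(matrix)                   # unit vectors: inner product == cosine
index = faiss.IndexFlatIP(matrix.shape[1])   # exact inner-product index
index.add(matrix)

def faiss_search(query_vector, top_k=3):
    """Return (score, chunk_index) pairs, best first."""
    q = np.asarray([query_vector], dtype="float32")
    faiss.normalize_L2(q)
    scores, indices = index.search(q, top_k)
    return list(zip(scores[0].tolist(), indices[0].tolist()))
```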

The key insight: our NumPy search and FAISS search produce the same results. FAISS just does it faster. The RAG logic — chunking, embedding, prompting — stays exactly the same. That's why building from scratch first matters: you understand every piece, so swapping the search backend is a one-function change.

Choosing a FAISS Index Type

FAISS offers several index types for different tradeoffs. Here's a practical guide:

FAISS index types comparison
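In code, the common choices look like this. The class names are real FAISS types; the parameter values shown are illustrative starting points, not tuned settings:

```python
import faiss

d = 1536  # dimension of text-embedding-3-small vectors

# IndexFlatIP: exact search, no training step. A sensible default
# until your collection grows toward millions of vectors.
flat = faiss.IndexFlatIP(d)

# IndexIVFFlat: clusters vectors into nlist cells and searches only
# nprobe of them per query. Approximate, faster at scale, and needs
# a training pass on representative vectors before adding data.
nlist = 1024
ivf = faiss.IndexIVFFlat(faiss.IndexFlatIP(d), d, nlist,
                         faiss.METRIC_INNER_PRODUCT)
ivf.nprobe = 16  # cells probed per query: higher = more accurate, slower

# IndexHNSWFlat: graph-based, fast queries, no training step,
# but higher memory use per vector.
hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 = graph connectivity (M)
```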

Common Mistakes and How to Fix Them

I've debugged a lot of RAG systems. These are the problems that come up most often, roughly in order of how much time they waste.

Mistake 1: Chunks Are Too Large

Entire document as one chunk
# Bad: embedding an entire page as one chunk
chunks = [{"text": full_document_text}]
# The embedding averages all topics together
# Retrieval becomes imprecise
Sentence-aware chunks with overlap
# Good: focused chunks with overlap
chunks = chunk_by_sentences(
    full_document_text,
    max_chunk_size=300,
    overlap_sentences=1
)
# Each chunk covers one focused topic

Mistake 2: No System Prompt Grounding

Without explicit instructions to stay grounded, the LLM will happily answer from its training data instead of your documents. This defeats the purpose of RAG entirely — you wanted answers from your data, not the internet.

No grounding instruction
messages = [
    {"role": "user", "content": f"Context: {context}\nQuestion: {query}"}
]
# LLM may ignore context and answer from training data
Explicit grounding in system prompt
messages = [
    {"role": "system", "content":
        "Answer ONLY from the provided context. "
        "If the answer is not in the context, say so."},
    {"role": "user", "content": f"Context: {context}\nQuestion: {query}"}
]

Mistake 3: Mixing Embedding Models

Embedding your documents with text-embedding-3-small and your queries with text-embedding-3-large produces vectors in different spaces. Similarity scores become meaningless. Always use the same model for both.

Mistake 4: Not Testing Retrieval Separately

When a RAG system gives a wrong answer, developers usually blame the LLM. But nine times out of ten, the problem is retrieval — the right chunks never made it into the prompt. Always test retrieval independently, the way we did with our evaluate_retrieval function, before debugging generation.

Build a Top-K Retrieval Function
Write Code

Write a function top_k_similar(query_vec, vectors, k=3) that takes a query vector (list of floats), a list of vectors (list of lists of floats), and returns the indices of the k most similar vectors by cosine similarity, sorted from most to least similar.

Use only math — no NumPy.

Return a list of integers (the indices).


Putting It All Together — The Complete RAG Pipeline

Let's wire everything into a clean, reusable class. This is the code structure I use as a starting point for real projects — it keeps the four RAG stages (chunk, embed, retrieve, generate) clearly separated so you can swap any piece independently.

SimpleRAG class — Part 1: indexing
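The indexing half might be sketched as follows, assuming the chunk_by_sentences() helper from the chunking section:

```python
from openai import OpenAI

class SimpleRAG:
    """Minimal RAG pipeline: chunk -> embed -> retrieve -> generate."""

    def __init__(self, embed_model="text-embedding-3-small",
                 chat_model="gpt-4o-mini"):
        self.client = OpenAI()   # reads OPENAI_API_KEY from the environment
        self.embed_model = embed_model
        self.chat_model = chat_model
        self.chunks = []         # chunk texts
        self.vectors = []        # one embedding per chunk, same order

    def add_documents(self, documents):
        """Chunk each document, then embed all new chunks in one API call."""
        new_chunks = []
        for doc in documents:
            new_chunks.extend(chunk_by_sentences(doc["text"]))
        response = self.client.embeddings.create(
            model=self.embed_model, input=new_chunks
        )
        self.chunks.extend(new_chunks)
        self.vectors.extend(item.embedding for item in response.data)
```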
SimpleRAG class — Part 2: querying
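And the query half, shown here as standalone functions attached to the SimpleRAG class from Part 1 for readability (a sketch, continuing that class):

```python
import numpy as np

def retrieve(self, query, top_k=3):
    """Embed the query and return the top_k (score, chunk) pairs."""
    q = np.asarray(self.client.embeddings.create(
        model=self.embed_model, input=query
    ).data[0].embedding)
    q = q / np.linalg.norm(q)
    scored = []
    for vector, chunk in zip(self.vectors, self.chunks):
        v = np.asarray(vector)
        scored.append((float(np.dot(q, v / np.linalg.norm(v))), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def query(self, question, top_k=3):
    """Retrieve context, then generate a grounded answer."""
    top = self.retrieve(question, top_k=top_k)
    context = "\n\n".join(chunk for _, chunk in top)
    messages = [
        {"role": "system", "content":
            "Answer ONLY from the provided context. "
            "If the answer is not in the context, say so."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    completion = self.client.chat.completions.create(
        model=self.chat_model, messages=messages, temperature=0
    )
    return completion.choices[0].message.content

# Attach the methods to the class defined in Part 1.
SimpleRAG.retrieve = retrieve
SimpleRAG.query = query
```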

That's the entire system: a class with two methods. add_documents handles chunking and embedding. query handles retrieval and generation. Everything else — FAISS indexing, caching, streaming — is an optimization you layer on top.

Where to Go from Here

You've built a working RAG system from zero. Here's what you'd add for production use, roughly in priority order:

1. Persistent vector storage. Right now, embeddings live in memory and vanish when the script ends. Use FAISS to save/load indices to disk, or use a managed vector database like Pinecone, Weaviate, or Qdrant.

2. Better chunking strategies. Split on section headers, paragraph breaks, or semantic boundaries rather than pure character count. For code documentation, keep function signatures with their docstrings.

3. Metadata filtering. Tag chunks with metadata (date, author, category) and filter before vector search. "What changed in the API this month?" should only search chunks from the current month.

4. Re-ranking. After vector search returns the top 20 candidates, use a cross-encoder model to re-rank them. Cross-encoders are slower but more accurate because they see the query and chunk together.

5. Hybrid search. Combine vector similarity with keyword matching (BM25). Some queries — especially those with specific names, error codes, or IDs — are better served by exact keyword match than by semantic similarity.

Frequently Asked Questions

How much does the OpenAI embedding API cost?

text-embedding-3-small costs $0.02 per million tokens. Embedding 10,000 chunks of 300 characters each (roughly 75 tokens each) uses about 750,000 tokens — that's under $0.02. Embeddings are a one-time cost per document; you only re-embed when the content changes.

Should I use text-embedding-3-small or text-embedding-3-large?

Start with text-embedding-3-small (1,536 dimensions). It's cheaper, faster, and sufficient for most RAG use cases. Switch to text-embedding-3-large (3,072 dimensions) only if you've measured retrieval accuracy and found it lacking. The larger model costs 6.5x more per token.

Can I use open-source embedding models instead of OpenAI?

Yes. Models like sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) or BAAI/bge-small-en-v1.5 (384 dimensions) run locally with the sentence-transformers library. They're free, private, and competitive with OpenAI on many benchmarks. The tradeoff is that you need to run inference yourself, which requires more compute.

Alternative: open-source embeddings
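A minimal sketch with sentence-transformers; the model weights are downloaded on first use:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

vectors = model.encode(
    ["Generators produce values lazily, one at a time.",
     "A decorator wraps one function with another."],
    normalize_embeddings=True,  # unit vectors: dot product == cosine similarity
)
print(vectors.shape)  # (2, 384): two 384-dimensional embeddings
```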

How many chunks should I retrieve (top_k)?

Start with top_k=3 and adjust. Too few chunks and you miss relevant context. Too many and you dilute the signal with irrelevant text, waste tokens, and may confuse the LLM. Monitor your context window budget — gpt-4o-mini supports 128K tokens, but answers tend to degrade when you stuff in more than a few thousand tokens of context.

What is the difference between RAG and fine-tuning?

Fine-tuning bakes knowledge into the model's weights — it changes how the model behaves. RAG provides knowledge at query time through the prompt — it changes what the model sees. Use RAG when your data changes frequently (docs, knowledge bases). Use fine-tuning when you want the model to adopt a specific style or follow domain-specific reasoning patterns. They are complementary, not competing, approaches.

References

  • Lewis, P. et al. — "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. arXiv:2005.11401
  • OpenAI — Embeddings API documentation. Link
  • OpenAI — Text embedding models and pricing. Link
  • Johnson, J., Douze, M., Jégou, H. — "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data, 2019. arXiv:1702.08734
  • FAISS documentation — Facebook Research. Link
  • Sentence-Transformers documentation. Link
  • Gao, L. et al. — "Precise Zero-Shot Dense Retrieval without Relevance Labels." ACL 2023. arXiv:2212.10496