Chunking Strategies for RAG: 8 Methods Compared in Python
You build a RAG pipeline. The retrieval looks fine on three test queries. Then a user asks something that spans two paragraphs in your source document — and the answer comes back garbled, because your chunking strategy split the key sentence right down the middle.
Chunking is the most under-discussed bottleneck in retrieval quality. Most tutorials default to "recursive splitting at 512 tokens" and move on. That works — until it doesn't. In this lab, we implement 8 different chunking strategies from scratch in pure Python, run them on the same document, and benchmark which ones actually retrieve the right answers.
Here's the pipeline we're building. A single technical document goes in. Eight chunking functions each slice that document into overlapping pieces — from brute-force character splits to meaning-aware boundaries. Each strategy produces a different set of chunks. We then build a simple bag-of-words retrieval engine (no external embeddings needed) and fire the same 8 test questions at every strategy. The output is a head-to-head recall table: same document, same queries, eight strategies, one winner.
Prerequisites: Familiarity with RAG fundamentals and basic Python string operations. No external libraries required — everything runs in your browser.
Why Chunking Strategy Matters More Than Most People Think
A chunking strategy determines where your document gets cut into pieces for retrieval. Get it wrong, and your retriever returns fragments that are too short to be useful, too long to be precise, or split right through the sentence that contains the answer.
I've seen production RAG systems where changing the chunk size from 1000 to 400 characters improved answer accuracy by 25%. The model was the same. The embeddings were the same. Only the splits changed.
Three things go wrong with bad chunking:
| Problem | What Happens | Example |
|---|---|---|
| Information splitting | A key fact gets cut across two chunks | "The learning rate should be" / "0.001 for transformers" |
| Context dilution | Chunk is so large that the relevant sentence drowns in noise | A 2000-char chunk where only 1 sentence matters |
| Boundary artifacts | Chunks start or end mid-sentence, confusing the retriever | "...and therefore the model. ## Training" |
---
> Key Insight: Chunking isn't a preprocessing step you set once and forget. It's a retrieval hyperparameter — one that often matters more than which embedding model you pick. Research from NVIDIA (2024) found that chunk boundary placement explained more variance in retrieval accuracy than embedding model choice across their benchmark suite.
Setup: Our Test Document and Retrieval Engine
Before we compare strategies, we need two things: a realistic document to chunk, and a simple retrieval engine to measure quality. The document below is a technical article about training neural networks — with headings, paragraphs, and lists. Every strategy will chunk this same text.
The first code cell defines the document as a Python string and prints basic statistics. We'll reuse this DOCUMENT variable throughout the entire notebook.
We also need a way to search chunks without external embedding libraries. The cell below builds a bag-of-words retrieval engine from scratch. It converts each chunk into a word-frequency vector, computes cosine similarity against a query, and returns the top matches.
The SimpleRetriever class does three things: index tokenizes each chunk into lowercase words and builds term-frequency vectors, _cosine_sim computes similarity between two vectors, and search ranks all indexed chunks against a query and returns the top-k results.
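The original code cell isn't reproduced here, but a minimal sketch consistent with that description might look like the following. The class and method names (SimpleRetriever, index, _cosine_sim, search) come from the text; the tokenization details are my assumption, so exact similarity scores may differ slightly from the output shown below.

```python
import math
from collections import Counter

class SimpleRetriever:
    """Bag-of-words retriever: term-frequency vectors plus cosine similarity."""

    def __init__(self):
        self.chunks = []
        self.vectors = []

    def _tokenize(self, text):
        # Lowercase, keep alphanumeric runs as words (assumed tokenization)
        cleaned = ''.join(c if c.isalnum() else ' ' for c in text.lower())
        return cleaned.split()

    def index(self, chunks):
        """Tokenize each chunk and build term-frequency vectors."""
        self.chunks = list(chunks)
        self.vectors = [Counter(self._tokenize(c)) for c in self.chunks]

    def _cosine_sim(self, a, b):
        """Cosine similarity between two term-frequency Counters."""
        common = set(a) & set(b)
        dot = sum(a[w] * b[w] for w in common)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query, top_k=3):
        """Rank all indexed chunks against the query, return top-k (score, chunk) pairs."""
        qvec = Counter(self._tokenize(query))
        scored = [(self._cosine_sim(qvec, v), c)
                  for v, c in zip(self.vectors, self.chunks)]
        return sorted(scored, key=lambda p: p[0], reverse=True)[:top_k]
```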
Sanity check — searching for 'cat sitting on mat':
Score 0.5976: The cat sat on the mat
Score 0.0000: Dogs love to fetch balls
Strategy 1: Fixed-Size Character Splitting (The Baseline)
The simplest possible strategy. Pick a chunk size (say 500 characters) and an overlap (say 100 characters), then slide a window across the text. The overlap ensures that a sentence straddling a boundary appears in at least one chunk intact.
Fixed-size splitting is the baseline every other strategy should beat. It's fast, predictable, and requires zero understanding of the text. Its weakness is equally obvious: it cuts through sentences, paragraphs, and headings without any awareness of meaning.
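The sliding-window logic can be sketched in a few lines. The function name and the default sizes follow the prose above; everything else is a straightforward implementation choice.

```python
def fixed_size_chunks(text, chunk_size=500, overlap=100):
    """Slide a window of chunk_size characters, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the final window already reached the end of the text
    return chunks
```

Note that each chunk's last `overlap` characters reappear at the start of the next chunk, which is exactly the rescue mechanism the text describes.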
Notice how chunks cut right through sentences. The overlap rescues some broken sentences by including them in the next chunk — but not every mid-sentence split gets lucky. If the overlap window lands inside a different sentence, that sentence stays broken across both chunks.
Fixed-size splitting is still useful as a baseline benchmark and for contexts where you need absolute predictability about chunk size, like strict token budgets.
---
Strategy 2: Recursive Character Splitting (The Popular Default)
If you've used LangChain, you've probably used RecursiveCharacterTextSplitter. It tries to split on the largest natural boundary that keeps chunks under the size limit. The separator hierarchy is: double newline (paragraph break), single newline, period-space, space, then individual characters.
The idea is elegant: prefer paragraph-level splits, but fall back to sentence-level or word-level when paragraphs are too long. Here it is from scratch.
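A from-scratch sketch of that fallback hierarchy follows. This is my approximation of the behavior described, not LangChain's actual implementation; the separator order matches the text above.

```python
def recursive_split(text, chunk_size=500,
                    separators=("\n\n", "\n", ". ", " ", "")):
    """Split on the coarsest separator that yields pieces under chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard character split
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = (current + sep + part) if current else part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current.strip():
                chunks.append(current)
            if len(part) > chunk_size:
                # This piece is still too big: fall back to the next separator
                chunks.extend(recursive_split(part, chunk_size, rest))
                current = ""
            else:
                current = part
    if current.strip():
        chunks.append(current)
    return chunks
```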
Recursive splitting typically produces more chunks with shorter average length than fixed-size. That's because it prefers to split at paragraph boundaries even when that means leaving unused space in a chunk. The trade-off: more chunks to search, but each one is more likely to be a coherent unit of information.
> Key Insight: Recursive splitting is still size-based — it just prefers clean boundaries. It can't tell the difference between a paragraph break within a section and a section boundary. For documents with strong heading structure, strategy 6 (markdown-aware) exploits that distinction.
---
Strategy 3: Token-Based Splitting (Precise Budget Control)
LLMs don't see characters — they see tokens. A 500-character chunk might be 100 tokens or 150 depending on the vocabulary. When you need precise control over how many tokens each chunk consumes in a context window, character-based splitting is a rough proxy at best.
Token-based splitting counts words (as a token approximation) and splits at word boundaries. Real tokenizers like BPE have their own logic, but word-level counting gives a solid approximation without importing tiktoken.
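Under that word-as-token approximation, a sketch might look like this (function and parameter names are mine):

```python
def token_chunks(text, max_tokens=100, overlap_tokens=20):
    """Word-level splitting as a cheap token approximation."""
    words = text.split()
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_tokens]
        chunks.append(" ".join(window))
        if start + max_tokens >= len(words):
            break  # final window covered the tail of the document
    return chunks
```

Because splits happen at word boundaries and the count is in words, each chunk's token budget is predictable even though real BPE tokenizers would count somewhat differently.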
> Practical trade-off: Token splitting gives you predictable context window consumption, but it still cuts through sentences just like fixed-size splitting. The overlap helps, yet the fundamental problem remains: the split point is determined by count, not by content. Pair this with sentence detection for the best of both worlds — which is exactly what strategy 4 does.
---
Strategy 4: Sentence-Based Splitting (Natural Boundaries)
Why cut through sentences at all? Sentence-based splitting first breaks the document into individual sentences, then groups them into chunks that stay under a size limit. Every chunk boundary falls between sentences, never inside one.
The sentence detector below handles the common cases: period-space, question marks, exclamation marks, and double newlines. It isn't as robust as spaCy's segmenter, but it covers technical writing well enough for our purposes.
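A regex-based sketch of that detector and the grouping step (names are mine; the boundary cases handled match the list above):

```python
import re

def split_sentences(text):
    """Naive sentence detector: split after ., ?, ! plus whitespace, or on blank lines."""
    parts = re.split(r'(?<=[.?!])\s+|\n\n+', text)
    return [p.strip() for p in parts if p.strip()]

def sentence_chunks(text, max_chars=500):
    """Group whole sentences into chunks that stay under max_chars."""
    chunks, current = [], ""
    for sent in split_sentences(text):
        candidate = (current + " " + sent).strip()
        if len(candidate) <= max_chars or not current:
            current = candidate  # keep an oversized lone sentence intact
        else:
            chunks.append(current)
            current = sent
    if current:
        chunks.append(current)
    return chunks
```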
The big win: no more cut sentences. Every chunk is a group of complete sentences. The downside is that chunk sizes vary more — some paragraphs naturally clump into large chunks while others barely fill a small one. For retrieval, complete sentences almost always outperform partial ones.
> Key Insight: Sentence boundaries matter most. Every strategy that respects sentence boundaries outperforms those that don't — regardless of chunk size. If you remember one rule from this tutorial, it's this: never split mid-sentence.
---
Exercise 1: Compare Chunk Statistics
Your task: Run all four strategies on our document and build a summary table. For each strategy, record the number of chunks, the average chunk length in characters, and the shortest and longest chunk.
Hints:
1. Use the min() and max() functions with key=len to find shortest/longest chunks.
2. Format the output as a simple text table with columns aligned using f-strings.
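One possible shape for the solution, taking precomputed chunk lists as input so it works with whatever strategies you've run (function name and column layout are mine):

```python
def chunk_stats_table(strategies):
    """strategies: dict mapping strategy name -> list of chunk strings."""
    header = f"{'Strategy':<20}{'Chunks':>8}{'Avg len':>10}{'Min':>6}{'Max':>6}"
    rows = [header, "-" * len(header)]
    for name, chunks in strategies.items():
        lengths = [len(c) for c in chunks]
        avg = sum(lengths) / len(lengths)
        rows.append(
            f"{name:<20}{len(chunks):>8}{avg:>10.1f}"
            f"{min(lengths):>6}{max(lengths):>6}"
        )
    return "\n".join(rows)
```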
---
Strategy 5: Semantic Chunking (Split by Meaning, Not Characters)
Every strategy so far splits by counting characters, words, or sentences. None of them ask: "Does the meaning of the text actually change here?"
Semantic chunking tries to detect topic shifts. Compute similarity between consecutive sentences, and when similarity drops below a threshold, insert a chunk boundary. Sentences about the same concept stay together; sentences that pivot to a new topic get split apart.
Without embedding models, we can approximate semantic similarity using word overlap (Jaccard similarity). Two sentences about "learning rate" and "training speed" share vocabulary; sentences about "data augmentation" and "gradient clipping" share almost none. The function below computes Jaccard similarity between consecutive sentence pairs, flags drop-off points, and groups sentences into semantically coherent chunks.
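A sketch of that Jaccard-based approach, operating on a pre-split sentence list (function names and the default threshold are mine):

```python
def jaccard(a_words, b_words):
    """Word-overlap similarity between two token lists."""
    a, b = set(a_words), set(b_words)
    return len(a & b) / len(a | b) if a | b else 0.0

def semantic_chunks(sentences, threshold=0.12):
    """Start a new chunk whenever consecutive-sentence similarity drops below threshold."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        sim = jaccard(prev.lower().split(), sent.lower().split())
        if sim < threshold:
            # Similarity drop-off: likely a topic shift, so cut here
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```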
Semantic chunking produces unevenly sized chunks — a long discussion about one topic becomes a massive chunk, while a brief mention of another becomes tiny. That variability is a feature: each chunk represents one coherent idea. Chroma Research (2024) showed semantic chunking achieving 91-92% recall versus 85-90% for fixed recursive splits.
The downside? The threshold is sensitive. Too low and everything merges into one giant chunk. Too high and every sentence becomes its own chunk. With real embedding models you get more reliable similarity scores, but the tuning challenge remains.
---
Strategy 6: Markdown/HTML-Aware Splitting (Structure-Preserving)
But what if the document already tells you where to split? Technical documents have explicit structure: headings, lists, code fences. A markdown-aware splitter uses those heading boundaries as its primary splitting signal. Each section becomes a chunk. If a section exceeds the size limit, it falls back to sentence-level splitting within that section.
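A sketch of a heading-driven splitter follows. The article's version falls back to sentence-level splitting inside oversized sections; to keep this self-contained I use a simple hard split as the fallback, which is the one deliberate simplification here.

```python
import re

def markdown_chunks(text, max_chars=800):
    """Split at markdown heading lines; each section (heading + body) becomes one chunk."""
    sections, current = [], []
    for line in text.splitlines():
        if re.match(r'^#{1,6}\s', line) and current:
            sections.append("\n".join(current).strip())
            current = [line]  # the heading starts the next section
        else:
            current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    chunks = []
    for sec in sections:
        if len(sec) <= max_chars:
            chunks.append(sec)
        else:
            # Simplified fallback: hard split (the article uses sentence-level here)
            chunks.extend(sec[i:i + max_chars] for i in range(0, len(sec), max_chars))
    return [c for c in chunks if c]
```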
Each chunk starts with its heading — the retriever sees "## Regularization Techniques" right at the top of the chunk about dropout and weight decay. That heading acts as a built-in label, improving retrieval accuracy almost for free.
I find markdown-aware splitting especially effective for internal documentation and knowledge bases where authors follow consistent heading conventions. The predictability makes debugging retrieval failures much easier — you can trace exactly which section a chunk came from.
---
Strategy 7: Hierarchical Chunking (Parent-Child Relationships)
Here's a problem with every strategy so far: when you retrieve a small, precise chunk, you lose surrounding context. The retrieved chunk says "Dropout randomly sets a fraction of neuron outputs to zero" — but the user asked about regularization in general, and the answer spans three paragraphs.
Hierarchical chunking solves this with two levels: large parent chunks for context, and smaller child chunks for precise retrieval. You search the children, but when you build the LLM prompt, you include the parent. I think of it like a book index — the index entry (child) points you to the right page, but you read the full page (parent) for context.
The retrieval workflow: search child chunks (small, precise), identify which parent each match belongs to, then pass the full parent text to the LLM. The child gives you precision in finding the right spot. The parent gives the LLM enough surrounding context to generate a complete answer.
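That workflow can be sketched with two small functions (names, sizes, and the word-overlap scoring are my assumptions; a real pipeline would score children with the retriever built earlier):

```python
def hierarchical_chunks(text, parent_size=1000, child_size=250):
    """Return (parents, children) where each child is (text, parent_id)."""
    parents = [text[i:i + parent_size] for i in range(0, len(text), parent_size)]
    children = []
    for pid, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append((parent[j:j + child_size], pid))
    return parents, children

def retrieve_with_context(query_words, parents, children):
    """Score children by word overlap, then return the best child's full parent."""
    def score(chunk):
        return len(set(query_words) & set(chunk.lower().split()))
    _, pid = max(children, key=lambda c: score(c[0]))
    return parents[pid]
```

The child match pins down the location; the returned parent is what actually goes into the LLM prompt.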
This two-level approach is used by LlamaIndex and many production RAG pipelines. The trade-off is index size — you store both levels — but storage is cheap compared to wrong answers.
---
Strategy 8: Contextual Chunking (Anthropic's Approach)
Anthropic published a technique called Contextual Retrieval in 2024 that flips the usual approach. Instead of preserving context through clever splitting, you add context to each chunk after splitting. Take any chunking strategy, then prepend a brief header explaining where the chunk came from and what it covers.
A chunk that says "Dropout rate is 0.1" is ambiguous. A chunk that says "[Section: Regularization Techniques] Dropout rate is 0.1" retrieves correctly for far more queries. This is the strategy I'm most excited about — it's simple, composable, and the results speak for themselves.
In production, you'd use an LLM to generate the context header for each chunk. For our benchmark, we approximate this by extracting section headings and matching each chunk to its most relevant section via word overlap. The code below takes any set of chunks, identifies which section each belongs to, and prepends a context header. The original content stays intact — each chunk just carries an extra label now.
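A sketch of that heading-matching approximation (the add_context_headers name matches the exercise later in this tutorial; the section-matching heuristic is my assumption):

```python
import re

def add_context_headers(document, chunks):
    """Prepend '[Section: <heading>]' to each chunk, matched by word overlap."""
    headings = re.findall(r'^#{1,6}\s*(.+)$', document, flags=re.MULTILINE)
    bodies = re.split(r'^#{1,6}\s*.+$', document, flags=re.MULTILINE)
    # Map each heading to the word set of the text that follows it
    sections = {h: set(b.lower().split()) for h, b in zip(headings, bodies[1:])}
    labeled = []
    for chunk in chunks:
        words = set(chunk.lower().split())
        if sections:
            best = max(sections, key=lambda h: len(words & sections[h]))
            labeled.append(f"[Section: {best}] {chunk}")
        else:
            labeled.append(chunk)  # no headings found: leave the chunk as-is
    return labeled
```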
Anthropic reported that contextual chunking improved retrieval accuracy by 15-20% and reduced retrieval failures by nearly 50%. The beauty of the approach: it works as a layer on top of any other strategy. Apply it to recursive splits, sentence splits, or markdown splits.
The trade-off is cost. Generating context headers with an LLM adds latency and API cost during indexing. But indexing is a one-time operation — the retrieval quality improvement pays for itself on every query.
> Key Insight: Context headers are essentially free metadata. You don't change your splitting logic, don't retrain anything, and don't touch your retrieval algorithm. You just prepend a short label to each chunk. That's one of the highest-return, lowest-effort improvements in all of RAG.
---
Exercise 2: Retrieval Benchmark
Your task: Build a benchmark that tests all 8 chunking strategies on the same questions. For each question, we know which section contains the answer. A strategy "wins" for a question if the correct section keyword appears in the top-3 retrieved chunks.
The code below defines 8 test questions with expected answer sections, runs each strategy through the retriever, and scores hit rate. This is the head-to-head comparison we've been building toward.
The score_strategy helper indexes a set of chunks, fires each question at the retriever, and counts how many times the correct section keyword appears in the top-3 results. We'll reuse this function in Exercise 3 and the final benchmark.
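A self-contained sketch of that helper follows. The score_strategy name and the top-3 hit criterion come from the text; I inline a small bag-of-words scorer so the sketch runs standalone rather than depending on the retriever cell.

```python
import math
from collections import Counter

def _cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def score_strategy(chunks, questions, top_k=3):
    """questions: list of (query, expected_keyword). Returns the hit rate in top_k."""
    vectors = [Counter(c.lower().split()) for c in chunks]
    hits = 0
    for query, keyword in questions:
        qvec = Counter(query.lower().split())
        ranked = sorted(zip(chunks, vectors),
                        key=lambda cv: _cosine(qvec, cv[1]), reverse=True)
        top = [c for c, _ in ranked[:top_k]]
        if any(keyword.lower() in c.lower() for c in top):
            hits += 1
    return hits / len(questions)
```

Scoring all strategies is then a loop over a dict of name-to-chunks pairs, calling score_strategy on each.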
Now we build a dictionary mapping strategy names to their chunk lists, and score each one against our test questions.
A few patterns should emerge from the results. Markdown-aware splitting tends to score well because each chunk is a complete section with its heading intact — which matches query keywords. Contextual chunking benefits from the added section labels. Fixed-size splitting struggles when the answer sentence lands near a chunk boundary.
This is a simplified benchmark using bag-of-words similarity. With real embedding models, relative rankings can shift — but the lesson holds: chunk boundaries and chunk context affect retrieval quality at least as much as the choice of embedding model.
---
Exercise 3: Improve Retrieval with Contextual Headers
Your task: Apply contextual headers to the sentence-based chunks and re-run the benchmark. Compare hit rate of plain sentence chunks versus contextualized sentence chunks.
Hints:
1. Use add_context_headers(DOCUMENT, sentence_chunks) to create the contextualized version.
2. Run the same benchmark loop for both versions.
Predict first: do you expect the contextual version to do better, worse, or about the same?
In our bag-of-words setup, contextual headers add section-name words to each chunk, directly increasing keyword overlap with relevant queries. With real embedding models, the improvement comes from a different mechanism — the embeddings capture more of the document's meaning, making vectors more distinctive.
Either way, adding context to chunks is one of the highest-return, lowest-effort improvements you can make to any RAG pipeline.
---
The Benchmark: Same Document, Same Queries, 8 Strategies
Time to put it all together. I'll now run all 8 strategies, score them on test questions, and add chunk statistics — how many chunks each produces, their average size, and the retrieval hit rate.
This is the table you'd build for your own documents when choosing a chunking strategy. The "best" strategy depends on your document structure and query patterns, but this gives you a framework for making that decision empirically.
Choosing a strategy — a practical decision framework:
| Situation | Recommended Strategy |
|---|---|
| Quick prototype, any document type | Recursive (strategy 2) |
| Documents with clear headings (docs, READMEs) | Markdown-aware (strategy 6) |
| Need precise context window control | Token-based (strategy 3) |
| Complex questions spanning multiple paragraphs | Hierarchical (strategy 7) |
| Already have a pipeline, want easy improvement | Add contextual headers (strategy 8) |
| Academic papers with dense information | Sentence-based (strategy 4) + contextual |
| Budget for LLM-based indexing | Contextual chunking on any base strategy |
---
One more data point worth knowing: a NAACL 2025 study found that simple fixed 200-word chunks matched or beat semantic chunking on several benchmarks when paired with a strong embedding model. The researchers argued that complex chunking provides diminishing returns once the embedding model is good enough. That doesn't mean chunking doesn't matter — it means the interaction between chunking strategy and embedding quality matters more than either one alone.
Common Mistakes and How to Fix Them
Mistake 1: Using fixed-size splitting without overlap on structured documents
Splitting without overlap guarantees that sentences straddling a boundary get cut in half. Both halves become meaningless fragments that match nothing.
Fix: Always use at least 10-20% overlap. For 500-character chunks, 50-100 characters of overlap is standard. Better yet, use sentence-based splitting so you don't need overlap at all.
Mistake 2: Setting semantic chunking threshold too low
With a low similarity threshold, nearly every sentence pair scores above it, and your entire document merges into one or two massive chunks. You end up with chunks that are 5,000+ characters and match every query equally poorly.
Fix: Start with a threshold of 0.1-0.15 for word-overlap metrics, or 0.3-0.5 for real embedding similarity. Always inspect chunk count and average size before deploying. If you have fewer than 5 chunks from a multi-page document, your threshold is too low.
---
Mistake 3: Ignoring document structure (chunking markdown as plain text)
Treating a markdown document the same as a plain text transcript wastes valuable structural information. Headings, bullet lists, and code fences are explicit signals about where topics begin and end.
Fix: Use markdown-aware splitting for structured documents. For unstructured text (transcripts, chat logs), fall back to sentence-based or semantic splitting.
Mistake 4: Not testing chunk quality before deploying
This one's surprisingly common. Teams spend days building a RAG pipeline, deploy it, and only discover chunking problems when users complain about bad answers. By then, debugging is a nightmare because the problem could be chunking, embedding, retrieval, or generation.
Fix: Before deploying, take 5-10 representative questions. Manually identify which part of the source document contains each answer. Run your chunking strategy and verify the answer text lands intact within a single chunk. If it's split across two chunks for more than 20% of your test questions, adjust your strategy.
Complete Code
Below is the complete benchmark combining all 8 strategies. Since this is a notebook with shared state, every function defined above is still available. This final cell regenerates all chunks and runs the full comparison.
Frequently Asked Questions
What chunk size should I use for RAG?
Start with 200-500 tokens (roughly 800-2000 characters). Smaller chunks give better retrieval precision but require more chunks to cover a document. Larger chunks provide more context per result but reduce precision. Benchmark on your actual queries — there's no universal best size.
Can I combine multiple chunking strategies?
Absolutely, and you often should. A common production pattern is markdown-aware splitting as the base with contextual headers on top. Another combination: hierarchical chunking for context with sentence-based children for precision. The strategies in this tutorial are composable building blocks, not mutually exclusive choices.
---
Does chunking matter if I use a long-context model like GPT-4 Turbo?
Yes. Even with 128K-token context windows, you still need chunking for two reasons. Stuffing the entire document set into the prompt is expensive and slow. And retrieval precision still matters — including 50 irrelevant chunks alongside the relevant one can decrease answer quality compared to retrieving just the top 3.
How do I evaluate chunking quality without a full benchmark?
Quick sanity check: take 5 questions you expect your RAG system to answer, manually identify which part of the source document contains the answer, then check whether your chunking keeps that passage intact within a single chunk. If the answer is split across two chunks for more than 1 of 5 questions, your strategy needs adjustment.
---
Is semantic chunking always better than recursive splitting?
No. A NAACL 2025 study found that simple fixed-size chunks (200 words) matched semantic chunking on several benchmarks when paired with a strong embedding model. Semantic chunking helps most when the embedding model is weaker or documents have dramatic topic shifts. For well-structured documents with clear headings, markdown-aware splitting often outperforms both.
References
- Anthropic — Introducing Contextual Retrieval (2024). Blog post
- LangChain Documentation — Text Splitters. Docs
- Chroma Research — Chunking Strategies for RAG Retrieval (2024). Report
- NVIDIA Technical Blog — Document Chunking for Large-Scale RAG (2024). Blog
- Kamradt, G. — 5 Levels of Text Splitting for Retrieval (2023). YouTube
- NAACL 2025 — Revisiting Chunking Strategies for Retrieval-Augmented Generation. Conference proceedings.
- Pinecone Documentation — Chunking Strategies. Docs
- LlamaIndex Documentation — Node Parsers and Text Splitters. Docs