LangChain Text Splitters: Chunking Strategies for Optimal Retrieval Quality


You loaded 200 pages of documentation into your RAG pipeline and asked a specific question. The answer came back vague, stitched together from unrelated paragraphs. The problem isn't your embedding model or your vector store — it's how you split the text. Bad chunks produce bad retrievals, and no amount of prompt engineering can fix that.

This tutorial walks through every major text splitting strategy in LangChain, shows you exactly when each one shines (and when it fails), and gives you a framework to measure chunk quality before you ever plug it into a retrieval pipeline.

Why Text Splitting Matters for RAG

When you embed a document for retrieval-augmented generation (RAG), you're converting text into numerical vectors that capture semantic meaning. But you can't embed an entire 50-page PDF as one vector — the meaning gets diluted. A single vector trying to represent 50 pages of content captures nothing well.

So you split the document into smaller pieces — chunks — and embed each one separately. When a user asks a question, you find the chunks whose vectors are closest to the question's vector, then feed those chunks to the LLM as context.

Here's where it gets tricky. I've seen teams spend weeks tuning their embedding model when the real problem was that their chunks were splitting sentences in half, or mixing two unrelated topics in the same chunk. The quality of your splits directly determines the quality of your retrievals.

LangChain provides several text splitters, each designed for different document types and use cases. We'll work through them from simplest to most sophisticated — starting with the naive baseline so you understand what it gets wrong, then moving to the one I reach for 80% of the time.

CharacterTextSplitter — The Naive Baseline

The simplest splitter cuts text at a fixed character count. It's the equivalent of slicing a string every N characters — fast, predictable, and often terrible for retrieval quality. I'm showing it first because understanding why it fails teaches you what to look for in better splitters.

CharacterTextSplitter — splitting on newlines

The output shows chunks split at every newline boundary, with each chunk kept under 120 characters. The separator parameter controls where the splitter is allowed to cut. With "\n", it can only cut at line breaks.

The problem? This splitter doesn't understand content structure. It will happily split a paragraph mid-thought if the character count demands it. And with chunk_overlap=0, a sentence split across two chunks loses context in both.

RecursiveCharacterTextSplitter — The Default Choice

This is the splitter I use by default, and the LangChain docs recommend it too. Instead of a single separator, it tries a hierarchy of separators — splitting first by paragraphs, then by sentences, then by words, and finally by characters. It works through the list until chunks are small enough.

RecursiveCharacterTextSplitter with default separators

The separator hierarchy ["\n\n", "\n", ". ", " ", ""] tells the splitter: "First try splitting on blank lines (paragraph breaks). If a chunk is still too big, try single newlines. Still too big? Try sentence boundaries. Then word boundaries. As a last resort, split on every character."

That chunk_overlap=50 means the last 50 characters of one chunk also appear at the start of the next chunk. This overlap is the critical detail most tutorials skip — without it, a sentence split across two chunks is incomplete in both. With overlap, the boundary region is duplicated, so the full sentence exists in at least one chunk.

How the Recursive Logic Works Step by Step

Let's trace through the algorithm to build intuition. I find that people who understand this step-by-step never misconfigure chunk sizes.

Tracing the recursive splitting logic
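To make the control flow concrete, here is a simplified, dependency-free re-implementation of the idea. It skips overlap and re-merging, so it is not LangChain's actual source, just the recursive skeleton:

```python
def recursive_split(text, max_size, separators):
    """Simplified recursive splitting: no overlap, no re-merging."""
    if len(text) <= max_size or not separators:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: fixed-size character slices.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    if sep not in text:
        # This separator never occurs; fall through to the next one.
        return recursive_split(text, max_size, rest)
    pieces = []
    for part in text.split(sep):
        if len(part) <= max_size:
            if part:
                pieces.append(part)
        else:
            # Too big: retry this piece with the remaining separators.
            pieces.extend(recursive_split(part, max_size, rest))
    return pieces

doc = "First paragraph here.\n\nSecond paragraph. It has two sentences.\n\nThird."
for chunk in recursive_split(doc, 30, ["\n\n", "\n", ". ", " ", ""]):
    print(repr(chunk))
```

The short paragraphs survive the paragraph-level split untouched; only the oversized middle paragraph falls through to the sentence-level separator.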

The beauty of this approach: it always prefers the largest natural boundary. Your chunks end at paragraph breaks when possible, sentence breaks when necessary, and word breaks only as a fallback. This preserves semantic coherence far better than fixed-character splitting.

Chunk Size and Overlap — The Parameters That Matter Most

Choosing chunk_size and chunk_overlap is the single most impactful decision in your chunking pipeline. I've watched people agonize over embedding models when their chunk size was the actual bottleneck.

Comparing chunk sizes on the same document

More chunks mean more embeddings to store and search through. Fewer chunks mean each embedding represents a larger, less focused region of text.

Here's the practical guidance I give to anyone starting a RAG project:

| Scenario | Chunk Size | Overlap | Why |
| --- | --- | --- | --- |
| Short, factual docs (FAQs, glossaries) | 200-400 chars | 0-20 | Each entry is self-contained |
| Technical documentation | 500-1000 chars | 50-100 | Paragraphs carry full ideas |
| Long-form articles or books | 1000-2000 chars | 200-400 | Need context around each idea |
| Legal or regulatory text | 500-800 chars | 100-200 | Precision matters, clauses are dense |

What Happens When Chunks Are Too Small

When chunks are too small, individual chunks lack context. A chunk that says "It achieves 95% accuracy" without mentioning what "it" refers to is useless for retrieval. The embedding captures the idea of accuracy, but the LLM receiving this chunk can't produce a meaningful answer.

Too-small chunks lose meaning

With 80-character chunks, "It achieves 95% accuracy" lands in its own chunk — orphaned from the subject. With 250-character chunks, the full concept of random forests and their accuracy stays together.

Token-Based Splitting — Aligning with LLM Context Windows

Character counts are a rough proxy for chunk size, but LLMs don't consume characters — they consume tokens. A 500-character chunk might be 100 tokens or 150, depending on the text. When you need precise control over how many tokens each chunk uses in your LLM's context window, split by tokens directly.

Token-based splitting with tiktoken

The from_tiktoken_encoder class method creates a splitter that uses OpenAI's tiktoken tokenizer to measure chunk size. The encoding_name="cl100k_base" matches the tokenizer used by GPT-4 and GPT-3.5-turbo. If you're using a different model, match the tokenizer accordingly.

When should you use token-based splitting over character-based? In my experience, it matters most when you're packing retrieved chunks into a prompt with a tight token budget. If your LLM context window is 4096 tokens and you're reserving 1000 for the system prompt and response, you need to know exactly how many tokens each chunk consumes.

Structure-Aware Splitting — Markdown and HTML

Technical documentation, README files, and web content have explicit structure — headings, code blocks, lists. A generic splitter ignores all of this. MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter use document structure as the primary splitting signal, which is a massive upgrade for structured documents.

MarkdownHeaderTextSplitter

MarkdownHeaderTextSplitter preserves heading hierarchy

Each chunk gets metadata showing exactly where it falls in the document hierarchy — {"h1": "Machine Learning Guide", "h2": "Supervised Learning", "h3": "Classification"}. This metadata is gold for retrieval. When a user asks about classification, you can filter by metadata before doing semantic search, or include the heading path in the chunk text for richer embeddings.

One subtlety that tripped me up early: MarkdownHeaderTextSplitter does not enforce a maximum chunk size. If a section under a heading is 10,000 characters, you get one 10,000-character chunk. You'll often want to chain it with RecursiveCharacterTextSplitter to break long sections into smaller pieces while preserving the heading metadata.

Chaining MarkdownHeaderTextSplitter with RecursiveCharacterTextSplitter

This two-step approach — structural split first, size enforcement second — is my go-to pipeline for any markdown-based documentation. The heading metadata carries through to the final chunks, so you get both size control and structural context.

HTMLHeaderTextSplitter

The HTML variant works the same way, but splits on HTML heading tags instead of markdown headers.

HTMLHeaderTextSplitter for web content

If you're ingesting web-scraped content or HTML documentation, this splitter preserves the page structure that RecursiveCharacterTextSplitter alone would ignore.

Code-Aware Splitting — Language-Specific Separators

Splitting source code with a generic text splitter is a recipe for disaster. A function split in half is worse than useless — it's misleading. LangChain provides language-specific separators that understand code structure.

Language-aware splitting for Python code

The Language.PYTHON setting replaces the generic separator hierarchy with Python-specific separators: class definitions, function definitions, and decorators become the primary split points. The splitter tries to keep each function or class method as a complete unit.

LangChain supports separators for many languages. Here are the ones I find most useful:

| Language | Enum Value | Primary Split Points |
| --- | --- | --- |
| Python | Language.PYTHON | class, def, decorators |
| JavaScript | Language.JS | function, class, const/let/var |
| TypeScript | Language.TS | interface, class, function, type |
| Go | Language.GO | func, type, package |
| Java | Language.JAVA | class, public/private methods |
| Markdown | Language.MARKDOWN | Heading levels (#, ##, ###) |

Semantic Splitting — Letting Embeddings Decide Where to Cut

Every splitter we've seen so far uses rules — character counts, separators, heading tags. Semantic splitting takes a fundamentally different approach: it uses an embedding model to detect where the topic changes, and splits there.

The idea is intuitive. Embed consecutive sentences. When the cosine similarity between adjacent sentence embeddings drops sharply, that's a topic boundary — a natural place to split.

SemanticChunker detects topic boundaries
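The real implementation lives in `langchain_experimental` (`SemanticChunker(embeddings, breakpoint_threshold_type="percentile")`) and needs an embedding model behind it. To show just the mechanism, here is a dependency-free toy where hand-picked vectors stand in for sentence embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for five consecutive sentences: two about one topic,
# two about another, one about a third.
sentence_vecs = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.1, 1.0, 0.1],
    [0.2, 0.9, 0.0],
    [0.0, 0.1, 1.0],
]

# Split wherever similarity between adjacent sentences drops below a threshold.
threshold = 0.5
boundaries = [
    i + 1
    for i in range(len(sentence_vecs) - 1)
    if cosine(sentence_vecs[i], sentence_vecs[i + 1]) < threshold
]
print(boundaries)  # sentence indices where a new chunk should begin
```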

The text above has three clear topic shifts — from Python history to machine learning to transformers. A semantic splitter should detect these boundaries automatically, producing three chunks that each cover a coherent topic.

The breakpoint_threshold_type parameter controls how the splitter decides where to cut:

| Threshold Type | How It Works | When to Use |
| --- | --- | --- |
| "percentile" | Splits where similarity falls below the Nth percentile | Most common; good default |
| "standard_deviation" | Splits where similarity drops by N standard deviations | Works well on uniform-length sentences |
| "interquartile" | Splits at outlier low-similarity points (IQR method) | Robust to noisy sentence embeddings |

Comparing Splitters Side by Side

Knowing each splitter individually is useful. Knowing when to pick one over another is what actually matters. Here's a comparison framework I use on every RAG project.

Side-by-side comparison of splitter behaviors

Running this on the same document reveals the differences clearly. CharacterTextSplitter produces the most uniform chunks but splits mid-sentence. RecursiveCharacterTextSplitter respects sentence boundaries. MarkdownHeaderTextSplitter produces the most semantically coherent chunks but with variable sizes.

Here's the decision framework in a nutshell:

| Document Type | Recommended Splitter | Why |
| --- | --- | --- |
| Plain text, unstructured | RecursiveCharacterTextSplitter | Respects natural boundaries |
| Markdown docs, README files | MarkdownHeaderTextSplitter + Recursive | Preserves heading structure |
| HTML pages, web scrapes | HTMLHeaderTextSplitter + Recursive | Uses page structure |
| Source code | RecursiveCharacterTextSplitter.from_language() | Keeps functions intact |
| Mixed-topic long documents | SemanticChunker | Detects topic shifts |
| Token-budget-sensitive apps | RecursiveCharacterTextSplitter.from_tiktoken_encoder() | Precise token control |

Real-World Pipeline: Splitting a Technical Documentation Set

Let's put everything together. You're building a RAG system over a set of markdown documentation files. Each file has headings, code blocks, and prose. You need chunks that are small enough for precise retrieval, tagged with their source location, and bounded to a token budget.

Production-ready markdown splitting pipeline

Each output chunk now carries metadata: the filename, the heading hierarchy, and a text body bounded to 800 characters. This is exactly what you'd feed to an embedding model and then into a vector store. The metadata enables filtered retrieval — "only search within the Installation Guide" — without re-embedding.


Build a Chunk Quality Analyzer
Write Code

Write a function analyze_chunks(texts: list[str]) -> dict that takes a list of text chunks and returns a dictionary with these statistics:

  • "count": total number of chunks
  • "avg_length": average character length (rounded to nearest integer)
  • "min_length": length of the shortest chunk
  • "max_length": length of the longest chunk
  • "short_chunks": count of chunks with fewer than 50 characters
  • "long_chunks": count of chunks with more than 500 characters
This is the kind of quality check you'd run after splitting documents to catch bad configurations.


Common Mistakes and How to Fix Them

I've debugged chunking pipelines for enough projects to see the same mistakes repeated. Here are the top five.

Mistake 1: Zero Overlap

Zero overlap — sentences split at boundaries

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,  # no overlap!
)
# Chunk 1 ends: "...the model achieves"
# Chunk 2 starts: "95% accuracy on test data."
# Neither chunk has the full sentence.

With overlap — boundary sentences preserved

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=40,  # 20% overlap
)
# Chunk 1 ends: "...the model achieves 95% accuracy on test data."
# Chunk 2 starts: "achieves 95% accuracy on test data. The next..."
# Full sentence exists in at least one chunk.

Mistake 2: Using Character Splitting for Structured Documents

If your document has headings, use a structure-aware splitter. Splitting a markdown doc with RecursiveCharacterTextSplitter alone throws away all the heading context that would make your chunks more searchable.

Mistake 3: Not Inspecting Your Chunks

This is the most common mistake. People configure a splitter, pipe the chunks directly to an embedding model, and never look at what the chunks actually contain. Always print a sample of your chunks and read them. If a chunk doesn't make sense to you when you read it in isolation, it won't make sense to the embedding model either.

Always inspect a sample of your chunks
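A small helper of the sort I mean, in plain Python with no LangChain dependency (the function name and sample chunks are just for illustration):

```python
import random

def inspect_chunks(chunks, sample_size=3, seed=0):
    """Print a reproducible random sample of chunks for manual review."""
    rng = random.Random(seed)
    sample = rng.sample(chunks, min(sample_size, len(chunks)))
    for i, chunk in enumerate(sample):
        print(f"--- sample {i} ({len(chunk)} chars) ---")
        print(chunk)
    return sample

chunks = [
    "Installation requires Python 3.10 or newer.",
    "It achieves 95% accuracy",  # reads badly in isolation: a red flag
    "Overlap duplicates the boundary region between chunks.",
]
inspect_chunks(chunks, sample_size=2)
```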

Mistake 4: Chunk Size Way Too Small

I see beginners set chunk_size=100 thinking smaller chunks mean more precise retrieval. In practice, 100-character chunks lack context and produce poor embeddings. Unless you're working with very short, self-contained items (like FAQ answers), keep chunks above 200 characters.

Mistake 5: Ignoring Metadata

Structure-aware splitters produce metadata (heading hierarchy, source file) that you should preserve and index. Metadata enables filtered retrieval — searching only within a specific document section — which is often more effective than pure vector similarity across your entire corpus.


Implement a Simple Recursive Text Splitter
Write Code

Implement a function recursive_split(text: str, max_size: int, separators: list[str]) -> list[str] that mimics the recursive splitting logic.

Rules:

1. Try each separator in order. Split the text using the first separator that appears in the text.

2. After splitting, check each piece. If a piece is within max_size, keep it.

3. If a piece exceeds max_size, recursively split it using the remaining separators (starting from the next one).

4. If no separators work (or the list is exhausted), return the text as-is (even if it exceeds max_size).

5. Remove empty strings from the result.

Do NOT implement overlap — just the core recursive logic.



Frequently Asked Questions

How do I choose between character-based and token-based splitting?

Use character-based splitting as your default — it's faster and simpler. Switch to token-based splitting when you need precise control over how many tokens each chunk consumes in your LLM's context window. This matters when your prompt template has a tight token budget and you need to fit exactly N chunks.

Can I create a custom splitter for my document format?

Yes. Subclass TextSplitter from langchain_text_splitters and override the split_text method. For most cases, though, you can get away with customizing the separators list on RecursiveCharacterTextSplitter.

Custom separators for non-standard document formats

How does chunk overlap affect storage and cost?

Overlap duplicates text across chunks, which means more chunks total. With 20% overlap, you get roughly 25% more chunks than with zero overlap. This increases embedding storage and vector search time proportionally. For most applications this is a worthwhile tradeoff — the improvement in retrieval quality from not losing boundary context outweighs the extra cost.

Should I include metadata in the chunk text before embedding?

Sometimes. Prepending the heading hierarchy to the chunk text (e.g., "Machine Learning Guide > Supervised Learning > Classification: ...") gives the embedding model more context. This helps when section headings carry important topical information. But it also increases chunk length and can dilute the semantic signal for very short chunks.

References

  • LangChain Documentation — Text Splitters
  • LangChain API Reference — RecursiveCharacterTextSplitter
  • LangChain Documentation — MarkdownHeaderTextSplitter
  • LangChain Documentation — HTMLHeaderTextSplitter
  • LangChain Experimental — SemanticChunker
  • OpenAI — tiktoken tokenizer library
  • Pinecone — Chunking Strategies for LLM Applications