
LangChain Text Splitters: Chunking Strategies for Optimal Retrieval Quality

Intermediate · 60 min · 2 exercises · 60 XP

You loaded 200 pages of documentation into your RAG pipeline and asked a specific question. The answer came back vague, stitched together from unrelated paragraphs. The problem isn't your embedding model or your vector store — it's how you split the text. Bad chunks produce bad retrievals, and no amount of prompt engineering can fix that.

This tutorial walks through every major text splitting strategy in LangChain, shows you exactly when each one shines (and when it fails), and gives you a framework to measure chunk quality before you ever plug it into a retrieval pipeline.

Why Text Splitting Matters for RAG

When you embed a document for retrieval-augmented generation (RAG), you're converting text into numerical vectors that capture semantic meaning. But you can't embed an entire 50-page PDF as one vector — the meaning gets diluted. A single vector trying to represent 50 pages of content captures nothing well.

So you split the document into smaller pieces — chunks — and embed each one separately. When a user asks a question, you find the chunks whose vectors are closest to the question's vector, then feed those chunks to the LLM as context.

I've seen teams spend weeks tuning their embedding model when the real problem was that their chunks were splitting sentences in half, or mixing two unrelated topics in the same chunk. The quality of your splits directly determines the quality of your retrievals.


LangChain provides several text splitters, each designed for different document types and use cases. We'll work through them from simplest to most sophisticated — starting with the naive baseline so you understand what it gets wrong, then moving to the one I reach for 80% of the time.

CharacterTextSplitter — The Naive Baseline

The simplest splitter cuts text at a single fixed separator and packs the pieces up to a character limit. It's barely smarter than slicing a string every N characters — fast, predictable, and often terrible for retrieval quality. I'm showing it first because understanding why it fails teaches you what to look for in better splitters.

The code below creates a CharacterTextSplitter that splits a short ML overview on newline characters (separator="\n"), keeping each chunk under 120 characters. The chunk_overlap=0 setting means no text is shared between adjacent chunks.

CharacterTextSplitter — splitting on newlines

The output shows chunks split at every newline boundary, with each chunk kept under 120 characters. The separator parameter controls where the splitter is allowed to cut. With "\n", it can only cut at line breaks.

RecursiveCharacterTextSplitter — The Default Choice

This is the splitter I use by default, and the LangChain docs recommend it too. Instead of a single separator, it tries a hierarchy of separators — splitting first by paragraphs, then by sentences, then by words, and finally by characters. It works through the list until chunks are small enough.

The example below passes a multi-section ML document through RecursiveCharacterTextSplitter with a 250-character limit and 50-character overlap. The separators list defines the split hierarchy: double-newlines first (paragraph breaks), then single newlines, sentence boundaries, spaces, and finally individual characters as a last resort.

RecursiveCharacterTextSplitter with default separators

The separator hierarchy ["\n\n", "\n", ". ", " ", ""] tells the splitter: "First try splitting on blank lines (paragraph breaks). If a chunk is still too big, try single newlines. Still too big? Try sentence boundaries. Then word boundaries. As a last resort, split on every character."

How the Recursive Logic Works Step by Step

The recursive algorithm has a simple loop: try the first separator, check if all resulting pieces are small enough, and if any piece is still too big, apply the next separator to just that piece. The code below traces this on a short sample string — three paragraphs separated by \n\n, where the second paragraph exceeds the 60-character limit and needs a finer-grained split on ". " boundaries.

Tracing the recursive splitting logic
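You can trace the idea by hand in plain Python. This is a simplified imitation of the algorithm, not LangChain's actual implementation, and the sample string is made up to match the description above:

```python
# Hand-tracing one pass of the recursive logic on a made-up sample.
sample = (
    "Short intro paragraph.\n\n"
    "This middle paragraph runs past the sixty character limit. "
    "It needs a finer split.\n\n"
    "Short closing paragraph."
)
MAX = 60

# Step 1: split on the coarsest separator, "\n\n".
level1 = sample.split("\n\n")
for p in level1:
    print(f"{len(p):>3} chars, {'OK' if len(p) <= MAX else 'TOO BIG'}: {p[:40]!r}")

# Step 2: only the oversized piece gets the next separator, ". ".
final = []
for p in level1:
    if len(p) <= MAX:
        final.append(p)
    else:
        final.extend(s for s in p.split(". ") if s)

print("\nFinal chunks:")
for c in final:
    print(f"{len(c):>3} chars: {c!r}")
```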

The beauty of this approach: it always prefers the largest natural boundary. Your chunks end at paragraph breaks when possible, sentence breaks when necessary, and word breaks only as a fallback. This preserves semantic coherence far better than fixed-character splitting.

Chunk Size and Overlap — The Parameters That Matter Most

Choosing chunk_size and chunk_overlap is the single most impactful decision in your chunking pipeline. I've watched people agonize over embedding models when their chunk size was the actual bottleneck.

To see the effect concretely, the code below runs a simulated 10,000-character document through four different chunk configurations — from aggressive 100-character chunks up to coarse 2000-character chunks — and prints how many pieces each one produces.

Comparing chunk sizes on the same document

More chunks mean more embeddings to store and search through. Fewer chunks mean each embedding represents a larger, less focused region of text.

What Happens When Chunks Are Too Small

When chunks are too small, individual chunks lack context. A chunk that says "It achieves 95% accuracy" without mentioning what "it" refers to is useless for retrieval. The embedding captures the idea of accuracy, but the LLM receiving this chunk can't produce a meaningful answer.

The contrast becomes obvious when you split the same paragraph at 80 characters versus 250 characters. The small splitter orphans the accuracy claim from its subject ("random forests"), while the larger splitter keeps the entire concept together as a single retrievable unit.

Too-small chunks lose meaning

With 80-character chunks, "It achieves 95% accuracy" lands in its own chunk — orphaned from the subject. With 250-character chunks, the full concept of random forests and their accuracy stays together.

Token-Based Splitting — Aligning with LLM Context Windows

Character counts are a rough proxy for chunk size, but LLMs don't consume characters — they consume tokens. A 500-character chunk might be 100 tokens or 150, depending on the text. When you need precise control over how many tokens each chunk uses in your LLM's context window, split by tokens directly.

The from_tiktoken_encoder class method creates a splitter that uses OpenAI's tiktoken library to count tokens instead of characters. The code below sets a 100-token limit with 20-token overlap, using the cl100k_base encoding that matches GPT-4 and GPT-3.5-turbo. Expect each chunk to be roughly 400-500 characters — but the exact boundary depends on token boundaries, not character counts.

Token-based splitting with tiktoken

When should you use token-based splitting over character-based? In my experience, it matters most when you're packing retrieved chunks into a prompt with a tight token budget. If your LLM context window is 4096 tokens and you're reserving 1000 for the system prompt and response, you need to know exactly how many tokens each chunk consumes.

Structure-Aware Splitting — Markdown and HTML

Technical documentation, README files, and web content have explicit structure — headings, code blocks, lists. A generic splitter ignores all of this. MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter use document structure as the primary splitting signal, which is a massive upgrade for structured documents.

MarkdownHeaderTextSplitter

This splitter parses markdown headings and splits the document at each heading boundary. You tell it which heading levels to split on via headers_to_split_on — a list of (marker, label) tuples. Each resulting chunk gets metadata recording its position in the heading hierarchy, so you know that a chunk came from "Machine Learning Guide > Supervised Learning > Classification."

The code below splits a nested ML guide on #, ##, and ### headings, then prints each chunk's content alongside its metadata. You'll see that the splitter groups all content under a heading into a single chunk.

MarkdownHeaderTextSplitter preserves heading hierarchy

Each chunk gets metadata showing exactly where it falls in the document hierarchy — {"h1": "Machine Learning Guide", "h2": "Supervised Learning", "h3": "Classification"}. This metadata is gold for retrieval. When a user asks about classification, you can filter by metadata before doing semantic search, or include the heading path in the chunk text for richer embeddings.

The two-step approach below handles this: first split by headings to capture structure, then enforce a 250-character size limit on each section. The heading metadata carries through to the final chunks, so you get both size control and structural context. This is my go-to pipeline for any markdown-based documentation.

Chaining MarkdownHeaderTextSplitter with RecursiveCharacterTextSplitter

HTMLHeaderTextSplitter

The HTML variant works the same way but uses <h1>, <h2>, <h3> tags instead of markdown # markers. If you're ingesting web-scraped content or HTML API documentation (loaded via LangChain document loaders), this splitter preserves the page structure that RecursiveCharacterTextSplitter alone would ignore.

The code below splits an HTML API reference into chunks at each heading boundary. Each chunk inherits metadata recording its heading path — {"h1": "API Reference", "h2": "Endpoints", "h3": "GET /users"} — so you can filter retrievals by API section.

HTMLHeaderTextSplitter for web content

Code-Aware Splitting — Language-Specific Separators

Splitting source code with a generic text splitter is a recipe for disaster. A function split in half is worse than useless — it's misleading. LangChain provides language-specific separators that understand code structure.

The Language.PYTHON setting below replaces the generic separator hierarchy with Python-specific separators: class definitions, def declarations, and decorators become the primary split points. The splitter tries to keep each function or method as a complete unit within the 200-character limit.

Language-aware splitting for Python code

JSON Splitting — Preserving Nested Structure

API responses, configuration files, and data exports often come as JSON. Splitting JSON with a generic text splitter will break the nesting structure — you might end up with an opening brace in one chunk and the closing brace in another, producing invalid JSON fragments that confuse both embeddings and downstream parsing.

LangChain's RecursiveJsonSplitter traverses the JSON tree and splits at natural boundaries — object keys at a given nesting depth — while keeping each chunk as valid JSON. The max_chunk_size parameter controls how large (in characters) each resulting JSON fragment can be.

RecursiveJsonSplitter preserves valid JSON structure

Each output chunk is valid JSON that preserves the nesting context. If you're building a RAG system over API documentation or structured data exports, this is far more reliable than trying to split the serialized JSON string with RecursiveCharacterTextSplitter.

Semantic Splitting — Letting Embeddings Decide Where to Cut

Every splitter we've seen so far uses rules — character counts, separators, heading tags. Semantic splitting takes a fundamentally different approach: it uses an embedding model to detect where the topic changes, and splits there.

The code below uses SemanticChunker from langchain_experimental with OpenAI embeddings. The breakpoint_threshold_type="percentile" setting means it splits wherever the inter-sentence similarity falls below the 25th percentile (the bottom quartile of all similarities). The sample text has three clear topic shifts — Python history, machine learning, and transformers — so we expect three chunks.

SemanticChunker detects topic boundaries

The breakpoint_threshold_type parameter controls how the splitter decides where to cut. Each mode uses a different statistical method to identify "significant" drops in similarity.

Comparing Splitters Side by Side

Knowing each splitter individually is useful. Knowing when to pick one over another is what actually matters. The code below runs three splitters — CharacterTextSplitter, RecursiveCharacterTextSplitter, and MarkdownHeaderTextSplitter — on the same 700-character data preprocessing document, then prints chunk count, average length, min/max, and standard deviation for each.

Side-by-side comparison of splitter behaviors

Running this on the same document reveals the differences clearly. CharacterTextSplitter produces the most uniform chunks but splits mid-sentence. RecursiveCharacterTextSplitter respects sentence boundaries. MarkdownHeaderTextSplitter produces the most semantically coherent chunks but with variable sizes.

Real-World Pipeline: Splitting a Technical Documentation Set

Time to put everything together. You're building a RAG system over a set of markdown documentation files. Each file has headings, code blocks, and prose. You need chunks that are small enough for precise retrieval, tagged with their source location, and bounded to a manageable size.

The split_markdown_docs function below implements a three-step pipeline: first, split each document by headings using MarkdownHeaderTextSplitter to capture structure; second, enforce an 800-character size limit using RecursiveCharacterTextSplitter; third, attach the source filename as metadata so every chunk traces back to its origin file.

Production-ready markdown splitting pipeline

Calling the pipeline on a sample installation guide produces chunks that each carry metadata — the filename and the heading hierarchy — plus a text body bounded to 800 characters. This is exactly what you'd feed to an embedding model and then into a vector store.


The metadata enables filtered retrieval — "only search within the Installation Guide" — without re-embedding. This matters when your corpus grows to hundreds of documents.

Document-Type Routing — Picking the Right Splitter Automatically

In production, your ingestion pipeline receives different document types: markdown docs, HTML pages, Python source files, JSON configs. Using a single splitter for all of them means some will chunk poorly. A routing function that picks the right splitter based on file extension solves this cleanly.

Document-type routing function

This pattern is how I wire up every production ingestion pipeline. The 20% overlap rule applies by default, and each document type gets the splitter that understands its structure. Extend the elif branches as you add new document types to your corpus.


Build a Chunk Quality Analyzer
Write Code

Write a function analyze_chunks(texts: list[str]) -> dict that takes a list of text chunks and returns a dictionary with these statistics:

  • "count": total number of chunks
  • "avg_length": average character length (rounded to nearest integer)
  • "min_length": length of the shortest chunk
  • "max_length": length of the longest chunk
  • "short_chunks": count of chunks with fewer than 50 characters
  • "long_chunks": count of chunks with more than 500 characters
This is the kind of quality check you'd run after splitting documents to catch bad configurations.

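Try the exercise yourself first; for reference, here is one possible solution:

```python
def analyze_chunks(texts: list[str]) -> dict:
    """Summary statistics for a list of chunks (one possible solution)."""
    lengths = [len(t) for t in texts]
    return {
        "count": len(texts),
        "avg_length": round(sum(lengths) / len(lengths)),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "short_chunks": sum(1 for n in lengths if n < 50),
        "long_chunks": sum(1 for n in lengths if n > 500),
    }

stats = analyze_chunks(["a" * 30, "b" * 200, "c" * 600])
print(stats)
```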

    Common Mistakes and How to Fix Them

    I've debugged chunking pipelines for enough projects to see the same mistakes repeated. Here are the top five.

    Mistake 1: Zero Overlap

    Zero overlap — sentences split at boundaries
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=200,
        chunk_overlap=0,  # no overlap!
    )
    # Chunk 1 ends: "...the model achieves"
    # Chunk 2 starts: "95% accuracy on test data."
    # Neither chunk has the full sentence.
    With overlap — boundary sentences preserved
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=200,
        chunk_overlap=40,  # 20% overlap
    )
    # Chunk 1 ends: "...the model achieves 95% accuracy on test data."
    # Chunk 2 starts: "achieves 95% accuracy on test data. The next..."
    # Full sentence exists in at least one chunk.

    Mistake 2: Using Character Splitting for Structured Documents

    If your document has headings, use a structure-aware splitter. Splitting a markdown doc with RecursiveCharacterTextSplitter alone throws away all the heading context that would make your chunks more searchable.

    Mistake 3: Not Inspecting Your Chunks

    This is the most common mistake. People configure a splitter, pipe the chunks directly to an embedding model, and never look at what the chunks actually contain. Always print a sample of your chunks and read them. If a chunk doesn't make sense to you when you read it in isolation, it won't make sense to the embedding model either.

    Always inspect a sample of your chunks
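A small helper you might use for this; the chunk strings are made-up examples, including one deliberately orphaned fragment:

```python
import random

def inspect_chunks(chunks, sample_size=3, seed=0):
    """Print (and return) a random sample of chunks to read in isolation."""
    random.seed(seed)
    sample = random.sample(chunks, min(sample_size, len(chunks)))
    for c in sample:
        print(f"--- {len(c)} chars ---")
        print(c)
        print()
    return sample

chunks = [
    "Random forests combine many decision trees into one ensemble.",
    "It achieves 95% accuracy",  # red flag: no subject in sight
    "Gradient boosting builds trees sequentially, correcting errors.",
]
sampled = inspect_chunks(chunks)
```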

    Mistake 4: Chunk Size Way Too Small

    I see beginners set chunk_size=100 thinking smaller chunks mean more precise retrieval. In practice, 100-character chunks lack context and produce poor embeddings. Unless you're working with very short, self-contained items (like FAQ answers), keep chunks above 200 characters.

    Mistake 5: Ignoring Metadata

    Structure-aware splitters produce metadata (heading hierarchy, source file) that you should preserve and index. Metadata enables filtered retrieval — searching only within a specific document section — which is often more effective than pure vector similarity across your entire corpus.


    Implement a Simple Recursive Text Splitter
    Write Code

    Implement a function recursive_split(text: str, max_size: int, separators: list[str]) -> list[str] that mimics the recursive splitting logic.

    Rules:

    1. Try each separator in order. Split the text using the first separator that appears in the text.

    2. After splitting, check each piece. If a piece is within max_size, keep it.

    3. If a piece exceeds max_size, recursively split it using the remaining separators (starting from the next one).

    4. If no separators work (or the list is exhausted), return the text as-is (even if it exceeds max_size).

    5. Remove empty strings from the result.

    Do NOT implement overlap — just the core recursive logic.

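Attempt it on your own before reading further; one possible solution that follows the five rules:

```python
def recursive_split(text: str, max_size: int, separators: list[str]) -> list[str]:
    """One possible solution to the exercise (no overlap, per the rules)."""
    # Rule 1: use the first separator that actually appears in the text.
    sep_index = next((i for i, s in enumerate(separators) if s in text), None)
    if sep_index is None:
        # Rule 4: no separator works -> return as-is, even if too big.
        return [text] if text else []
    sep, rest = separators[sep_index], separators[sep_index + 1:]
    result = []
    for piece in text.split(sep):
        if not piece:
            continue  # Rule 5: drop empty strings
        if len(piece) <= max_size:
            result.append(piece)  # Rule 2: small enough, keep it
        else:
            # Rule 3: still too big -> recurse with the remaining separators.
            result.extend(recursive_split(piece, max_size, rest))
    return result

chunks = recursive_split("aaa bbb. ccc ddd eee. fff", 10, [". ", " "])
print(chunks)
```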

    Complete Code

    Copy the script below into a local Python file and run it to see every splitter in action. It covers CharacterTextSplitter, RecursiveCharacterTextSplitter, token-based splitting, MarkdownHeaderTextSplitter, and code-aware splitting — all on the same sample documents.

    Full script — copy and run locally

    Frequently Asked Questions

    How do I choose between character-based and token-based splitting?

    Use character-based splitting as your default — it's faster and simpler. Switch to token-based splitting when you need precise control over how many tokens each chunk consumes in your LLM's context window. This matters when your prompt template has a tight token budget and you need to fit exactly N chunks.

    Can I create a custom splitter for my document format?

    Yes. Subclass TextSplitter from langchain_text_splitters and override the split_text method. For most cases, though, you can get away with customizing the separators list on RecursiveCharacterTextSplitter.

    Custom separators for non-standard document formats

    How does chunk overlap affect storage and cost?

    Overlap duplicates text across chunks, which means more chunks total. With 20% overlap, you get roughly 25% more chunks than with zero overlap. This increases embedding storage and vector search time proportionally. For most applications this is a worthwhile tradeoff — the improvement in retrieval quality from not losing boundary context outweighs the extra cost.

    Should I include metadata in the chunk text before embedding?

    Sometimes. Prepending the heading hierarchy to the chunk text (e.g., "Machine Learning Guide > Supervised Learning > Classification: ...") gives the embedding model more context. This helps when section headings carry important topical information. But it also increases chunk length and can dilute the semantic signal for very short chunks.

    What chunk size should I use for factual Q&A versus analytical queries?

    Factual queries ("What is the default learning rate?") work best with smaller chunks (256-512 tokens) because the answer is a single fact you want to pinpoint. Analytical queries ("Compare the pros and cons of batch normalization") need larger chunks (1024+ tokens) because the answer requires surrounding context. If your RAG system handles both, consider splitting at a moderate size (512 tokens) and retrieving more chunks for analytical queries.


    What's Next

    Text splitting is just one stage in a RAG pipeline. Once you have well-formed chunks, you need to embed them and store them for retrieval. Here's the natural learning path from here:

  • Embedding Models for RAG — benchmark OpenAI, Cohere, and open-source embedding models to find the best fit for your chunks
  • Vector Databases Comparison — compare FAISS, ChromaDB, Pinecone, and Qdrant for storing and searching your embedded chunks
  • Build RAG with LangChain — put it all together: document loading, splitting, embedding, retrieval, and answer generation in one pipeline
  • RAG Explained — if you want to understand the fundamentals of why RAG works before diving into implementation

References

  • LangChain Documentation — Text Splitters. Link
  • LangChain API Reference — RecursiveCharacterTextSplitter. Link
  • LangChain Documentation — MarkdownHeaderTextSplitter. Link
  • LangChain Documentation — HTMLHeaderTextSplitter. Link
  • LangChain Experimental — SemanticChunker. Link
  • OpenAI — tiktoken tokenizer library. Link
  • Pinecone — Chunking Strategies for LLM Applications. Link
