
Tokenizers Deep Dive: Build a BPE Tokenizer from Scratch in Python

Intermediate · 50 min · 3 exercises · 50 XP

Every LLM API charges per token — but what is a token? You're about to answer that by building the algorithm behind GPT's tokenizer: Byte-Pair Encoding, in about 50 lines of Python.

What Happens Before the Model Thinks

Tokenization splits text into subword pieces

Before any attention heads fire or tokens get predicted, your text gets chopped into pieces called tokens. These pieces determine three things: how much you pay per API call, how much text fits in the context window, and how well the model handles your input. I assumed tokenizers used something sophisticated — maybe a dictionary or linguistic rules. The reality is a simple loop that merges common character pairs.

The BPE Algorithm Step by Step

How do you build a vocabulary that handles any word — even ones you've never seen? BPE starts with the smallest possible vocabulary: individual characters. Then it counts every adjacent pair, merges the most frequent one into a new token, and repeats.

Here's the algorithm on a tiny corpus. Each · separates tokens within a word:

BPE walkthrough on a tiny corpus
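As a stand-in for the interactive demo, here is a minimal, self-contained sketch of the merge loop on the same tiny corpus used later in this lesson. Ties between equally frequent pairs are broken by iteration order, so the exact merge sequence can vary between implementations:

```python
from collections import Counter

# Tiny corpus: "low" x5, "lower" x2, "newest" x3, "widest" x2
corpus = "low low low low low lower lower newest newest newest widest widest"

# Start from individual characters, with an end-of-word marker per word
word_freq = Counter(tuple(w) + ("</w>",) for w in corpus.split())

merges = []
for step in range(3):
    # Count every adjacent pair, weighted by word frequency
    pairs = Counter()
    for word, freq in word_freq.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    # Merge the most frequent pair into a single new token
    best = max(pairs, key=pairs.get)
    merges.append(best[0] + best[1])
    new_freq = Counter()
    for word, freq in word_freq.items():
        tokens, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                tokens.append(best[0] + best[1])
                i += 2
            else:
                tokens.append(word[i])
                i += 1
        new_freq[tuple(tokens)] += freq
    word_freq = new_freq
    print(f"Merge {step + 1}: {best} -> {merges[-1]}")
```

With this tie-breaking, the first three merges are lo, then low, then low</w>, at which point the word "low" is a single token.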

Notice that "low" collapsed into a single token within the first few merges because it appeared so frequently. Common suffixes get merged too: "est" becomes a shared piece of "newest" and "widest." These reusable subword fragments are the heart of BPE: they let the tokenizer encode any word as a combination of learned pieces.

Build the Training Phase

I find building algorithms from scratch is the fastest way to internalize them. We'll implement BPE in four small functions — each under 15 lines. The full implementation is about 50 lines of pure Python.

Split Words Into Characters

BPE operates on words. Each word gets split into individual characters with an end-of-word marker </w> appended. The marker lets the tokenizer distinguish "low" at the end of a word from "low" inside "lower":

Split words into character tuples with end-of-word marker
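For a quick illustration of this step, here is a one-liner using collections.Counter (distinct from the get_word_frequencies function you'll write in the exercise below):

```python
from collections import Counter

corpus = "low low low low low lower lower newest newest newest widest widest"

# Each word -> tuple of its characters plus the end-of-word marker
word_freq = Counter(tuple(w) + ("</w>",) for w in corpus.split())

for word, count in word_freq.items():
    print("".join(word), "->", count)
```

This prints four entries, with "low</w>" carrying count 5.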

Four unique words, each split into characters. "low" appears 5 times, so its character tuple gets count 5. The </w> marker is critical — without it, the tokenizer can't tell where words end.

Implement get_word_frequencies
Write Code

Write get_word_frequencies(text) that splits text into words, converts each word into a tuple of characters plus '</w>', and returns a dictionary mapping each tuple to its count.

For example, get_word_frequencies("cat cat dog") should return {('c','a','t','</w>'): 2, ('d','o','g','</w>'): 1}.


Count Adjacent Pairs

Next, scan every word and count how often each adjacent token pair appears, weighted by that word's frequency:

Count adjacent pair frequencies across the corpus
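Here is one way to sketch the pair-counting step. The word frequencies are hard-coded from the corpus above so the snippet stands alone:

```python
# Word frequencies for the tiny corpus, as char tuples with </w>
word_freq = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 3,
    ("w", "i", "d", "e", "s", "t", "</w>"): 2,
}

def count_pairs(word_freq):
    """Count adjacent token pairs, weighted by each word's frequency."""
    pairs = {}
    for word, freq in word_freq.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] = pairs.get((a, b), 0) + freq
    return pairs

pairs = count_pairs(word_freq)
top = sorted(pairs.items(), key=lambda kv: -kv[1])[:3]
print(top)  # (l, o) and (o, w) lead with 7
```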

The pairs (l, o) and (o, w) each appear 7 times — 5 from "low" and 2 from "lower." The algorithm will merge whichever comes first in the iteration order. Five other pairs tie at frequency 5.

Merge the Most Common Pair

The merge_pair function scans every word and replaces adjacent tokens matching the winning pair with a single merged token:

Merge the highest-frequency pair into a new token
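A sketch of the merge step, applied to the two words that contain the winning pair (l, o); the frequency dictionary is hard-coded so the snippet runs on its own:

```python
def merge_pair(word_freq, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, out = pair[0] + pair[1], {}
    for word, freq in word_freq.items():
        tokens, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                tokens.append(merged)   # collapse the pair
                i += 2
            else:
                tokens.append(word[i])  # keep the token as-is
                i += 1
        out[tuple(tokens)] = out.get(tuple(tokens), 0) + freq
    return out

word_freq = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
}
merged_freq = merge_pair(word_freq, ("l", "o"))
print(merged_freq)
# {('lo', 'w', '</w>'): 5, ('lo', 'w', 'e', 'r', '</w>'): 2}
```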

Every l + o in the vocabulary became lo. The word "low" is now lo·w·</w> instead of l·o·w·</w> — one token shorter. The algorithm repeats this process, compressing the vocabulary one merge at a time.

The Complete Training Loop

Now we chain everything together. The train_bpe function runs count_pairs → find best → merge for a given number of iterations:

Full BPE training loop: 10 merges on a tiny corpus
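Putting the pieces together, here is a self-contained sketch of the training loop. Greedy tie-breaking follows dictionary iteration order, so merge numbering may differ slightly from other BPE implementations:

```python
def get_word_frequencies(text):
    """Map each word (char tuple + </w>) to its count."""
    freqs = {}
    for w in text.split():
        key = tuple(w) + ("</w>",)
        freqs[key] = freqs.get(key, 0) + 1
    return freqs

def count_pairs(word_freq):
    """Count adjacent token pairs, weighted by word frequency."""
    pairs = {}
    for word, freq in word_freq.items():
        for p in zip(word, word[1:]):
            pairs[p] = pairs.get(p, 0) + freq
    return pairs

def merge_pair(word_freq, pair):
    """Replace every adjacent occurrence of `pair` with one token."""
    merged, out = pair[0] + pair[1], {}
    for word, freq in word_freq.items():
        tokens, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                tokens.append(merged)
                i += 2
            else:
                tokens.append(word[i])
                i += 1
        out[tuple(tokens)] = out.get(tuple(tokens), 0) + freq
    return out

def train_bpe(text, num_merges):
    """Run count -> pick best -> merge for num_merges iterations."""
    word_freq = get_word_frequencies(text)
    merges = []
    for i in range(num_merges):
        pairs = count_pairs(word_freq)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append((best, best[0] + best[1]))
        word_freq = merge_pair(word_freq, best)
        print(f"Merge {i + 1:2d}: {best} -> {best[0] + best[1]}")
    return merges, word_freq

corpus = "low low low low low lower lower newest newest newest widest widest"
merges, vocab = train_bpe(corpus, 10)
```

With this tie-breaking, merge 3 produces low</w> and merge 9 produces newest</w>, matching the walkthrough below.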

Watch what happened: "low" became a single token low</w> by merge 3, and "newest" collapsed to newest</w> by merge 9. The suffix "est" was learned as a reusable piece — it appears in both "newest" and "widest." This is exactly how production tokenizers discover subword patterns from data.

Encode Text With Learned Merges

Training builds the vocabulary. Encoding uses it. The encode function takes a word, splits it into characters, then applies each learned merge rule in order:

Encode trained and unseen text with learned merges
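The sketch below hard-codes one merge sequence that training on the tiny corpus can produce (the exact numbering depends on tie-breaking) and applies it with a minimal encode_word:

```python
# Learned merge rules as ((left, right), merged), in training order
merges = [
    (("l", "o"), "lo"), (("lo", "w"), "low"), (("low", "</w>"), "low</w>"),
    (("e", "s"), "es"), (("es", "t"), "est"), (("est", "</w>"), "est</w>"),
    (("n", "e"), "ne"), (("ne", "w"), "new"), (("new", "est</w>"), "newest</w>"),
]

def encode_word(word, merges):
    """Split into chars + </w>, then apply each merge rule in order."""
    tokens = list(word) + ["</w>"]
    for pair, merged in merges:
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

print(encode_word("low", merges))     # ['low</w>']
print(encode_word("newest", merges))  # ['newest</w>']
print(encode_word("slowly", merges))  # ['s', 'low', 'l', 'y', '</w>']
```

Note how the unseen word "slowly" falls back gracefully: it reuses the learned "low" token and single characters for the rest.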

Trained words encode cleanly — "low" and "newest" are single tokens. Unseen words get split into the largest matching subwords: "slowly" reuses the learned "low" token plus individual characters for the rest. This graceful fallback is why BPE handles any input without an "unknown word" error.

Implement encode_word
Write Code

Write encode_word(word, merges) that takes a string and a list of merge rules, and returns the tokenized word as a list of strings.

Each merge rule is a tuple ((left, right), merged). Start by splitting the word into characters plus "</w>", then apply each merge in order.

Example: encode_word("low", [(("l","o"), "lo"), (("lo","w"), "low")]) returns ['low', '</w>'].


Your BPE vs Production Tokenizers

Our tokenizer trained on 12 words. GPT-4's tokenizer (tiktoken 0.5+, cl100k_base encoding) trained on billions. The algorithm is identical — the difference is scale. Here's what the comparison looks like:

Comparing your BPE output with GPT-4's tiktoken
Both tokenizers agree that common English words like "low" and "newest" should be single tokens. The difference: tiktoken knows "lower" and "widest" as single tokens too, because it trained on enough data to see them millions of times. Our toy corpus only had 2 occurrences of each.
                       Our toy BPE     GPT-4 (tiktoken)
Vocabulary size        ~20 tokens      ~100,000 tokens
Input level            Characters      Bytes (0-255)
Training data          12 words        Billions of pages
Speed                  Pure Python     Rust backend
Handles emoji/binary   No              Yes

The Multilingual Tokenization Tax

Try tokenizing the same sentence in English, Chinese, and Hindi. The results reveal an important — and often invisible — bias built into most tokenizers:

Token efficiency across languages with GPT-4's tokenizer

English averages about 0.25 tokens per character. Chinese and Hindi run 2-3x higher. The reason: GPT-4's tokenizer trained mostly on English text, so English byte sequences were merged aggressively into large tokens. Non-Latin scripts were seen less often, so their byte sequences stayed fragmented.
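The byte-level half of this tax is easy to measure with plain Python; the specific sample strings here are my own illustration:

```python
# UTF-8 encodes ASCII in 1 byte per character, while CJK and
# Devanagari characters each take 3 bytes
for text in ["hello", "你好", "नमस्ते"]:
    raw = text.encode("utf-8")
    print(f"{text!r}: {len(text)} chars, {len(raw)} bytes, "
          f"{len(raw) / len(text):.2f} bytes/char")
```

A byte-level tokenizer therefore starts with roughly three times as much input for these scripts before any merges are applied.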

This has three concrete consequences:

  • Cost: The same content in Chinese costs roughly 2-3x more tokens than English
  • Context: A 128K-token window holds significantly less Chinese text
  • Quality: Fewer tokens of non-English text were seen during pre-training
Implement byte_count
Write Code

Write byte_count(text) that returns a dictionary with three keys:

  • 'chars': number of characters (len(text))
  • 'bytes': number of UTF-8 bytes (len(text.encode('utf-8')))
  • 'bytes_per_char': ratio of bytes to characters, rounded to 2 decimal places

This illustrates why non-ASCII text costs more to tokenize: each character takes more bytes, and byte-level tokenizers must process more input.

    Common Pitfalls When Working With Tokenizers

    Pitfall 1: Assuming 1 token ≈ 1 word. This holds roughly for English prose (~1.3 tokens per word) but breaks for code (2-3 tokens per word), non-Latin scripts (2-5 tokens per character), and text with heavy punctuation. I've seen teams blow their token budget by 3x because they estimated costs using word counts.

    Wrong: estimate by word count
    # Naive estimate
    words = len(text.split())
    estimated_tokens = words  # Wrong!
    # "def fibonacci(n):" → 2 words by split()
    # actual tokens: ["def", " fib", "onacci", "(", "n", "):"] → 6 tokens
    Right: use the model's tokenizer
    # Accurate count
    import tiktoken
    enc = tiktoken.encoding_for_model("gpt-4")
    tokens = len(enc.encode(text))
    # For non-OpenAI models, use their
    # specific tokenizer library

    Pitfall 2: Using the wrong tokenizer for cost estimates. Each model family has its own tokenizer. tiktoken is OpenAI-only. Using it to estimate Claude or Llama costs gives wrong numbers — sometimes off by 20%. Anthropic provides token counts in their API response; Llama uses the transformers library.

    Pitfall 3: Forgetting tokenization affects prompt behavior. The model sees tokens, not your words. "don't" might be one token or two ("don" + "'t"). Hyphenated words split unpredictably. When a model seems to ignore part of your instruction, tokenization boundaries are sometimes the culprit.

    Summary and Next Steps

    You built a BPE tokenizer from scratch and saw that the "magic" is a loop merging common character pairs. The same algorithm powers GPT-4, Llama, and most modern LLMs — just at a much larger scale. You also discovered tokenization's hidden bias: English text gets efficient single-token encoding, while other languages pay a 2-3x tax.

    What to explore next:

  • LLM Sampling Parameters — Temperature, top-p, and top-k control how tokens get chosen
  • Context Windows Explained — Build a token-budget manager that tracks costs across languages
  • OpenAI API Crash Course — Apply tokenization knowledge to real API calls
  • Practice: Build a compare_tokenization(text, corpus, num_merges_list) function that trains BPE on corpus with different merge counts (e.g., 5, 10, 20) and shows how the tokenization of text changes as the vocabulary grows.

    Click to see the solution
    def compare_tokenization(text, corpus, num_merges_list):
        """Show how tokenization changes with vocabulary size."""
        for n in num_merges_list:
            merges, _ = train_bpe(corpus, n)
            tokens = encode_word(text, merges)
            print(f"  {n:2d} merges: {tokens} ({len(tokens)} tokens)")
    
    corpus = "low low low low low lower lower newest newest newest widest widest"
    print("Tokenizing 'newest' with different merge counts:\n")
    compare_tokenization("newest", corpus, [2, 5, 8, 10])

    You'll see "newest" shrink from 7 character-level tokens at 2 merges down to a single token at 10 merges. More merges = more compression = fewer tokens per word.

    Complete Code

    Click to expand the full script (copy-paste and run)
    # Complete code from: Tokenizers Deep Dive — Build a BPE Tokenizer
    # No external dependencies — pure Python 3.9+
    
    # --- BPE Training Functions ---
    
    def get_word_frequencies(text):
        """Split text into character tuples with </w> and count frequencies."""
        word_freq = {}
        for word in text.split():
            chars = tuple(list(word) + ["</w>"])
            word_freq[chars] = word_freq.get(chars, 0) + 1
        return word_freq
    
    def count_pairs(word_freq):
        """Count adjacent token pair frequencies across all words."""
        pairs = {}
        for word, freq in word_freq.items():
            for i in range(len(word) - 1):
                pair = (word[i], word[i + 1])
                pairs[pair] = pairs.get(pair, 0) + freq
        return pairs
    
    def merge_pair(word_freq, pair_to_merge):
        """Replace every occurrence of pair_to_merge with merged token."""
        new_word_freq = {}
        merged = pair_to_merge[0] + pair_to_merge[1]
        for word, freq in word_freq.items():
            new_word = []
            i = 0
            while i < len(word):
                if (i < len(word) - 1
                    and word[i] == pair_to_merge[0]
                    and word[i + 1] == pair_to_merge[1]):
                    new_word.append(merged)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word_freq[tuple(new_word)] = freq
        return new_word_freq
    
    def train_bpe(text, num_merges):
        """Train BPE tokenizer, returning merge rules and final vocabulary."""
        word_freq = get_word_frequencies(text)
        merges = []
        for i in range(num_merges):
            pairs = count_pairs(word_freq)
            if not pairs:
                break
            best_pair = max(pairs.items(), key=lambda x: x[1])[0]
            best_freq = pairs[best_pair]
            merged_token = best_pair[0] + best_pair[1]
            merges.append((best_pair, merged_token))
            word_freq = merge_pair(word_freq, best_pair)
            print(f"  Merge {i+1:2d}: {best_pair[0]:8s} + {best_pair[1]:8s} -> {merged_token:12s} (freq {best_freq})")
        return merges, word_freq
    
    # --- BPE Encoding Functions ---
    
    def encode_word(word, merges):
        """Tokenize a single word using learned merge rules."""
        tokens = list(word) + ["</w>"]
        for pair, merged in merges:
            new_tokens = []
            i = 0
            while i < len(tokens):
                if (i < len(tokens) - 1
                    and tokens[i] == pair[0]
                    and tokens[i + 1] == pair[1]):
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            tokens = new_tokens
        return tokens
    
    def encode_text(text, merges):
        """Tokenize a full string."""
        return {word: encode_word(word, merges) for word in text.split()}
    
    # --- Demo ---
    
    corpus = "low low low low low lower lower newest newest newest widest widest"
    print("Training BPE with 10 merges:\n")
    merges, final_vocab = train_bpe(corpus, num_merges=10)
    
    print(f"\nFinal vocabulary:")
    for word, freq in final_vocab.items():
        print(f"  {chr(183).join(word):20s} -> {freq}")
    
    print("\nEncoding trained words:")
    for word, tokens in encode_text("low newest lower widest", merges).items():
        print(f"  {word:10s} -> {tokens}")
    
    print("\nEncoding unseen words:")
    for word, tokens in encode_text("slowly flowing unknown", merges).items():
        print(f"  {word:10s} -> {tokens}")
    
    print("\nScript completed successfully.")

    Frequently Asked Questions

    What's the difference between BPE and WordPiece?

    Both are subword tokenization algorithms. BPE merges the most frequent pair. WordPiece (used by BERT) merges the pair that maximizes training data likelihood — a slightly more sophisticated criterion. In practice they produce similar vocabularies. BPE dominates in generative models (GPT, Llama); WordPiece is more common in encoder models (BERT, DistilBERT).

    Can I use tiktoken for Claude or Llama?

    No — tiktoken is OpenAI-specific. For Claude, Anthropic returns token counts in the API response (usage.input_tokens). For Llama and other Hugging Face models, use the transformers library: AutoTokenizer.from_pretrained("meta-llama/Llama-3-8b"). Each model has its own tokenizer trained on different data.

    How do special tokens like <|endoftext|> work?

    Special tokens are added outside the BPE merge process. They mark boundaries: end-of-document, start-of-response, tool calls, system prompts. Token IDs above the main vocabulary range (e.g., above 100,000 for GPT-4) are typically special tokens. You can't produce them by typing — they're injected by the API.

    Why not just use whole words as tokens?

    A word-level vocabulary would need millions of entries (every word form in every language) and still couldn't handle typos, neologisms, or code. BPE's subword approach keeps the vocabulary around 100K tokens while covering any input. The tradeoff: common words get 1 token, rare words get split into 2-5 pieces.

    References

  • Gage, P. — "A New Algorithm for Data Compression." The C Users Journal, 1994. (Original BPE paper)
  • Sennrich, R. et al. — "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. arXiv:1508.07909
  • OpenAI — tiktoken tokenizer library. GitHub
  • Kudo, T. & Richardson, J. — "SentencePiece: A simple and language independent subword tokenizer." EMNLP 2018. arXiv:1808.06226
  • Petrov, A. et al. — "Language Model Tokenizers Introduce Unfairness Between Languages." arXiv 2023. arXiv:2305.15425
  • Hugging Face — Tokenizers documentation. Link
  • OpenAI — GPT-4 Technical Report. arXiv:2303.08774