Tokenizers Deep Dive: Build a BPE Tokenizer from Scratch in Python
Every LLM API charges per token — but what is a token? You're about to answer that by building the algorithm behind GPT's tokenizer: Byte-Pair Encoding, in about 50 lines of Python.
What Happens Before the Model Thinks
Before any attention heads fire or tokens get predicted, your text gets chopped into pieces called tokens. These pieces determine three things: how much you pay per API call, how much text fits in the context window, and how well the model handles your input. I assumed tokenizers used something sophisticated — maybe a dictionary or linguistic rules. The reality is a simple loop that merges common character pairs.
The BPE Algorithm Step by Step
How do you build a vocabulary that handles any word — even ones you've never seen? BPE starts with the smallest possible vocabulary: individual characters. Then it counts every adjacent pair, merges the most frequent one into a new token, and repeats.
Here's the algorithm on a tiny corpus. Each · separates tokens within a word:
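For instance, take an illustrative corpus of just three words: "low" (4×), "lower" (2×), and "newer" (3×). Ignoring for the moment the end-of-word marker introduced in the next section, the first three merges go like this (ties broken by whichever pair was counted first):

- Start: l·o·w, l·o·w·e·r, n·e·w·e·r
- Merge 1 (l + o → lo): lo·w, lo·w·e·r, n·e·w·e·r
- Merge 2 (lo + w → low): low, low·e·r, n·e·w·e·r
- Merge 3 (e + r → er): low, low·er, n·e·w·er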
Notice that "low" became a single token by merge 2 because it appeared frequently. The suffix "er" also got merged since it's common across "lower" and "newer." These reusable subword pieces are the heart of BPE — they let the tokenizer encode any word as a combination of learned fragments.
Build the Training Phase
I find that building algorithms from scratch is the fastest way to internalize them. We'll implement BPE in four small functions, each under 20 lines. The full implementation is about 50 lines of pure Python.
Split Words Into Characters
BPE operates on words. Each word gets split into individual characters with an end-of-word marker </w> appended. The marker lets the tokenizer distinguish "low" at the end of a word from "low" inside "lower":
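Here's a sketch of what this produces on the training corpus we'll use throughout the article (get_word_frequencies is the version from the complete script at the end):

corpus = "low low low low low lower lower newest newest newest widest widest"
word_freq = get_word_frequencies(corpus)
for word, freq in word_freq.items():
    print(freq, word)
# 5 ('l', 'o', 'w', '</w>')
# 2 ('l', 'o', 'w', 'e', 'r', '</w>')
# 3 ('n', 'e', 'w', 'e', 's', 't', '</w>')
# 2 ('w', 'i', 'd', 'e', 's', 't', '</w>')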
Four unique words, each split into characters. "low" appears 5 times, so its character tuple gets count 5. The </w> marker is critical — without it, the tokenizer can't tell where words end.
Write get_word_frequencies(text) that splits text into words, converts each word into a tuple of characters plus '</w>', and returns a dictionary mapping each tuple to its count.
For example, get_word_frequencies("cat cat dog") should return {('c','a','t','</w>'): 2, ('d','o','g','</w>'): 1}.
Count Adjacent Pairs
Next, scan every word and count how often each adjacent token pair appears, weighted by that word's frequency:
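Continuing from the word_freq above (count_pairs as defined in the complete script), the most frequent pairs come out like this:

pairs = count_pairs(word_freq)
top = sorted(pairs.items(), key=lambda item: item[1], reverse=True)[:5]
for pair, freq in top:
    print(pair, freq)
# ('l', 'o') 7
# ('o', 'w') 7
# ('w', '</w>') 5
# ('w', 'e') 5
# ('e', 's') 5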
The pairs (l, o) and (o, w) each appear 7 times — 5 from "low" and 2 from "lower." The algorithm will merge whichever comes first in the iteration order. Five other pairs tie at frequency 5.
Merge the Most Common Pair
The merge_pair function scans every word and replaces adjacent tokens matching the winning pair with a single merged token:
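A quick sketch of one merge step, applying the winning pair ('l', 'o') to the same word_freq:

word_freq = merge_pair(word_freq, ('l', 'o'))
for word, freq in word_freq.items():
    print(freq, word)
# 5 ('lo', 'w', '</w>')
# 2 ('lo', 'w', 'e', 'r', '</w>')
# 3 ('n', 'e', 'w', 'e', 's', 't', '</w>')
# 2 ('w', 'i', 'd', 'e', 's', 't', '</w>')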
Every l + o in the vocabulary became lo. The word "low" is now lo·w·</w> instead of l·o·w·</w> — one token shorter. The algorithm repeats this process, compressing the vocabulary one merge at a time.
The Complete Training Loop
Now we chain everything together. The train_bpe function runs count_pairs → find best → merge for a given number of iterations:
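Running train_bpe (from the complete script) on the corpus with 10 merges produces a trace like the one below; exact tie-breaks follow dict insertion order, so they can shift if you reorder the corpus:

merges, final_vocab = train_bpe(corpus, num_merges=10)
# Merge  1: l      + o       -> lo          (freq 7)
# Merge  2: lo     + w       -> low         (freq 7)
# Merge  3: low    + </w>    -> low</w>     (freq 5)
# Merge  4: e      + s       -> es          (freq 5)
# Merge  5: es     + t       -> est         (freq 5)
# Merge  6: est    + </w>    -> est</w>     (freq 5)
# Merge  7: n      + e       -> ne          (freq 3)
# Merge  8: ne     + w       -> new         (freq 3)
# Merge  9: new    + est</w> -> newest</w>  (freq 3)
# Merge 10: low    + e       -> lowe        (freq 2)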
Watch what happened: "low" became a single token low</w> by merge 3, and "newest" collapsed to newest</w> by merge 9. The suffix "est" was learned as a reusable piece — it appears in both "newest" and "widest." This is exactly how production tokenizers discover subword patterns from data.
Encode Text With Learned Merges
Training builds the vocabulary. Encoding uses it. The encode function takes a word, splits it into characters, then applies each learned merge rule in order:
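With the 10 merge rules learned above, encoding looks like this (encode_text wraps encode_word; both are in the complete script):

print(encode_text("low newest lower widest", merges))
# {'low': ['low</w>'], 'newest': ['newest</w>'],
#  'lower': ['lowe', 'r', '</w>'], 'widest': ['w', 'i', 'd', 'est</w>']}
print(encode_text("slowly", merges))
# {'slowly': ['s', 'low', 'l', 'y', '</w>']}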
Trained words encode cleanly — "low" and "newest" are single tokens. Unseen words get split into the largest matching subwords: "slowly" reuses the learned "low" token plus individual characters for the rest. This graceful fallback is why BPE handles any input without an "unknown word" error.
Write encode_word(word, merges) that takes a string and a list of merge rules, and returns the tokenized word as a list of strings.
Each merge rule is a tuple ((left, right), merged). Start by splitting the word into characters plus "</w>", then apply each merge in order.
Example: encode_word("low", [(("l","o"), "lo"), (("lo","w"), "low")]) → ['low', '</w>']
Your BPE vs Production Tokenizers
Our tokenizer trained on 12 words. GPT-4's tokenizer (tiktoken 0.5+, cl100k_base encoding) trained on billions. The algorithm is identical — the difference is scale. Here's what the comparison looks like:
|  | Our BPE | GPT-4 (cl100k_base) |
| --------- | --------- | ------------------ |
| Vocabulary size | ~20 tokens | ~100,000 tokens |
| Input level | Characters | Bytes (0-255) |
| Training data | 12 words | Billions of pages |
| Speed | Pure Python | Rust backend |
| Handles emoji/binary | No | Yes |

Both tokenizers agree that common English words like "low" and "newest" should be single tokens. The difference: tiktoken knows "lower" and "widest" as single tokens too, because it trained on enough data to see them millions of times. Our toy corpus only had 2 occurrences of each.
The Multilingual Tokenization Tax
Try tokenizing the same sentence in English, Chinese, and Hindi. The results reveal an important — and often invisible — bias built into most tokenizers:
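Here's a sketch of that experiment (it assumes tiktoken is installed; the sample sentences below are illustrative stand-ins, not the original comparison):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "The cat is sitting on the mat.",
    "Chinese": "猫坐在垫子上。",
    "Hindi": "बिल्ली चटाई पर बैठी है।",
}
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang:8s} {len(text):3d} chars  {n_tokens:3d} tokens  "
          f"{n_tokens / len(text):.2f} tokens/char")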
English averages about 0.25 tokens per character. Chinese and Hindi run 2-3x higher. The reason: GPT-4's tokenizer trained mostly on English text, so English byte sequences were merged aggressively into large tokens. Non-Latin scripts were seen less often, so their byte sequences stayed fragmented.
This has three concrete consequences for anyone working in a non-English language:
- Cost: the same content costs 2-3x more per API call.
- Context: less of the text fits into the model's context window.
- Quality: the model receives a more fragmented token sequence, which can make the input harder for it to handle.
Write byte_count(text) that returns a dictionary with three keys:
- 'chars': number of characters (len(text))
- 'bytes': number of UTF-8 bytes (len(text.encode('utf-8')))
- 'bytes_per_char': ratio of bytes to characters, rounded to 2 decimal places

This illustrates why non-ASCII text costs more to tokenize — each character takes more bytes, and byte-level tokenizers must process more input.
Common Pitfalls When Working With Tokenizers
Pitfall 1: Assuming 1 token ≈ 1 word. This holds roughly for English prose (~1.3 tokens per word) but breaks for code (2-3 tokens per word), non-Latin scripts (2-5 tokens per character), and text with heavy punctuation. I've seen teams blow their token budget by 3x because they estimated costs using word counts.
# Naive estimate
words = len(text.split())
estimated_tokens = words # Wrong!
# "def fibonacci(n):" → 1 word-ish
# actual tokens: ["def", " fib", "onacci", "(", "n", "):"] → 6 tokens

# Accurate count
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = len(enc.encode(text))
# For non-OpenAI models, use their
# specific tokenizer library

Pitfall 2: Using the wrong tokenizer for cost estimates. Each model family has its own tokenizer. tiktoken is OpenAI-only. Using it to estimate Claude or Llama costs gives wrong numbers — sometimes off by 20%. Anthropic provides token counts in their API response; Llama uses the transformers library.
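For example, counting tokens for a Llama-family model with its own tokenizer might look like this (a sketch; it assumes the transformers package and access to the tokenizer files for the model ID shown):

from transformers import AutoTokenizer

# Illustrative model ID; any Hugging Face model with a tokenizer works the same way.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
text = "def fibonacci(n):"
ids = tok.encode(text, add_special_tokens=False)
print(len(ids))  # token count under this model's tokenizer, not tiktoken's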
Pitfall 3: Forgetting tokenization affects prompt behavior. The model sees tokens, not your words. "don't" might be one token or two ("don" + "'t"). Hyphenated words split unpredictably. When a model seems to ignore part of your instruction, tokenization boundaries are sometimes the culprit.
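When you suspect a boundary problem, printing the individual token pieces makes it visible (this sketch assumes tiktoken and applies to OpenAI models only):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
for text in ["don't", "state-of-the-art", "re-prompt the model"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]  # decode each token ID separately
    print(f"{text!r} -> {pieces}")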
Summary and Next Steps
You built a BPE tokenizer from scratch and saw that the "magic" is a loop merging common character pairs. The same algorithm powers GPT-4, Llama, and most modern LLMs — just at a much larger scale. You also discovered tokenization's hidden bias: English text gets efficient single-token encoding, while other languages pay a 2-3x tax.
What to explore next:
Practice: Build a compare_tokenization(text, corpus, num_merges_list) function that trains BPE on corpus with different merge counts (e.g., 5, 10, 20) and shows how the tokenization of text changes as the vocabulary grows.
Click to see the solution
def compare_tokenization(text, corpus, num_merges_list):
"""Show how tokenization changes with vocabulary size."""
for n in num_merges_list:
merges, _ = train_bpe(corpus, n)
tokens = encode_word(text, merges)
print(f" {n:2d} merges: {tokens} ({len(tokens)} tokens)")
corpus = "low low low low low lower lower newest newest newest widest widest"
print("Tokenizing 'newest' with different merge counts:\n")
compare_tokenization("newest", corpus, [2, 5, 8, 10])You'll see "newest" go from 7 character tokens (0 merges) down to 1 token (9+ merges). More merges = more compression = fewer tokens per word.
Complete Code
Click to expand the full script (copy-paste and run)
# Complete code from: Tokenizers Deep Dive — Build a BPE Tokenizer
# No external dependencies — pure Python 3.9+
# --- BPE Training Functions ---
def get_word_frequencies(text):
"""Split text into character tuples with </w> and count frequencies."""
word_freq = {}
for word in text.split():
chars = tuple(list(word) + ["</w>"])
word_freq[chars] = word_freq.get(chars, 0) + 1
return word_freq
def count_pairs(word_freq):
"""Count adjacent token pair frequencies across all words."""
pairs = {}
for word, freq in word_freq.items():
for i in range(len(word) - 1):
pair = (word[i], word[i + 1])
pairs[pair] = pairs.get(pair, 0) + freq
return pairs
def merge_pair(word_freq, pair_to_merge):
"""Replace every occurrence of pair_to_merge with merged token."""
new_word_freq = {}
merged = pair_to_merge[0] + pair_to_merge[1]
for word, freq in word_freq.items():
new_word = []
i = 0
while i < len(word):
if (i < len(word) - 1
and word[i] == pair_to_merge[0]
and word[i + 1] == pair_to_merge[1]):
new_word.append(merged)
i += 2
else:
new_word.append(word[i])
i += 1
new_word_freq[tuple(new_word)] = freq
return new_word_freq
def train_bpe(text, num_merges):
"""Train BPE tokenizer, returning merge rules and final vocabulary."""
word_freq = get_word_frequencies(text)
merges = []
for i in range(num_merges):
pairs = count_pairs(word_freq)
if not pairs:
break
best_pair = max(pairs.items(), key=lambda x: x[1])[0]
best_freq = pairs[best_pair]
merged_token = best_pair[0] + best_pair[1]
merges.append((best_pair, merged_token))
word_freq = merge_pair(word_freq, best_pair)
print(f" Merge {i+1:2d}: {best_pair[0]:8s} + {best_pair[1]:8s} -> {merged_token:12s} (freq {best_freq})")
return merges, word_freq
# --- BPE Encoding Functions ---
def encode_word(word, merges):
"""Tokenize a single word using learned merge rules."""
tokens = list(word) + ["</w>"]
for pair, merged in merges:
new_tokens = []
i = 0
while i < len(tokens):
if (i < len(tokens) - 1
and tokens[i] == pair[0]
and tokens[i + 1] == pair[1]):
new_tokens.append(merged)
i += 2
else:
new_tokens.append(tokens[i])
i += 1
tokens = new_tokens
return tokens
def encode_text(text, merges):
"""Tokenize a full string."""
return {word: encode_word(word, merges) for word in text.split()}
# --- Demo ---
corpus = "low low low low low lower lower newest newest newest widest widest"
print("Training BPE with 10 merges:\n")
merges, final_vocab = train_bpe(corpus, num_merges=10)
print(f"\nFinal vocabulary:")
for word, freq in final_vocab.items():
print(f" {chr(183).join(word):20s} -> {freq}")
print("\nEncoding trained words:")
for word, tokens in encode_text("low newest lower widest", merges).items():
print(f" {word:10s} -> {tokens}")
print("\nEncoding unseen words:")
for word, tokens in encode_text("slowly flowing unknown", merges).items():
print(f" {word:10s} -> {tokens}")
print("\nScript completed successfully.")Frequently Asked Questions
What's the difference between BPE and WordPiece?
Both are subword tokenization algorithms. BPE merges the most frequent pair. WordPiece (used by BERT) merges the pair that maximizes training data likelihood — a slightly more sophisticated criterion. In practice they produce similar vocabularies. BPE dominates in generative models (GPT, Llama); WordPiece is more common in encoder models (BERT, DistilBERT).
Can I use tiktoken for Claude or Llama?
No — tiktoken is OpenAI-specific. For Claude, Anthropic returns token counts in the API response (usage.input_tokens). For Llama and other Hugging Face models, use the transformers library: AutoTokenizer.from_pretrained("meta-llama/Llama-3-8b"). Each model has its own tokenizer trained on different data.
How do special tokens like <|endoftext|> work?
Special tokens are added outside the BPE merge process. They mark boundaries: end-of-document, start-of-response, tool calls, system prompts. Token IDs above the main vocabulary range (e.g., above 100,000 for GPT-4) are typically special tokens. You can't produce them by typing — they're injected by the API.
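You can see this with tiktoken (assuming it's installed): encode() refuses special-token text by default, and opting in returns an ID above the ordinary vocabulary.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Special-token text is rejected unless you explicitly allow it.
ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)  # a single reserved ID above the ~100K ordinary tokens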
Why not just use whole words as tokens?
A word-level vocabulary would need millions of entries (every word form in every language) and still couldn't handle typos, neologisms, or code. BPE's subword approach keeps the vocabulary around 100K tokens while covering any input. The tradeoff: common words get 1 token, rare words get split into 2-5 pieces.