
Context Windows Explained: Build a Smart Token-Budget Manager in Python

Intermediate · 60 min · 3 exercises · 55 XP

You paste a 50-page PDF into ChatGPT and get back: "This conversation has exceeded the model's maximum context length." That wall you just hit is the context window — and understanding it changes how you design every LLM application.

What Is a Context Window and Why Should You Care?

Picture a desk with a fixed surface area. Everything the model reads and writes must fit on that desk — your system prompt, the conversation history, the user's latest message, and the model's response. That desk is the context window, measured in tokens.

Different models have wildly different desk sizes. GPT-4o gives you 128K tokens. Claude 3.5 offers 200K. Gemini 1.5 Pro stretches to 2 million. Picking the right model for your use case starts with knowing these numbers.

Provider context window comparison
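A minimal sketch of the comparison, using the window sizes cited above (treat the numbers as a snapshot, not a guarantee; they change with model releases):

```python
# Window sizes as cited above (a snapshot; check provider docs).
WINDOWS = {"GPT-4o": 128_000, "Claude 3.5": 200_000, "Gemini 1.5 Pro": 2_000_000}
DOC_TOKENS = 16_250  # a 50-page PDF, estimated later in this lesson

for model, window in WINDOWS.items():
    # How many such PDFs fit in each window, context for the raw numbers
    print(f"{model:15s} {window:>9,} tokens  (~{window // DOC_TOKENS} such PDFs)")
```

Even the smallest of these windows holds several book-length documents, which is why the real constraint is usually the accumulated conversation, not any single file.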

Context windows keep growing, but they are never infinite. Even with Gemini's 2M tokens, a single large codebase can exceed the limit. Larger contexts also mean higher latency and cost — stuffing everything in is rarely the right call.


Counting Tokens — The Real Currency

Build a simple token estimator

Word count alone is misleading. Code and URLs contain punctuation and symbols that each become separate tokens. Our estimate_tokens function uses a 1.3x multiplier on word count, which is a reasonable approximation for English prose.
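A minimal sketch of that estimator, using exactly the 1.3x heuristic described above (a real tokenizer such as tiktoken gives exact counts):

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough token estimate: word count x 1.3, rounded up."""
    words = text.split()
    if not words:
        return 0
    return math.ceil(len(words) * 1.3)

print(estimate_tokens("Hello world"))  # 2 words * 1.3 = 2.6 -> 3
```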

Estimate token cost for real documents

A 50-page PDF at ~250 words per page uses roughly 16,250 tokens. That fits easily in any modern context window. But add conversation history, system prompts, and room for a response — the budget fills up faster than you'd expect.
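The arithmetic behind that figure, as a quick sanity check:

```python
import math

words = 50 * 250              # 50 pages at ~250 words per page
tokens = math.ceil(words * 1.3)  # same 1.3x heuristic as above
print(tokens)                 # 16250
```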

Build estimate_tokens()
Write Code

Write estimate_tokens(text) that estimates the token count of a string. Use the formula: split on whitespace, multiply the word count by 1.3, and round up with math.ceil. Return an integer.

Return 0 for empty or whitespace-only strings.


The Token Budget — Input + Output = Context

How much of the context window does your response actually get? This is the question most developers forget to ask. The context window is shared: input tokens (your prompt) and output tokens (the model's reply) must fit together inside it.

Visualize the token budget breakdown

In this scenario, 123K tokens of input leave only 5K for the reply — but max_tokens caps it at 4,096. The model gets barely 3.2% of the window for its response. Long conversations silently erode the output budget until the model starts cutting its own answers short.
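The same arithmetic in code, with the window and cap values from the scenario above:

```python
WINDOW = 128_000    # e.g. a 128K-token model
MAX_TOKENS = 4_096  # API-side cap on the response length
input_tokens = 123_000

remaining = WINDOW - input_tokens           # 5,000 tokens left in the window
output_budget = min(remaining, MAX_TOKENS)  # capped at 4,096 by max_tokens
share = output_budget / WINDOW * 100
print(f"{output_budget:,} tokens for the reply ({share:.1f}% of the window)")
```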

Quick Check: calculate the input budget

Truncation Strategies

Your text exceeds the budget and something has to go. The question isn't whether to truncate — it's where to cut. Three strategies cover the vast majority of use cases, and each works best for a different type of content.

Head Truncation (Keep the Beginning)

Head truncation keeps the first N tokens and drops the rest. This works well for documents where the most important information appears early — executive summaries, news articles, and structured reports with key findings up front.

Head truncation — keep the start

Tail Truncation (Keep the End)

Tail truncation keeps the last N tokens and drops the beginning. This is the right choice for chat history — recent messages matter more than older ones. It's also useful for log files where the latest entries carry the most signal.

Tail truncation — keep the end

Smart-Middle Truncation (Keep Start + End)

Smart-middle truncation preserves the beginning and end while cutting from the middle. This keeps context from both the opening (which often sets the scene) and the conclusion (which often holds the answer). Research shows LLMs attend most strongly to these positions.

Smart-middle truncation — keep start and end
Naive: cut at character limit
def naive_truncate(text, max_chars=500):
    """Cuts mid-word, ignores token count."""
    return text[:max_chars]

# Problems:
# 1. Cuts mid-word ("environme...")
# 2. Character count != token count
# 3. No indication text was truncated
# 4. Always cuts from the end

Better: token-aware with strategy
def smart_truncate(text, max_tokens, strategy="head"):
    """Token-aware, word-boundary, strategy-based."""
    words = text.split()
    max_words = int(max_tokens / 1.3)
    if len(words) <= max_words:
        return text
    if strategy == "head":
        return " ".join(words[:max_words]) + " [...]"
    elif strategy == "tail":
        return "[...] " + " ".join(words[-max_words:])
    else:  # middle
        head_n = int(max_words * 0.6)
        tail_n = max_words - head_n
        dropped = len(words) - max_words
        return (" ".join(words[:head_n])
                + f" [...{dropped} omitted...] "
                + " ".join(words[-tail_n:]))

Implement truncate_middle()
Write Code

Write truncate_middle(text, max_tokens) that keeps the start and end of the text while cutting the middle.

Rules:

  • Convert max_tokens to word count using int(max_tokens / 1.3)
  • If the text already fits, return it unchanged
  • Keep 60% of words from the start, 40% from the end (use int() for both)
  • Join head and tail with ' [...] ' between them
  • Use text.split() and ' '.join() for word operations
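One solution that follows these rules (a reference sketch, not the only valid answer; check it against your own attempt):

```python
def truncate_middle(text: str, max_tokens: int) -> str:
    """Keep ~60% of the word budget from the start, ~40% from the end."""
    words = text.split()
    max_words = int(max_tokens / 1.3)
    if len(words) <= max_words:
        return text  # already fits, return unchanged
    head_n = int(max_words * 0.6)
    tail_n = int(max_words * 0.4)
    return " ".join(words[:head_n]) + " [...] " + " ".join(words[-tail_n:])
```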

The Lost-in-the-Middle Effect

I learned this the hard way while building a document Q&A system. My retrieval pipeline returned the right passages, but the LLM kept ignoring answers that appeared in the middle of a long context. After hours of debugging, I found Liu et al. (2023) — and everything clicked.

Their research showed that LLMs pay strongest attention to information at the beginning and end of the context. Content in the middle gets significantly less attention. This U-shaped pattern explains why smart-middle truncation works: it preserves exactly the positions the model attends to most.
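A toy simulation of that U-shape (the scoring function is illustrative, not taken from the paper):

```python
# Toy model of the U-shaped attention pattern reported by Liu et al. (2023):
# positions near either edge score high, the middle scores low.
def position_score(i: int, n: int) -> float:
    """Score in [0, 1]: closeness to the nearer edge of an n-item context."""
    edge_dist = min(i, n - 1 - i) / (n / 2)
    return round(1.0 - edge_dist, 2)

scores = [position_score(i, 10) for i in range(10)]
print(scores)  # high at both ends, lowest in the middle
```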

Simulating the U-shaped attention pattern
Predict the output: where does attention drop?

Build a ContextBudget Manager

The cleanest approach to managing context is a dedicated class that handles estimation, allocation, and truncation in one place. Scattering token math across your codebase is a recipe for subtle bugs — one miscalculation and your API calls silently lose critical information.

The complete ContextBudget class

Four methods, each with a single responsibility. allocate() does the math, truncate() handles cutting, fit() combines them, and report() makes results readable. Each method is independently testable.
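One way such a class might look. The four method names follow the description above; everything else (the 20% output reservation, tail truncation inside fit()) is an assumption for the sketch:

```python
import math

class ContextBudget:
    def __init__(self, window_size: int, output_fraction: float = 0.2):
        self.window_size = window_size
        # Assumption: reserve 20% of the window for the model's reply
        self.output_reserved = int(window_size * output_fraction)

    @staticmethod
    def estimate(text: str) -> int:
        """Word count x 1.3, rounded up (same heuristic as earlier)."""
        words = text.split()
        return math.ceil(len(words) * 1.3) if words else 0

    def allocate(self, *parts: str) -> int:
        """Input tokens left after reserving output and counting parts."""
        used = sum(self.estimate(p) for p in parts)
        return self.window_size - self.output_reserved - used

    def truncate(self, text: str, max_tokens: int) -> str:
        """Tail truncation: keep the most recent words."""
        words = text.split()
        max_words = int(max_tokens / 1.3)
        if len(words) <= max_words:
            return text
        return "[...] " + " ".join(words[-max_words:])

    def fit(self, history: str, *fixed_parts: str) -> str:
        """Truncate history into whatever budget the fixed parts leave."""
        budget = self.allocate(*fixed_parts)
        return self.truncate(history, max(budget, 0))

    def report(self, *parts: str) -> str:
        used = sum(self.estimate(p) for p in parts)
        return (f"window={self.window_size} "
                f"reserved={self.output_reserved} used={used}")
```

Usage is a single call: budget = ContextBudget(128_000), then budget.fit(history, system_prompt, user_message) before every API request.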

ContextBudget in action — auto-fitting long history
Implement allocate_budget()
Write Code

Write allocate_budget(window_size, system_prompt, user_message) that returns a dictionary with token allocations.

Rules:

  • Reserve 20% of the window for output: output_reserved = int(window_size * 0.2)
  • Estimate tokens for system prompt and user message using math.ceil(word_count * 1.3)
  • Calculate max_input = window_size - output_reserved
  • Calculate history_budget = max_input - system_tokens - user_tokens
  • Return a dict with keys: output_reserved, system_tokens, user_tokens, history_budget, max_input
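For reference, one implementation that satisfies every rule above (a sketch; check it against your own attempt):

```python
import math

def allocate_budget(window_size, system_prompt, user_message):
    # Word count x 1.3, rounded up, per the rules above
    est = lambda text: math.ceil(len(text.split()) * 1.3)
    output_reserved = int(window_size * 0.2)
    system_tokens = est(system_prompt)
    user_tokens = est(user_message)
    max_input = window_size - output_reserved
    return {
        "output_reserved": output_reserved,
        "system_tokens": system_tokens,
        "user_tokens": user_tokens,
        "history_budget": max_input - system_tokens - user_tokens,
        "max_input": max_input,
    }
```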

Common Mistakes and How to Fix Them

These four pitfalls account for most context-window bugs I've encountered in production. Each one is straightforward to prevent once you know it exists.

1. Ignoring Output Token Allocation

Developers fill the entire context window with input and set max_tokens to a huge number. The API either returns an error or silently truncates the response mid-sentence. Always subtract your desired output length from the window before computing input budgets.

2. Not Accounting for System Prompts

A detailed system prompt can easily use 500-2,000 tokens. If your budget ignores this overhead, you'll hit limits exactly when it matters — in production with real users sending long messages.

3. Using Character Count Instead of Token Count

Characters vs words vs tokens diverge badly
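A small demonstration of the divergence. The word-based estimate badly undercounts URLs and code, since a real tokenizer splits punctuation into separate tokens; tiktoken or your provider's tokenizer gives exact numbers:

```python
import math

samples = {
    "prose": "The quick brown fox jumps over the lazy dog",
    "url": "https://api.example.com/v2/users?id=42&sort=desc",
    "code": "result = [x**2 for x in range(10) if x % 2 == 0]",
}

for label, text in samples.items():
    words = len(text.split())
    est = math.ceil(words * 1.3)  # word-based heuristic from this lesson
    # The URL is one "word" but would tokenize into many pieces
    print(f"{label:6s} chars={len(text):3d} words={words:2d} ~tokens={est}")
```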

4. Hardcoding Window Sizes

Model window sizes change with every release. GPT-4 started at 8K, jumped to 32K, then 128K. Hardcoding 128000 throughout your codebase means a tedious search-and-replace every time a model updates. Use a config dictionary instead.

Centralized model configuration
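A minimal sketch of that pattern, with the model names and sizes cited earlier in this lesson (treat them as a snapshot):

```python
# Single source of truth for window sizes; update one place per release.
MODEL_WINDOWS = {
    "gpt-4o": 128_000,
    "claude-3.5-sonnet": 200_000,
    "gemini-1.5-pro": 2_000_000,
}

def window_for(model: str) -> int:
    """Look up a window size; fail loudly for unconfigured models."""
    try:
        return MODEL_WINDOWS[model]
    except KeyError:
        raise ValueError(f"No window size configured for {model!r}")
```

Failing loudly matters here: a silent default would reproduce exactly the hardcoding bug this pattern is meant to prevent.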

Summary and Next Steps

The context window is the single most important constraint in LLM application design. Every prompt, every conversation, every document you send must fit within it — along with the response you want back.

Here is what we covered: tokens are the real currency (not words or characters), input and output share the same budget, truncation strategy depends on content type, the middle of the context gets the least attention, and a dedicated ContextBudget class keeps the math out of your application logic.

Next in the learning path, we move from managing context to crafting what goes inside it — prompt engineering patterns that make your LLM calls more reliable and effective.


Frequently Asked Questions

Can I increase a model's context window?

No. The context window is fixed during model training. Some providers offer extended-context variants, but you cannot change the limit for a given model through API parameters. Your options are choosing a larger model or reducing your input.

Does using the full context window cost more?

Yes. Most APIs charge per token for both input and output. Sending 100K tokens of context costs roughly 10x more than sending 10K tokens. Longer contexts also increase latency. Send only what you actually need.

What happens if I exceed the context window?

The API returns an error (usually HTTP 400) with a message about exceeding the maximum context length. Your request is not processed at all. This is why pre-flight token estimation matters — catch overflows before making the API call.

Is bigger context always better?

Not necessarily. The lost-in-the-middle effect means retrieval accuracy can drop for facts placed in the middle of long contexts. Targeted retrieval with RAG often outperforms dumping everything into a giant context window.


References

  • Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
  • OpenAI. (2025). GPT-4o Model Card. platform.openai.com/docs/models
  • Anthropic. (2025). Claude Model Documentation. docs.anthropic.com/claude/docs/models-overview
  • Google DeepMind. (2024). Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens. arXiv:2403.05530.
  • Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016.
  • Gage, P. (1994). A New Algorithm for Data Compression. C Users Journal, 12(2).