
Context Windows Explained: Build a Smart Token-Budget Manager in Python

Intermediate · 60 min · 3 exercises · 55 XP

You paste a 50-page PDF into ChatGPT and get back: "This conversation has exceeded the model's maximum context length." That wall you just hit is the context window — and understanding it changes how you design every LLM application.

What Is a Context Window and Why Should You Care?

Picture a desk with a fixed surface area. Everything the model reads and writes must fit on that desk — your system prompt, the conversation history, the user's latest message, and the model's response. That desk is the context window, measured in tokens.

Different models have wildly different desk sizes. GPT-4o gives you 128K tokens. Claude 3.5 offers 200K. Gemini 1.5 Pro stretches to 2 million. Picking the right model for your use case starts with knowing these numbers.

Provider context window comparison
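A minimal sketch of the comparison, using the window sizes cited above (treat the numbers as a snapshot, not a guarantee; they change with model releases):

```python
# Window sizes as cited above (a snapshot; check provider docs).
WINDOWS = {"GPT-4o": 128_000, "Claude 3.5": 200_000, "Gemini 1.5 Pro": 2_000_000}
DOC_TOKENS = 16_250  # a 50-page PDF, estimated later in this lesson

for model, window in WINDOWS.items():
    # How many such PDFs fit in each window, context for the raw numbers
    print(f"{model:15s} {window:>9,} tokens  (~{window // DOC_TOKENS} such PDFs)")
```

Even the smallest of these windows holds several book-length documents, which is why the real constraint is usually the accumulated conversation, not any single file.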

Context windows keep growing, but they are never infinite. Even with Gemini's 2M tokens, a single large codebase can exceed the limit. Larger contexts also mean higher latency and cost — stuffing everything in is rarely the right call.


Counting Tokens — The Real Currency

Build a simple token estimator

Word count alone is misleading. Code and URLs contain punctuation and symbols that each become separate tokens. Our estimate_tokens function uses a 1.3x multiplier on word count, which is a reasonable approximation for English prose.
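A minimal sketch of that estimator, using exactly the 1.3x heuristic described above (a real tokenizer such as tiktoken gives exact counts):

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough token estimate: word count x 1.3, rounded up."""
    words = text.split()
    if not words:
        return 0
    return math.ceil(len(words) * 1.3)

print(estimate_tokens("Hello world"))  # 2 words * 1.3 = 2.6 -> 3
```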

Estimate token cost for real documents

A 50-page PDF at ~250 words per page uses roughly 16,250 tokens. That fits easily in any modern context window. But add conversation history, system prompts, and room for a response — the budget fills up faster than you'd expect.
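The arithmetic behind that figure, as a quick sanity check:

```python
import math

words = 50 * 250              # 50 pages at ~250 words per page
tokens = math.ceil(words * 1.3)  # same 1.3x heuristic as above
print(tokens)                 # 16250
```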

Build estimate_tokens()
Write Code

Write estimate_tokens(text) that estimates the token count of a string. Use the formula: split on whitespace, multiply the word count by 1.3, and round up with math.ceil. Return an integer.

Return 0 for empty or whitespace-only strings.


The Token Budget — Input + Output = Context

How much of the context window does your response actually get? This is the question most developers forget to ask. The context window is shared: input tokens (your prompt) and output tokens (the model's reply) must fit together inside it.

Visualize the token budget breakdown

In this scenario, 123K tokens of input leave only 5K for the reply — but max_tokens caps it at 4,096. The model gets barely 3.2% of the window for its response. Long conversations silently erode the output budget until the model starts cutting its own answers short.
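The same arithmetic in code, with the window and cap values from the scenario above:

```python
WINDOW = 128_000    # e.g. a 128K-token model
MAX_TOKENS = 4_096  # API-side cap on the response length
input_tokens = 123_000

remaining = WINDOW - input_tokens           # 5,000 tokens left in the window
output_budget = min(remaining, MAX_TOKENS)  # capped at 4,096 by max_tokens
share = output_budget / WINDOW * 100
print(f"{output_budget:,} tokens for the reply ({share:.1f}% of the window)")
```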

Quick Check: calculate the input budget

Truncation Strategies

Your text exceeds the budget and something has to go. The question isn't whether to truncate — it's where to cut. Three strategies cover the vast majority of use cases, and each works best for a different type of content.

Head Truncation (Keep the Beginning)

Head truncation keeps the first N tokens and drops the rest. This works well for documents where the most important information appears early — executive summaries, news articles, and structured reports with key findings up front.

Head truncation — keep the start

Tail Truncation (Keep the End)

Tail truncation keeps the last N tokens and drops the beginning. This is the right choice for chat history — recent messages matter more than older ones. It's also useful for log files where the latest entries carry the most signal.

Tail truncation — keep the end

Smart-Middle Truncation (Keep Start + End)

Smart-middle truncation preserves the beginning and end while cutting from the middle. This keeps context from both the opening (which often sets the scene) and the conclusion (which often holds the answer). Research shows LLMs attend most strongly to these positions.

Smart-middle truncation — keep start and end
Naive: cut at character limit
def naive_truncate(text, max_chars=500):
    """Cuts mid-word, ignores token count."""
    return text[:max_chars]

# Problems:
# 1. Cuts mid-word ("environme...")
# 2. Character count != token count
# 3. No indication text was truncated
# 4. Always cuts from the end

Better: token-aware with strategy
def smart_truncate(text, max_tokens, strategy="head"):
    """Token-aware, word-boundary, strategy-based."""
    words = text.split()
    max_words = int(max_tokens / 1.3)
    if len(words) <= max_words:
        return text
    if strategy == "head":
        return " ".join(words[:max_words]) + " [...]"
    elif strategy == "tail":
        return "[...] " + " ".join(words[-max_words:])
    else:  # middle
        head_n = int(max_words * 0.6)
        tail_n = max_words - head_n
        dropped = len(words) - max_words
        return (" ".join(words[:head_n])
                + f" [...{dropped} omitted...] "
                + " ".join(words[-tail_n:]))

Implement truncate_middle()
Write Code

Write truncate_middle(text, max_tokens) that keeps the start and end of the text while cutting the middle.

Rules:

  • Convert max_tokens to word count using int(max_tokens / 1.3)
  • If the text already fits, return it unchanged
  • Keep 60% of words from the start, 40% from the end (use int() for both)
  • Join head and tail with ' [...] ' between them
  • Use text.split() and ' '.join() for word operations
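One solution that follows these rules (a reference sketch, not the only valid answer; check it against your own attempt):

```python
def truncate_middle(text: str, max_tokens: int) -> str:
    """Keep ~60% of the word budget from the start, ~40% from the end."""
    words = text.split()
    max_words = int(max_tokens / 1.3)
    if len(words) <= max_words:
        return text  # already fits, return unchanged
    head_n = int(max_words * 0.6)
    tail_n = int(max_words * 0.4)
    return " ".join(words[:head_n]) + " [...] " + " ".join(words[-tail_n:])
```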

The Lost-in-the-Middle Effect

I learned this the hard way while building a document Q&A system. My retrieval pipeline returned the right passages, but the LLM kept ignoring answers that appeared in the middle of a long context. After hours of debugging, I found Liu et al. (2023) — and everything clicked.

Their research showed that LLMs pay strongest attention to information at the beginning and end of the context. Content in the middle gets significantly less attention. This U-shaped pattern explains why smart-middle truncation works: it preserves exactly the positions the model attends to most.
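A toy simulation of that U-shape (the scoring function is illustrative, not taken from the paper):

```python
# Toy model of the U-shaped attention pattern reported by Liu et al. (2023):
# positions near either edge score high, the middle scores low.
def position_score(i: int, n: int) -> float:
    """Score in [0, 1]: closeness to the nearer edge of an n-item context."""
    edge_dist = min(i, n - 1 - i) / (n / 2)
    return round(1.0 - edge_dist, 2)

scores = [position_score(i, 10) for i in range(10)]
print(scores)  # high at both ends, lowest in the middle
```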

Simulating the U-shaped attention pattern
Predict the output: where does attention drop?

Build a ContextBudget Manager

The cleanest approach to managing context is a dedicated class that handles estimation, allocation, and truncation in one place. Scattering token math across your codebase is a recipe for subtle bugs — one miscalculation and your API calls silently lose critical information.

The complete ContextBudget class

Four methods, each with a single responsibility. allocate() does the math, truncate() handles cutting, fit() combines them, and report() makes results readable. Each method is independently testable.
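One way such a class might look. The four method names follow the description above; everything else (the 20% output reservation, tail truncation inside fit()) is an assumption for the sketch:

```python
import math

class ContextBudget:
    def __init__(self, window_size: int, output_fraction: float = 0.2):
        self.window_size = window_size
        # Assumption: reserve 20% of the window for the model's reply
        self.output_reserved = int(window_size * output_fraction)

    @staticmethod
    def estimate(text: str) -> int:
        """Word count x 1.3, rounded up (same heuristic as earlier)."""
        words = text.split()
        return math.ceil(len(words) * 1.3) if words else 0

    def allocate(self, *parts: str) -> int:
        """Input tokens left after reserving output and counting parts."""
        used = sum(self.estimate(p) for p in parts)
        return self.window_size - self.output_reserved - used

    def truncate(self, text: str, max_tokens: int) -> str:
        """Tail truncation: keep the most recent words."""
        words = text.split()
        max_words = int(max_tokens / 1.3)
        if len(words) <= max_words:
            return text
        return "[...] " + " ".join(words[-max_words:])

    def fit(self, history: str, *fixed_parts: str) -> str:
        """Truncate history into whatever budget the fixed parts leave."""
        budget = self.allocate(*fixed_parts)
        return self.truncate(history, max(budget, 0))

    def report(self, *parts: str) -> str:
        used = sum(self.estimate(p) for p in parts)
        return (f"window={self.window_size} "
                f"reserved={self.output_reserved} used={used}")
```

Usage is a single call: budget = ContextBudget(128_000), then budget.fit(history, system_prompt, user_message) before every API request.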

ContextBudget in action — auto-fitting long history
Implement allocate_budget()
Write Code

Write allocate_budget(window_size, system_prompt, user_message) that returns a dictionary with token allocations.

Rules:

  • Reserve 20% of the window for output: output_reserved = int(window_size * 0.2)
  • Estimate tokens for system prompt and user message using math.ceil(word_count * 1.3)
  • Calculate max_input = window_size - output_reserved
  • Calculate history_budget = max_input - system_tokens - user_tokens
  • Return a dict with keys: output_reserved, system_tokens, user_tokens, history_budget, max_input
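For reference, one implementation that satisfies every rule above (a sketch; check it against your own attempt):

```python
import math

def allocate_budget(window_size, system_prompt, user_message):
    # Word count x 1.3, rounded up, per the rules above
    est = lambda text: math.ceil(len(text.split()) * 1.3)
    output_reserved = int(window_size * 0.2)
    system_tokens = est(system_prompt)
    user_tokens = est(user_message)
    max_input = window_size - output_reserved
    return {
        "output_reserved": output_reserved,
        "system_tokens": system_tokens,
        "user_tokens": user_tokens,
        "history_budget": max_input - system_tokens - user_tokens,
        "max_input": max_input,
    }
```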

Common Mistakes and How to Fix Them

These four pitfalls account for most context-window bugs I've encountered in production. Each one is straightforward to prevent once you know it exists.

1. Ignoring Output Token Allocation

Developers fill the entire context window with input and set max_tokens to a huge number. The API either returns an error or silently truncates the response mid-sentence. Always subtract your desired output length from the window before computing input budgets.

2. Not Accounting for System Prompts

A detailed system prompt can easily use 500-2,000 tokens. If your budget ignores this overhead, you'll hit limits exactly when it matters — in production with real users sending long messages.

3. Using Character Count Instead of Token Count

Characters vs words vs tokens diverge badly
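A small demonstration of the divergence. The word-based estimate badly undercounts URLs and code, since a real tokenizer splits punctuation into separate tokens; tiktoken or your provider's tokenizer gives exact numbers:

```python
import math

samples = {
    "prose": "The quick brown fox jumps over the lazy dog",
    "url": "https://api.example.com/v2/users?id=42&sort=desc",
    "code": "result = [x**2 for x in range(10) if x % 2 == 0]",
}

for label, text in samples.items():
    words = len(text.split())
    est = math.ceil(words * 1.3)  # word-based heuristic from this lesson
    # The URL is one "word" but would tokenize into many pieces
    print(f"{label:6s} chars={len(text):3d} words={words:2d} ~tokens={est}")
```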

4. Hardcoding Window Sizes

Model window sizes change with every release. GPT-4 started at 8K, jumped to 32K, then 128K. Hardcoding 128000 throughout your codebase means a tedious search-and-replace every time a model updates. Use a config dictionary instead.

Centralized model configuration
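A minimal sketch of that pattern, with the model names and sizes cited earlier in this lesson (treat them as a snapshot):

```python
# Single source of truth for window sizes; update one place per release.
MODEL_WINDOWS = {
    "gpt-4o": 128_000,
    "claude-3.5-sonnet": 200_000,
    "gemini-1.5-pro": 2_000_000,
}

def window_for(model: str) -> int:
    """Look up a window size; fail loudly for unconfigured models."""
    try:
        return MODEL_WINDOWS[model]
    except KeyError:
        raise ValueError(f"No window size configured for {model!r}")
```

Failing loudly matters here: a silent default would reproduce exactly the hardcoding bug this pattern is meant to prevent.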

Summary and Next Steps

The context window is the single most important constraint in LLM application design. Every prompt, every conversation, every document you send must fit within it — along with the response you want back.

Here is what we covered: tokens are the real currency (not words or characters), input and output share the same budget, truncation strategy depends on content type, the middle of the context gets the least attention, and a dedicated ContextBudget class keeps the math out of your application logic.

Next in the learning path, we move from managing context to crafting what goes inside it — prompt engineering patterns that make your LLM calls more reliable and effective.


Frequently Asked Questions

Can I increase a model's context window?

No. The context window is fixed during model training. Some providers offer extended-context variants, but you cannot change the limit for a given model through API parameters. Your options are choosing a larger model or reducing your input.

Does using the full context window cost more?

Yes. Most APIs charge per token for both input and output. Sending 100K tokens of context costs roughly 10x more than sending 10K tokens. Longer contexts also increase latency. Send only what you actually need.

What happens if I exceed the context window?

The API returns an error (usually HTTP 400) with a message about exceeding the maximum context length. Your request is not processed at all. This is why pre-flight token estimation matters — catch overflows before making the API call.

Is bigger context always better?

Not necessarily. The lost-in-the-middle effect means retrieval accuracy can drop for facts placed in the middle of long contexts. Targeted retrieval with RAG often outperforms dumping everything into a giant context window.


References

  • Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
  • OpenAI. (2025). GPT-4o Model Card. platform.openai.com/docs/models
  • Anthropic. (2025). Claude Model Documentation. docs.anthropic.com/claude/docs/models-overview
  • Google DeepMind. (2024). Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens. arXiv:2403.05530.
  • Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016.
  • Gage, P. (1994). A New Algorithm for Data Compression. C Users Journal, 12(2).