
LLM Sampling Parameters: Temperature, Top-p, and Top-k Explained With Interactive Python Simulations

Beginner · 45 min · 3 exercises · 50 XP

Ask ChatGPT "What is Python?" twice, and you get two different answers. The words change, the structure shifts, sometimes the tone flips entirely. This isn't a bug — it's controlled randomness, and three parameters govern exactly how much randomness the model uses.

What Are Sampling Parameters and Why Do They Matter?

An LLM doesn't output words directly. It produces a list of scores (called logits) for every possible next word, then converts those scores into probabilities using a function called softmax. The model picks the next word by sampling from that probability distribution.

Sampling parameters sit between the raw scores and the final word selection. They reshape the probability distribution — making it sharper, flatter, or cutting off unlikely options entirely. The three most important ones are temperature, top-k, and top-p.

We'll simulate all of this in Python using numpy. No API key required, no cloud calls — just math you can see and tinker with. Here's our toy vocabulary and a set of fake logits to start:

Softmax: from logits to probabilities
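To follow along offline, here is a minimal numpy sketch of the same idea. The vocabulary and logit values are stand-ins I picked for illustration, so the exact percentages will differ from the demo's:

```python
import numpy as np

# Toy vocabulary with made-up logits (illustrative values only,
# not the exact numbers used in the interactive demo).
vocab = ["the", "cat", "sat", "on", "mat", "dog", "big", "hat", "ran", "red"]
logits = np.array([2.0, 1.5, 1.0, 0.5, 0.3, 0.0, -0.5, -1.0, -1.5, -2.0])

def softmax(x):
    # Subtracting the max is a standard numerical-stability trick;
    # softmax is shift-invariant, so the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word:>4}: {p:.1%}")
```

Whatever logits you start from, the output is a valid probability distribution: non-negative and summing to 1.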

"the" dominates at 39.5%, while "red" barely registers at 0.7%. The model strongly prefers "the" as the next token, but there's still a chance it picks something else. Every run of the model could produce a different word.


Temperature — The Creativity Dial

What happens if you divide all logits by a number before passing them through softmax? That number is the temperature, and it's the single most impactful sampling parameter.

The formula is straightforward: softmax(logits / T). Low temperature makes high-probability words more dominant. High temperature flattens the distribution, giving every word a more equal chance. Think of it as a confidence dial.

Temperature changes the probability shape
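The formula is one line of numpy. The logits below are illustrative stand-ins, so the printed percentages will differ from the demo's, but the shape change is the same:

```python
import numpy as np

logits = np.array([2.0, 1.5, 1.0, 0.5, 0.3, 0.0, -0.5, -1.0, -1.5, -2.0])

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def apply_temperature(logits, T):
    # Dividing by T < 1 widens the gaps between logits (sharper
    # distribution); T > 1 shrinks them (flatter distribution).
    return softmax(logits / T)

for T in (0.1, 0.5, 1.0, 2.0):
    p = apply_temperature(logits, T)
    print(f"T={T}: max={p.max():.1%}, min={p.min():.2%}")
```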

At T=0.1, "the" takes 99.3% of the probability mass — the model is nearly deterministic. At T=2.0, the gap between "the" (23.3%) and "red" (3.2%) shrinks dramatically. More words have a realistic shot at being chosen.

Let's see how this plays out when we actually sample words. I'll draw 10 tokens at each temperature to make the effect concrete:

Sampling at different temperatures
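A sketch of that sampling experiment; the seed and logits are my own choices, so the drawn words will differ from the demo's, but the pattern (repetitive at low T, varied at high T) is the same:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so runs are repeatable
vocab = ["the", "cat", "sat", "on", "mat", "dog", "big", "hat", "ran", "red"]
logits = np.array([2.0, 1.5, 1.0, 0.5, 0.3, 0.0, -0.5, -1.0, -1.5, -2.0])

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_tokens(T, n=10):
    # Draw n tokens from the temperature-adjusted distribution.
    probs = softmax(logits / T)
    return list(rng.choice(vocab, size=n, p=probs))

print("T=0.5:", sample_tokens(0.5))
print("T=2.0:", sample_tokens(2.0))
```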

At T=0.5, it's mostly "the" and "cat." At T=2.0, words like "big," "dog," and "hat" start appearing. This is why creative writing applications use higher temperatures — they want variety.

Apply Temperature Scaling
Write Code

Write a function apply_temperature(logits, temperature) that:

1. Divides the logits by the temperature value

2. Applies softmax to get probabilities

3. Returns the probability array

Use the softmax formula: exp(x - max(x)) / sum(exp(x - max(x)))

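If you want to check your work afterwards, here is one possible solution sketch (spoiler; try the exercise first):

```python
import numpy as np

def apply_temperature(logits, temperature):
    # 1. Divide the logits by the temperature.
    scaled = np.asarray(logits, dtype=float) / temperature
    # 2. Apply a numerically stable softmax.
    e = np.exp(scaled - np.max(scaled))
    # 3. Return the probability array.
    return e / e.sum()

print(apply_temperature([2.0, 1.0, 0.0], 0.5))
```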

Top-k — Cutting the Long Tail

Temperature has a problem: even at reasonable settings, the model can still pick extremely unlikely tokens. With a vocabulary of 50,000+ words, there's always a small chance the model outputs something nonsensical. Top-k fixes this by brute force.

The idea is simple — keep only the k most probable tokens and throw everything else out. After removing the low-probability tail, renormalize so the remaining probabilities sum to 1. The model can only choose from the top k candidates.

Top-k filtering in action
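A sketch of the filtering step, using the same stand-in logits as before. With these particular values the k=3 survivors happen to be "the", "cat", and "sat", mirroring the demo:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "dog", "big", "hat", "ran", "red"]
logits = np.array([2.0, 1.5, 1.0, 0.5, 0.3, 0.0, -0.5, -1.0, -1.5, -2.0])
e = np.exp(logits - logits.max())
probs = e / e.sum()

def top_k_probs(probs, k):
    keep = np.argsort(probs)[-k:]        # indices of the k largest probs
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]         # zero out everything else
    return filtered / filtered.sum()     # renormalize to sum to 1

for k in (3, 5):
    p = top_k_probs(probs, k)
    survivors = [w for w, q in zip(vocab, p) if q > 0]
    print(f"k={k}: {survivors}")
```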

With k=3, only "the," "cat," and "sat" survive. The model cannot produce any of the other 7 words, regardless of temperature. With k=5, "on" and "mat" also make the cut, adding slight variety.

Sampling confirms the difference. Notice how k=3 produces very repetitive output while k=10 (all tokens) allows more variety:

Sampling with different k values
Implement Top-k Filtering
Write Code

Write a function top_k_filter(probs, k) that:

1. Finds the indices of the top k probabilities

2. Creates a new array with zeros everywhere except those top-k positions

3. Copies the original probabilities into those positions

4. Renormalizes so the result sums to 1.0

Hint: np.argsort(probs)[-k:] gives the indices of the k largest values.

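One possible solution following the hint (a reference sketch, not the only valid approach):

```python
import numpy as np

def top_k_filter(probs, k):
    probs = np.asarray(probs, dtype=float)
    top_idx = np.argsort(probs)[-k:]     # 1. indices of the top k probs
    filtered = np.zeros_like(probs)      # 2. zeros everywhere
    filtered[top_idx] = probs[top_idx]   # 3. copy the top-k probs back in
    return filtered / filtered.sum()     # 4. renormalize to sum to 1.0

print(top_k_filter([0.5, 0.2, 0.15, 0.1, 0.05], 2))
```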

Top-p (Nucleus Sampling) — Adaptive Vocabulary Size

I find top-p more elegant than top-k. Instead of always keeping a fixed number of tokens, top-p adapts to how confident the model is. The idea: sort tokens by probability, walk down the list, and stop once the cumulative probability crosses a threshold p.

When the model is confident about one word, top-p might keep just 1 or 2 tokens. When the model is uncertain, it might keep 8 or 9. The vocabulary size adjusts automatically to the situation.

How top-p builds the nucleus
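A sketch of nucleus construction with the same stand-in logits (so the cutoff point differs slightly from the demo's). `np.searchsorted` finds the first position where the cumulative sum reaches p:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "dog", "big", "hat", "ran", "red"]
logits = np.array([2.0, 1.5, 1.0, 0.5, 0.3, 0.0, -0.5, -1.0, -1.5, -2.0])
e = np.exp(logits - logits.max())
probs = e / e.sum()

p = 0.8
order = np.argsort(probs)[::-1]          # token indices, most probable first
cumsum = np.cumsum(probs[order])
# Include the token where the cumulative sum first reaches p.
cutoff = int(np.searchsorted(cumsum, p)) + 1
nucleus = [vocab[i] for i in order[:cutoff]]
print(f"p={p}: nucleus={nucleus}, mass={cumsum[cutoff - 1]:.3f}")
```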

With p=0.8, the cutoff happens after "on" (cumsum reaches 0.826), keeping 4 tokens. With p=0.5, only "the" and "cat" survive (cumsum 0.635). The nucleus grows or shrinks based on how spread out the probabilities are.

My favorite part about top-p is how it adapts to confidence. Watch what happens with two very different logit distributions:

Top-p adapts to model confidence
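A sketch of that experiment; the "confident" and "uncertain" logit vectors are values I made up to exaggerate the contrast:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def nucleus_size(logits, p):
    probs = np.sort(softmax(logits))[::-1]   # probabilities, descending
    cumsum = np.cumsum(probs)
    return int(np.searchsorted(cumsum, p)) + 1

# One logit towers over the rest vs. ten nearly equal logits.
confident = np.array([5.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0, -2.5, -3.0])
uncertain = np.array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])

print("confident nucleus size:", nucleus_size(confident, 0.9))
print("uncertain nucleus size:", nucleus_size(uncertain, 0.9))
```

With these values, the same p=0.9 keeps a single token in the confident case and nearly the whole vocabulary in the uncertain one.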

The confident model keeps just 1 token (94.8% on "the" alone already exceeds 0.9). The uncertain model keeps 9 tokens because no single word dominates. Same parameter, completely different behavior — that's the power of nucleus sampling.

Implement Top-p (Nucleus) Filtering
Write Code

Write a function top_p_filter(probs, p) that:

1. Sorts probabilities in descending order

2. Computes the cumulative sum

3. Finds where cumsum first exceeds p — include that token (so the nucleus always contains at least the tokens needed to reach p)

4. Zeros out all tokens NOT in the nucleus

5. Renormalizes so the result sums to 1.0

Hint: np.searchsorted(cumsum, p) finds where p would be inserted in the sorted cumsum array.

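One possible solution sketch, mirroring the numbered steps above (spoiler; attempt the exercise first):

```python
import numpy as np

def top_p_filter(probs, p):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                # 1. sort descending
    cumsum = np.cumsum(probs[order])               # 2. cumulative sum
    cutoff = int(np.searchsorted(cumsum, p)) + 1   # 3. include crossing token
    nucleus = order[:cutoff]
    filtered = np.zeros_like(probs)                # 4. zero out non-nucleus
    filtered[nucleus] = probs[nucleus]
    return filtered / filtered.sum()               # 5. renormalize to 1.0

print(top_p_filter([0.5, 0.3, 0.1, 0.06, 0.04], 0.7))
```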

Combining Parameters in Practice

Parameter presets for different use cases

The pipeline always runs in the same order: temperature first (reshapes the distribution), then top-k or top-p (cuts unlikely tokens), then sample. Each step narrows the possibilities further.

Creative Writing (T=1.2, top_p=0.9)
# High temperature + wide nucleus
# 7 active tokens, diverse samples
# Samples: ['cat', 'dog', 'sat', 'cat', 'the', 'the', 'the', 'mat']
# Good for: brainstorming, fiction, poetry
Factual Q&A (T=0.1, top_k=1)
# Very low temperature + single token
# 1 active token, always picks "the"
# Samples: ['the', 'the', 'the', 'the', 'the', 'the', 'the', 'the']
# Good for: factual answers, deterministic output
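The three-step pipeline can be sketched end to end like this, again with illustrative logits (the sampled words will differ from the preset examples above, but the contrast between settings is the same):

```python
import numpy as np

rng = np.random.default_rng(7)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "big", "hat", "ran", "red"]
logits = np.array([2.0, 1.5, 1.0, 0.5, 0.3, 0.0, -0.5, -1.0, -1.5, -2.0])

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def generate(T, top_p, n=8):
    probs = softmax(logits / T)                    # 1. temperature reshapes
    order = np.argsort(probs)[::-1]                # 2. top-p cuts the tail
    cumsum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumsum, top_p)) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    filtered /= filtered.sum()
    return list(rng.choice(vocab, size=n, p=filtered))  # 3. sample

print("creative (T=1.2, p=0.9):", generate(1.2, 0.9))
print("factual  (T=0.1, p=0.9):", generate(0.1, 0.9))
```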

In my experience, the most common mistake is tuning both temperature and top-p at the same time. Start by adjusting temperature alone. Only add top-p or top-k if you need a hard cutoff on unlikely tokens.


Common Mistakes and How to Fix Them

These are pitfalls I've seen repeatedly, in my own code and in production systems I've reviewed. Each one seems obvious in retrospect, but catches people when they're experimenting.

The most frequent issue is adjusting temperature and top-p simultaneously without understanding their interaction. Temperature reshapes the distribution before top-p slices it, so a high temperature combined with a low top-p can produce unpredictable nucleus sizes.

Temperature changes nucleus size
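A sketch of that interaction with illustrative logits; the exact nucleus counts depend on the logits you start from, but the trend (hotter temperature, larger nucleus) always holds:

```python
import numpy as np

logits = np.array([2.0, 1.5, 1.0, 0.5, 0.3, 0.0, -0.5, -1.0, -1.5, -2.0])

def nucleus_size(logits, T, p):
    # Temperature first, then count how many tokens reach cumulative mass p.
    scaled = logits / T
    e = np.exp(scaled - np.max(scaled))
    probs = np.sort(e / e.sum())[::-1]
    return int(np.searchsorted(np.cumsum(probs), p)) + 1

for T in (0.5, 1.0, 2.0):
    print(f"T={T}: top_p=0.9 keeps {nucleus_size(logits, T, 0.9)} tokens")
```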

At T=0.5, top_p=0.9 keeps only 2 tokens. At T=2.0, the same top_p keeps 8 tokens. If you set both parameters without thinking about this interaction, your output quality becomes hard to predict.


Summary and Next Steps

Three parameters, three different mechanisms. Temperature reshapes the entire probability distribution. Top-k hard-cuts to a fixed number of candidates. Top-p adapts the cutoff based on model confidence. They all sit between raw logits and final token selection.

The practical recipe: start with temperature alone (0.7 for balanced, lower for factual, higher for creative). Add top-p (0.9–0.95) if you want a safety net against unlikely tokens. Use top-k only if you need a strict upper bound on vocabulary size.

These parameters apply to every LLM API — OpenAI, Anthropic, Google, Mistral, local models. Understanding them here with numpy means you'll know exactly what's happening when you pass temperature=0.3 to any API call.


FAQ

Can I use temperature and top-p together?

Yes, but most API providers recommend adjusting only one at a time. Temperature reshapes probabilities first, then top-p slices the result. Changing both simultaneously makes it harder to predict behavior. Start with temperature, add top-p only if needed.

What is the difference between top-k and top-p?

Top-k always keeps exactly k tokens regardless of the distribution shape. Top-p adapts — it keeps fewer tokens when the model is confident and more when the model is uncertain. Top-p is generally preferred because it handles both cases well without manual tuning.

Do these parameters affect the model's intelligence?

No. Sampling parameters only control which token gets selected from the model's predictions. The model computes the same logits regardless of temperature or filtering. A low temperature doesn't make the model smarter — it just makes it pick its top choice more reliably.

Why does temperature=0 sometimes still give different outputs?

Some APIs run on multiple GPU servers with slightly different floating-point rounding. Two servers might compute logits that differ by 0.0001, which can flip the top token when probabilities are close. This is rare but not a bug — it's hardware-level non-determinism.


References

  • Holtzman, A., et al. "The Curious Case of Neural Text Degeneration." ICLR 2020. Introduces nucleus (top-p) sampling.
  • Fan, A., Lewis, M., & Dauphin, Y. "Hierarchical Neural Story Generation." ACL 2018. Introduces top-k sampling for text generation.
  • OpenAI API Reference — Chat Completions. Documents temperature, top_p, frequency_penalty, presence_penalty.
  • Anthropic Claude API — Messages. Documents temperature and top_p parameters.
  • Google Gemini API — Generation Config. Documents temperature, top_p, and top_k.
  • Radford, A., et al. "Language Models are Unsupervised Multitask Learners." OpenAI 2019. GPT-2 paper where top-k sampling gained popularity.