
How LLMs Work: From Prompt to Response, Step by Step

Beginner · 45 min · 3 exercises · 55 XP

You type a question into ChatGPT and a fluent answer appears in seconds. I spent weeks treating LLMs as magic boxes until I traced what actually happens between keypress and response — and prompt engineering suddenly made sense. This tutorial walks through every stage of that pipeline with runnable Python simulations.

What Happens When You Press Enter

Picture this: you type "The capital of France is" and press Enter. The model doesn't read your sentence the way you do — it can't see words, only numbers. So the first thing it does is chop your text into small pieces called tokens and convert each one to an ID. This was my first "aha" moment with LLMs.

From there, the model runs five stages in sequence. Here's the entire pipeline in pseudocode — we'll build each stage in real Python over the next sections.

The LLM Pipeline — Pseudocode
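Here is that pseudocode fleshed out as a runnable Python sketch. Every function body is a toy stand-in (a six-word vocabulary and a hard-coded "prediction"), not a real model or library API; the point is the shape of the loop, which the following sections fill in with real math.

```python
# A toy five-stage pipeline. Every function is a stand-in for the
# real operation described later in the tutorial.

VOCAB = ["The", "capital", "of", "France", "is", "Paris"]

def tokenize(text):
    # Stage 1: text -> token IDs (real LLMs use subword tokens)
    return [VOCAB.index(w) for w in text.split()]

def predict_next(ids):
    # Stages 2-4 collapsed: embed, attend, score the vocabulary.
    # This stub just "knows" one pattern, to show the loop shape.
    return VOCAB.index("Paris") if ids[-1] == VOCAB.index("is") else VOCAB.index("is")

def generate(prompt, max_new_tokens=1):
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        ids.append(predict_next(ids))   # Stage 5: append the new token, loop
    return " ".join(VOCAB[i] for i in ids)

print(generate("The capital of France is"))  # The capital of France is Paris
```

The loop at the bottom is the part to remember: generation is just "predict one token, append it, repeat."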

The key insight: an LLM is a next-token prediction machine on a loop. Every feature you've seen — conversations, code generation, translation — emerges from repeating these five stages hundreds of times. The Python simulations in this tutorial use tiny dimensions and vocabularies to make the math visible. Real LLMs use the same operations at much larger scale.

Tokens: How LLMs Read Text

Why can't the model just use whole words? Because there are millions of possible words (misspellings, compound words, other languages), and a vocabulary that large would be impossibly slow. Instead, LLMs use subword tokens — pieces that balance vocabulary size against flexibility.

The word "unhappiness" might become three tokens: ["un", "happi", "ness"] or ["un", "happ", "iness"], depending on the tokenizer. Common words like "the" stay whole. Rare words get split. This is why API pricing is in tokens, not words.

Real tokenizers like BPE (Byte Pair Encoding) need C libraries that can't run in the browser. But we can simulate the core idea — splitting text into subword pieces and assigning IDs — with pure Python.

This simulation assigns each unique word fragment a numeric ID, just like a real tokenizer's vocabulary lookup. Watch how the same text maps to different token counts depending on how we split it.

Simulating Tokenization
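A sketch of that simulation. The hand-written split table below stands in for the merges a real BPE tokenizer learns from data; the splits themselves are invented for illustration.

```python
# A simulated subword tokenizer: a hand-written split table stands in
# for the merges a real BPE tokenizer would learn from data.
SPLITS = {
    "capital": ["cap", "ital"],
    "France": ["Fr", "ance"],
    "beautiful": ["beaut", "iful"],
}

def simple_tokenize(text):
    pieces = []
    for word in text.split():
        pieces.extend(SPLITS.get(word, [word]))  # common words stay whole
    # Assign each unique piece a numeric ID, like a vocabulary lookup
    vocab, ids = {}, []
    for p in pieces:
        vocab.setdefault(p, len(vocab))
        ids.append(vocab[p])
    return pieces, ids

pieces, ids = simple_tokenize("The capital of France is beautiful")
print(pieces)  # ['The', 'cap', 'ital', 'of', 'Fr', 'ance', 'is', 'beaut', 'iful']
print(len("The capital of France is beautiful".split()), "words ->", len(pieces), "tokens")
```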

Notice that 6 words became 9 tokens because "capital", "France", and "beautiful" each got split in two. In real LLMs, a rule of thumb is that 1 word ≈ 1.3 tokens for English text. That ratio matters for cost estimation and context window planning.

I use token counting constantly in production. Before sending a long document to an API, I estimate the token count to check whether it fits in the context window and how much the call will cost.

Estimating Token Count and Cost
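One way to sketch that check, using the 1-word ≈ 1.3-token heuristic. The price used here is a placeholder, not a current API rate.

```python
# Rough token/cost estimator using the 1 word ~ 1.3 token heuristic.
# The price below is an illustrative placeholder, not a real rate.
import math

def estimate_tokens(text):
    return math.ceil(len(text.split()) * 1.3)

def estimate_cost(text, usd_per_million_input_tokens=10.0):
    return estimate_tokens(text) / 1_000_000 * usd_per_million_input_tokens

doc = "word " * 10_000  # stands in for a ~10,000-word document
print(estimate_tokens(doc))          # 13000
print(f"${estimate_cost(doc):.4f}")  # $0.1300
```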
Exercise 1: Build a Token Estimator
Write Code

Write a function estimate_tokens(text) that takes a string and returns an estimated token count.

Rules:

  • Split the text on whitespace to count words
  • Multiply the word count by 1.3 (the English approximation ratio)
  • Return the result rounded UP to the nearest integer
  • Use numpy.ceil for rounding
  • Example: "Hello world" → 2 words × 1.3 = 2.6 → 3 tokens

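One possible solution, as a sketch that follows the rules above:

```python
# A sketch solution for Exercise 1, following the stated rules.
import numpy as np

def estimate_tokens(text):
    words = len(text.split())          # Rule 1: count whitespace-separated words
    return int(np.ceil(words * 1.3))   # Rules 2-4: scale by 1.3, round up

print(estimate_tokens("Hello world"))  # 3
```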

Embeddings: Giving Tokens Meaning

Token 487 and token 12,903. Which one means "happy"? You can't tell — and neither can the model, because token IDs are arbitrary numbers with no meaning. The model needs a way to represent what each token means, and that's what embeddings do.

An embedding is a list of numbers (a vector) that represents a token's meaning. Think of it like GPS coordinates, but instead of latitude and longitude, you have hundreds of dimensions that capture different aspects of meaning.

In a real LLM, embeddings are learned during training and have hundreds to thousands of dimensions — OpenAI's embedding models, for example, use 1,536. I find it helpful to work with tiny 4-dimensional vectors where you can see exactly what's happening. The math is identical, just smaller.

This code creates a miniature embedding table for five tokens and measures how similar they are using cosine similarity — the standard metric for comparing embeddings.

Building a Tiny Embedding Table
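A sketch of such a table. Real embeddings are learned from data; the values here are invented so that the four dimensions read roughly as [royalty, masculinity, power, object-ness].

```python
# A hand-built embedding table: five tokens, four dimensions each.
# Values are invented so the dimensions are interpretable:
# [royalty, masculinity, power, object-ness].
import numpy as np

EMBEDDINGS = {
    "king":   np.array([0.9,  0.7, 0.8, 0.1]),
    "queen":  np.array([0.9, -0.7, 0.8, 0.1]),
    "throne": np.array([0.8,  0.0, 0.6, 0.9]),
    "man":    np.array([0.1,  0.7, 0.1, 0.1]),
    "woman":  np.array([0.1, -0.7, 0.1, 0.1]),
}

for token, vec in EMBEDDINGS.items():
    print(f"{token:>6}: {vec}")
```

Notice that "king" and "queen" differ only in the second (gender) dimension, the pattern the similarity code below will pick up.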

Now the interesting part: measuring similarity. Two vectors pointing in the same direction have a cosine similarity near 1.0. Perpendicular vectors score 0.0. Opposite directions score -1.0.

Cosine Similarity Between Token Embeddings
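Cosine similarity is just the dot product divided by the product of the vector lengths. The embedding values below are the same invented 4-dimensional vectors used above.

```python
# Cosine similarity between embeddings: dot product divided by the
# product of the vector lengths. Embedding values are invented.
import numpy as np

EMBEDDINGS = {
    "king":   np.array([0.9,  0.7, 0.8, 0.1]),
    "queen":  np.array([0.9, -0.7, 0.8, 0.1]),
    "throne": np.array([0.8,  0.0, 0.6, 0.9]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for t1, t2 in [("king", "throne"), ("king", "queen")]:
    print(f"{t1} vs {t2}: {cosine_similarity(EMBEDDINGS[t1], EMBEDDINGS[t2]):.2f}")
```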

Look at the results. "King" and "throne" should show high similarity (both are royalty-related). "King" and "queen" should also be similar but with a difference in the gender dimension. These relationships emerge naturally from training on billions of sentences.

Attention: How the Model Decides What Matters

Here's a sentence: "The bank approved the loan because it had sufficient collateral." What does "it" refer to? You know instantly — "it" refers to the loan, not the bank. But how would a model figure that out from a flat list of embedding vectors?

This is the problem attention solves. Each token gets to "look at" every other token and decide how much to pay attention to it. The word "it" would assign high attention to "loan" and "collateral," low attention to "The" and "the."

Attention in Plain Language
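In plain Python terms: each token holds a table of how relevant every other token is to it, and those relevance weights sum to 1. The numbers below are invented for illustration, not computed by any model.

```python
# Plain-language attention: each token scores every other token for
# relevance, then mixes their content in proportion to those scores.
# These relevance numbers are invented for illustration.
relevance_of_it = {          # how much "it" attends to each word
    "The": 0.01, "bank": 0.10, "approved": 0.04, "the": 0.01, "loan": 0.84,
}

assert abs(sum(relevance_of_it.values()) - 1.0) < 1e-9  # weights sum to 1
print('"it" attends mostly to:', max(relevance_of_it, key=relevance_of_it.get))
```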

The math is surprisingly compact — dot products between queries and keys give attention scores, softmax makes them sum to 1, and those weights mix the values. Let's build it from scratch with numpy.

Attention from Scratch with NumPy
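A minimal sketch of the query/key score computation for a toy 3-token sequence. The embeddings and projection matrices are random stand-ins for what a real model learns during training.

```python
# Scaled dot-product attention scores for a toy 3-token sequence.
# X and the projection matrices are random stand-ins for learned weights.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                # embedding dimension
X = rng.normal(size=(3, d))          # three token embeddings

W_q = rng.normal(size=(d, d))        # learned in a real model
W_k = rng.normal(size=(d, d))
Q, K = X @ W_q, X @ W_k              # queries and keys

scores = Q @ K.T / np.sqrt(d)        # raw attention scores, shape (3, 3)
print(scores.round(2))
```

Dividing by the square root of the dimension keeps the scores from growing with vector size, which would otherwise saturate the softmax in the next step.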

Those raw scores need to be converted into probabilities. Softmax ensures each token's attention weights sum to 1 — so each token allocates a fixed budget of attention across all other tokens.

Applying Softmax to Get Attention Weights
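A sketch of that step, using a small invented score matrix. The identity value matrix makes the weighted mixing easy to read at a glance.

```python
# Softmax turns each row of raw scores into weights that sum to 1;
# the weights then mix the value vectors. The scores are invented.
import numpy as np

scores = np.array([[2.0, 0.5, 0.1],
                   [0.3, 1.5, 0.2],
                   [0.1, 0.4, 2.2]])
V = np.eye(3)                        # identity values keep the mixing readable

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax(scores)
output = weights @ V                 # each token's mix of the value vectors
print(weights.round(2))              # every row sums to 1
print(output.round(2))
```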

Our example used a single attention mechanism, but real transformers run dozens in parallel. Each "head" learns to focus on different relationships — grammar, semantics, position.

Exercise 2: Compute Attention Scores
Write Code

Write a function compute_attention_weights(query, keys) that:

1. Computes the dot product between the query vector and each key vector

2. Divides all scores by the square root of the vector dimension (for numerical stability)

3. Applies softmax to convert scores into weights that sum to 1.0

The function should return a 1D numpy array of attention weights.

Softmax formula: softmax(x_i) = exp(x_i) / sum(exp(x_j) for all j)

Tip: Subtract the max value before computing exp to avoid overflow.

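One possible solution, as a sketch that follows the three steps above, with a small made-up query/keys example:

```python
# A sketch solution for Exercise 2, following the three steps above.
import numpy as np

def compute_attention_weights(query, keys):
    d = len(query)
    scores = keys @ query / np.sqrt(d)   # steps 1-2: dot products, scaled
    scores = scores - scores.max()       # tip: subtract max before exp
    weights = np.exp(scores)
    return weights / weights.sum()       # step 3: softmax, sums to 1.0

q  = np.array([1.0, 0.0])
ks = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(compute_attention_weights(q, ks).round(3))
```

The key aligned with the query gets the largest weight, the orthogonal key less, and the opposite-direction key least.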

Next-Token Prediction: The Core Loop

After dozens of attention layers refine the embeddings, the model outputs a single vector for the last token in the sequence. That vector gets multiplied by the vocabulary matrix to produce a score (called a logit) for every token in the vocabulary.

A vocabulary of 50,000 tokens means 50,000 logits. Higher logit = the model thinks that token is more likely to come next. But logits aren't probabilities yet. We need softmax again — the same function from the attention section.

From Logits to Probabilities
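A sketch with a four-token toy vocabulary. The logit values are invented so that "Paris" ends up with roughly 70% of the probability mass.

```python
# Softmax over toy vocabulary logits. The logit values are invented
# to put roughly 70% of the probability on "Paris".
import numpy as np

vocab  = ["Paris", "London", "Berlin", "pizza"]
logits = np.array([4.0, 2.7, 2.2, -1.0])

probs = np.exp(logits - logits.max())   # subtract max for stability
probs = probs / probs.sum()

for token, p in zip(vocab, probs):
    print(f"{token:>8}: {p:.1%}")
```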

"Paris" has the highest logit and gets ~70% of the probability mass. But notice that "London" and "Berlin" also have non-trivial probabilities. This surprised me at first — the model never outputs a single answer, just a probability distribution over the entire vocabulary.

Temperature: Controlling Randomness

So how does the model pick one token from that distribution? That's the job of the temperature parameter. Before applying softmax, we divide every logit by the temperature value. This single number dramatically changes the output distribution.

How Temperature Changes the Distribution
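A sketch that reshapes the same invented logits at three temperatures. Low temperature sharpens the distribution toward the top token; high temperature flattens it.

```python
# The same invented logits reshaped by different temperatures.
import numpy as np

logits = np.array([4.0, 2.7, 2.2, -1.0])   # Paris, London, Berlin, pizza

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature           # the only change: divide by T
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {softmax_with_temperature(logits, t).round(3)}")
```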

Top-p Sampling: Cutting the Tail

Even with moderate temperature, the model assigns some probability to absurd tokens. Top-p (nucleus) sampling fixes this by only considering tokens whose cumulative probability reaches a threshold p. If the top 3 tokens already account for 95% of the probability, why risk sampling from the other 49,997?

Top-p Sampling Implementation
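A sketch of the filtering step on an invented five-token distribution: keep the smallest set of top tokens whose cumulative probability reaches p, zero out the rest, then renormalize.

```python
# Nucleus (top-p) filtering: keep the smallest set of top tokens whose
# cumulative probability reaches p, zero out the rest, renormalize.
# The probabilities are invented.
import numpy as np

vocab = ["Paris", "London", "Berlin", "pizza", "cat"]
probs = np.array([0.70, 0.15, 0.10, 0.03, 0.02])

def top_p_filter(probs, p=0.9):
    order = np.argsort(probs)[::-1]              # most likely first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()             # renormalize survivors

print(dict(zip(vocab, top_p_filter(probs, p=0.9).round(3))))
```

With p=0.9, the top three tokens (cumulative 0.95) survive and "pizza" and "cat" are cut entirely.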
Temperature=2.0 (Too Random)
# "The capital of France is..."
# temp=2.0 → nearly uniform distribution
# Possible outputs: "pizza", "cat", "the"
# The model "forgot" that Paris is the answer

Temperature=0.0 (Deterministic)
# "The capital of France is..."
# temp→0 → all probability on top token
# Always outputs: "Paris"
# Perfect for factual questions
Exercise 3: Sample with Temperature
Write Code

Write a function sample_with_temperature(logits, vocab, temperature) that:

1. Divides the logits by the temperature

2. Applies softmax to get probabilities

3. Uses np.random.choice to sample one token based on those probabilities

4. Returns the sampled token (a string)

Important: Set np.random.seed(0) inside the function before sampling for reproducible results.

For softmax, subtract the max before computing exp (numerical stability).

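One possible solution, as a sketch that follows the four steps above, tried on invented logits:

```python
# A sketch solution for Exercise 3, following the four steps above.
import numpy as np

def sample_with_temperature(logits, vocab, temperature):
    np.random.seed(0)                        # reproducible, per the rules
    scaled = np.asarray(logits, dtype=float) / temperature
    e = np.exp(scaled - scaled.max())        # stable softmax
    probs = e / e.sum()
    return str(np.random.choice(vocab, p=probs))

print(sample_with_temperature([4.0, 2.0, 1.0], ["Paris", "London", "Berlin"], 0.7))
```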

Pre-Training and RLHF: How Models Learn

We've seen the inference pipeline — how a trained model generates text. But how did the model learn to do any of this? The embedding table, the attention weights, the vocabulary scoring matrix — these are all numbers. Where did they come from?

Training happens in three phases, each building on the last. I remember being surprised that the first phase has nothing to do with conversations or instructions.

Phase 1: Pre-Training (Learning Language)

The model reads trillions of tokens from the internet and learns one task: predict the next token. Each step is simple — predict, compute a loss (how wrong was the prediction?), update the weights to be less wrong next time. Repeat this a trillion times across the entire internet.

Pre-Training Step (Simplified)
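A heavily simplified sketch of the predict/loss/update loop. The "model" here is a single probability it assigns the correct next token, and the update rule is an invented stand-in for gradient descent, not real backpropagation; only the shape of the loop carries over.

```python
# A heavily simplified pre-training loop: the "model" is a single
# probability it assigns the correct next token, and the update rule
# is a stand-in for gradient descent, not real backpropagation.
import numpy as np

p_correct = 0.2          # the model starts out mostly wrong
learning_rate = 0.1

for step in range(5):
    loss = -np.log(p_correct)                    # cross-entropy for one token
    print(f"step {step}: p(correct)={p_correct:.2f}, loss={loss:.2f}")
    p_correct += learning_rate * (1 - p_correct) # nudge toward the right answer
```

The loss falls at every step: the model becomes less wrong about this one token, and pre-training repeats that update across trillions of tokens.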

After pre-training, the model can complete any text pattern it saw during training. But it's not a chatbot yet — it's more like an autocomplete engine for the entire internet. Ask it a question and it might continue with another question, because that's a likely pattern in forum threads.

Phase 2: Supervised Fine-Tuning (Learning Conversations)

Human trainers write thousands of example conversations: "User asks X, assistant responds with Y." The model is trained on these examples to learn the conversation format. After this phase, it responds to questions instead of just completing text.

Phase 3: RLHF (Learning Preferences)

The model generates two different responses to the same prompt. Human raters choose which response is better. A separate "reward model" learns from these preferences, and the LLM is updated to produce responses the reward model scores highly.

RLHF Ranking Example
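A sketch of the ranking step: a toy reward model scores two candidate responses and the higher-scored one is preferred. The responses and scores are invented for illustration.

```python
# An RLHF-style ranking sketch: a toy reward model scores two candidate
# responses and the preferred one wins. All values are invented.
responses = {
    "A": "The capital of France is Paris.",
    "B": "France? Hard to say, maybe Lyon.",
}
reward_scores = {"A": 0.92, "B": 0.31}   # stand-in for a learned reward model

preferred = max(reward_scores, key=reward_scores.get)
print(f"Preferred response: {preferred} -> {responses[preferred]}")
```

In real RLHF the reward scores come from a model trained on thousands of such human comparisons, and the LLM's weights are then updated to raise the score of its own outputs.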

But training also has hard limits. Pre-training data has a cutoff date — the model literally hasn't seen anything published after it. It also can't learn from your conversations unless explicitly fine-tuned.

Common Misconceptions About LLMs

Now that you understand the pipeline, let's correct four wrong mental models that I see constantly — even among experienced engineers.

"LLMs Understand What They're Saying"

An LLM doesn't "know" that Paris is in France. It learned that the token "Paris" has high probability after tokens like "capital of France." The difference matters: understanding implies reasoning from facts, while pattern matching can produce correct-looking answers from wrong reasoning.

"Bigger Models Are Always Better"

A 70B-parameter model will usually beat a 7B model on complex reasoning. But for tasks like text classification, sentiment analysis, or structured extraction, smaller models fine-tuned on task-specific data often outperform larger general models — at a fraction of the cost and latency.

"LLMs Are Just Autocomplete"

Phone autocomplete predicts the next word using simple frequency counts from your typing history. LLMs predict the next token using billions of learned parameters that capture syntax, semantics, world knowledge, and reasoning patterns across trillions of tokens. The mechanism is similar — next-token prediction — but the scale creates qualitatively different capabilities.

"Hallucinations Are Bugs That Will Be Fixed"

Hallucination isn't a bug — it's a consequence of probabilistic text generation. The model picks tokens based on learned probabilities, not by consulting a fact database. Sometimes the statistically likely continuation is factually wrong. RLHF reduces hallucination but can't eliminate it.

Summary and Next Steps

You've now traced the complete pipeline from prompt to response. Every time you press Enter, these five stages run in sequence: Tokenize (text → subword IDs), Embed (IDs → meaning vectors), Attend (tokens exchange context), Predict (score 50K+ vocabulary tokens), Sample (pick one token via temperature and top-p).

Once I internalized this pipeline, I stopped guessing at prompt tricks and started reasoning about what the model actually does. Ready to go deeper? Build a tokenizer from scratch in Build a BPE Tokenizer, or call real LLM APIs in the next tutorial.

FAQ

How Is an LLM Different from a Search Engine?

A search engine retrieves existing documents that match your query. An LLM generates new text token-by-token based on learned patterns. Search engines point you to sources. LLMs synthesize a response — which means they can answer questions no single document covers, but also fabricate plausible-sounding nonsense.

Why Do LLMs Hallucinate?

The architecture has no fact-checking step — it picks statistically likely tokens, not verified ones. In practice, you reduce hallucination by providing context (RAG), lowering temperature for factual tasks, and asking the model to cite sources. You can't eliminate it entirely.

Can I Run an LLM on My Own Computer?

Yes. Tools like Ollama and llama.cpp let you run models locally. A 7B-parameter model needs about 4-8 GB of RAM (quantized). A 70B model needs 40+ GB. For most developers, a local 7B model handles simple tasks well, while complex reasoning still benefits from larger cloud-hosted models.

How Much Does It Cost to Use LLM APIs?

Pricing is per token, split into input and output rates. Frontier models like GPT-4 Turbo cost roughly $10/$30 per million tokens. Smaller models like GPT-4o mini are 10-20x cheaper. For most applications, a single API call costs fractions of a cent.

References

  • Vaswani, A. et al. (2017). Attention Is All You Need. The original transformer paper that introduced the attention mechanism used in all modern LLMs.
  • Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. The GPT-2 paper that demonstrated scaling language models.
  • Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. The InstructGPT paper describing RLHF.
  • Sennrich, R. et al. (2016). Neural Machine Translation of Rare Words with Subword Units. The BPE tokenization paper.
  • Holtzman, A. et al. (2020). The Curious Case of Neural Text Degeneration. Introduces nucleus (top-p) sampling.
  • OpenAI Tokenizer Tool. Interactive tool to see how text is tokenized.
  • Anthropic Research. Papers on constitutional AI, scaling, and safety.