
Chain-of-Thought Prompting: Build a Math and Logic Problem Solver in Python

Intermediate · 60 min · 2 exercises · 35 XP

Ask GPT-4o "What is 17 * 28 + 53?" and it sometimes gets the wrong answer. Add five words to that same prompt — "Let's think step by step" — and the accuracy jumps dramatically. Those five words trigger a technique called chain-of-thought prompting, and it is one of the most important tools in your prompt engineering toolkit.

The research behind this is striking — Google Brain showed that chain-of-thought prompting roughly tripled PaLM's solve rate on the GSM8K grade-school math benchmark. In this tutorial, you'll implement all three variants, measure the accuracy difference yourself, and walk away with a solver class you can drop into any project.

What Is Chain-of-Thought Prompting?

Chain-of-thought prompting asks the LLM to show its reasoning before giving a final answer. Instead of jumping straight to "the answer is 529," the model writes out intermediate steps: "First, 17 times 28 is 476. Then 476 plus 53 is 529." That explicit reasoning path makes the model far less likely to produce wrong answers on tasks that require multi-step logic.

I think of it this way: when you solve a math problem in your head, you sometimes get it wrong. When you write out each step on paper, you catch mistakes. CoT does the same thing for LLMs — it forces the model to "show its work," and the act of generating intermediate tokens actually shapes the final answer.

Let's see CoT in action. We'll ask the same math question with and without chain-of-thought and compare the results.

Direct prompting — the model jumps to an answer
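The editor content isn't preserved here, so this is a sketch of what the direct-prompting cell might contain. The question wording is my reconstruction of the notebook problem analyzed below, and `client` is assumed to be an `openai.AsyncOpenAI()` instance as in the later examples.

```python
# Direct prompting: one call, no reasoning requested (sketch; wording assumed)
DIRECT_PROMPT = (
    "A notebook costs $4. Sam buys 7 notebooks with a 15% discount on the "
    "total and pays with a $50 bill. How much change does he get? "
    "Answer with only the dollar amount."
)

async def ask_direct(client, model: str = "gpt-4o-mini") -> str:
    # temperature=0 makes the comparison with CoT repeatable
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": DIRECT_PROMPT}],
        temperature=0,
    )
    return response.choices[0].message.content
```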
CoT prompting — the model reasons through each step
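The CoT variant is identical except for the trigger phrase appended to the question (again a sketch; the exact cell isn't shown):

```python
# Zero-shot CoT: same question, plus the reasoning trigger (sketch)
COT_PROMPT = (
    "A notebook costs $4. Sam buys 7 notebooks with a 15% discount on the "
    "total and pays with a $50 bill. How much change does he get? "
    "Let's think step by step."
)

async def ask_cot(client, model: str = "gpt-4o-mini") -> str:
    # The only change from the direct version is the prompt's final sentence
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": COT_PROMPT}],
        temperature=0,
    )
    return response.choices[0].message.content
```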

The correct answer: 7 notebooks at $4 = $28, with 15% discount that's $28 * 0.85 = $23.80, so change = $50 - $23.80 = $26.20. The CoT version walks through these exact steps. The direct version sometimes collapses them into one jump and gets the discount math wrong.

Manual Chain-of-Thought — Writing the Reasoning Template

Manual CoT means you provide one or more worked examples in the prompt, showing the model exactly how to reason. This is the original technique from the 2022 Google Brain paper that kicked off the CoT revolution. It's the most reliable approach when you know the reasoning pattern your problem requires.

The idea is simple: you give the model a solved example with explicit reasoning steps, then ask it to solve a new problem in the same style. The model mimics your reasoning structure.

Manual CoT with a worked example
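A sketch of what a one-example manual CoT prompt could look like. The worked example here is my own illustration; the target question matches the widget problem analyzed below.

```python
# Manual CoT: one solved example teaches the reasoning structure (sketch)
WORKED_EXAMPLE = (
    "Q: A box holds 12 eggs and a bakery uses 3 boxes per day. "
    "How many eggs does it use in 5 days?\n"
    "A: Step 1: eggs per day = 12 * 3 = 36. "
    "Step 2: eggs in 5 days = 36 * 5 = 180. The answer is 180."
)

NEW_QUESTION = (
    "Q: A factory has 4 machines that each make 250 widgets per day. "
    "12% of widgets are rejected. How many widgets pass inspection per day?\n"
    "A:"
)

# The model continues after the final "A:", mimicking the example's steps.
# Send `prompt` with the same chat.completions call as the earlier cells.
prompt = WORKED_EXAMPLE + "\n\n" + NEW_QUESTION
```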

The correct calculation: 250 * 4 = 1000 widgets total, 12% rejected means 1000 * 0.88 = 880 widgets pass. The model follows the same step-by-step structure from our example — calculate the total first, then apply the percentage.

I use manual CoT whenever the reasoning pattern isn't obvious from the question alone. Multi-step percentage calculations, unit conversions, and problems where the order of operations matters — these all benefit from a worked example that demonstrates the exact reasoning flow.

What happens when we use multiple worked examples? More examples generally improve accuracy, but with diminishing returns. Let's build a function that makes this easy to test.

Reusable manual CoT function
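One plausible shape for such a helper (the course's exact signature isn't shown here). It takes any number of `(question, reasoning)` pairs, so testing one, two, or three examples is a one-line change:

```python
# Reusable manual CoT prompt builder (sketch; names are mine)
def manual_cot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot chain-of-thought prompt from worked examples."""
    parts = [f"Q: {ex_q}\nA: {ex_reasoning}" for ex_q, ex_reasoning in examples]
    parts.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(parts)
```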

The correct answer: full year at monthly rate = 12 * $15 = $180, deal price = 8 * $15 = $120, savings = $180 - $120 = $60. The model follows the same step-by-step decomposition pattern from our example.

Zero-Shot CoT — "Let's Think Step by Step"

Zero-shot CoT on three different problem types
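A sketch of applying the same zero-shot trigger to the three problem types named below — arithmetic, symbolic reasoning, and logic. The example problems are my own illustrations; only the arithmetic one comes from this tutorial.

```python
# The same suffix works across problem types (sketch; problems illustrative)
ZERO_SHOT_SUFFIX = " Let's think step by step."

problems = {
    "arithmetic": "What is 17 * 28 + 53?",
    "symbolic": "Take the last letter of each word in 'chain of thought' "
                "and concatenate them.",
    "logic": "All bloops are razzies and all razzies are lazzies. "
             "Are all bloops lazzies?",
}

# Each prompt would be sent as a user message, exactly as in the earlier cells
prompts = {kind: question + ZERO_SHOT_SUFFIX for kind, question in problems.items()}
```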

Zero-shot CoT requires no worked examples at all. You append "Let's think step by step" (or a similar phrase) to the end of your question, and the model generates its own reasoning chain. This was discovered by Kojima et al. in 2022 — a surprisingly simple trick that produces large accuracy gains on arithmetic, symbolic reasoning, and logic problems.

The phrase itself isn't magic. "Think through this carefully," "Reason about this step by step," and "Show your work" all trigger similar reasoning behavior. What matters is that the prompt explicitly tells the model to generate intermediate reasoning tokens before the final answer.

There is a catch. The raw CoT output mixes reasoning and answer into one block of text. For a pipeline that needs a clean answer, that's a problem. Let's fix that with a two-step approach.

Two-step zero-shot CoT with answer extraction
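A sketch of the two-step pattern: first let the model reason freely, then feed the reasoning back and ask for only the final value. `client` is assumed to be an `AsyncOpenAI()` instance; the extraction wording is one reasonable choice, not the course's exact text.

```python
# Two-step zero-shot CoT: generate reasoning, then extract a clean answer
async def solve_two_step(client, question: str, model: str = "gpt-4o-mini"):
    async def ask(prompt: str) -> str:
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content

    # Step 1: free-form chain of thought
    reasoning = await ask(question + " Let's think step by step.")
    # Step 2: feed the reasoning back and request only the final value
    answer = await ask(
        f"{question}\n{reasoning}\n"
        "Therefore, the final answer (just the value, nothing else) is:"
    )
    return reasoning, answer.strip()
```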

The correct answer: with "buy 3, get 1 free," Sam gets 1 free book for every 3 books in his stack. His 7 books contain two full groups of 3, so 2 books are free and he pays for 5. 5 * $12 = $60. The two-step approach gives you clean output your code can parse, while still getting the accuracy benefit of reasoning.

Practice: Build a CoT Prompt for Logic Problems

Time to practice. This exercise tests whether you understand how to structure a manual chain-of-thought prompt with worked examples.

Build a Chain-of-Thought Prompt Builder
Write Code

Write a function build_cot_prompt(question, examples) that builds a chain-of-thought prompt string.

Rules:

  • examples is a list of tuples: [(question_str, reasoning_str), ...]
  • The prompt starts with: "Solve the following problem. Show your reasoning step by step before giving the final answer."
  • Each example is formatted as: "\nExample:\nQ: {question}\nA: {reasoning}"
  • After all examples, add: "\nNow solve this:\nQ: {question}\nA: Let me work through this step by step."
  • Return the complete prompt string

Auto-CoT — Generating Examples Automatically

Manual CoT works well, but writing good worked examples takes effort. What if you could have the LLM generate the examples for you? That's the idea behind Auto-CoT, introduced by Zhang et al. in 2022. The process has two phases: first, cluster your questions by type; second, ask the model to generate a reasoning chain for one representative question from each cluster.

In practice, I've found that a simpler version works for most use cases: take a few sample questions, let the model solve them with zero-shot CoT, and use those generated solutions as few-shot examples for harder problems. The key insight is that even imperfect auto-generated examples improve accuracy over zero-shot alone.

Auto-generating CoT examples from simpler problems
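The simplified version described above might look like this sketch: solve each sample question zero-shot and keep the `(question, reasoning)` pairs as demonstrations. The function name is mine; `client` is assumed to be an `AsyncOpenAI()` instance.

```python
# Simplified Auto-CoT, phase one: let the model write its own worked examples
async def auto_generate_examples(client, sample_questions,
                                 model: str = "gpt-4o-mini"):
    examples = []
    for question in sample_questions:
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": question + " Let's think step by step."}],
            temperature=0,
        )
        # Keep the generated reasoning chain paired with its question
        examples.append((question, response.choices[0].message.content))
    return examples
```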

Those auto-generated examples now serve as few-shot demonstrations for harder problems. The model has already produced step-by-step reasoning — we just reuse it.

Using auto-generated examples for harder problems
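Stitching the generated pairs into a few-shot prompt for a harder problem is plain string assembly (helper name is mine):

```python
# Reuse auto-generated (question, reasoning) pairs as demonstrations (sketch)
def few_shot_prompt(hard_question: str, examples: list[tuple[str, str]]) -> str:
    demos = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in examples)
    return f"{demos}\n\nQ: {hard_question}\nA:"
```

The resulting string is sent as a single user message, just like the manual CoT prompts earlier.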

The correct answer: 40% of 120 = 48 engineers, 10% bonus on $80,000 = $8,000 per engineer, total = 48 * $8,000 = $384,000. Auto-CoT saved us from writing the worked examples by hand, and the model still follows a clean step-by-step reasoning pattern.

Measuring the Accuracy Impact of CoT

I used to take "CoT improves accuracy" at face value until I ran my own benchmarks and discovered it varied wildly by problem type. Some categories saw a 40% jump; others saw zero improvement. The only way to know is to measure. Let's build a simple benchmark that compares direct prompting and zero-shot CoT on problems with known answers.

Benchmarking direct vs CoT accuracy on math problems
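A minimal harness for such a benchmark might look like this sketch. `ask_direct` and `ask_cot` are callables mapping a question to the model's final answer string (each would wrap an LLM call in practice), and `problems` is a list of `(question, expected_answer)` pairs; exact-match scoring is an assumption.

```python
# Score two prompting strategies on problems with known answers (sketch)
def benchmark(problems, ask_direct, ask_cot):
    direct_hits = sum(ask_direct(q).strip() == expected for q, expected in problems)
    cot_hits = sum(ask_cot(q).strip() == expected for q, expected in problems)
    total = len(problems)
    return {"direct_accuracy": direct_hits / total,
            "cot_accuracy": cot_hits / total}
```

Exact string matching is deliberately strict; it's another reason to use the two-step extraction step so the model returns just the value.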

Your exact results will vary by run, but CoT consistently scores equal or higher than direct prompting on these arithmetic problems. The gap widens on harder problems — multi-step word problems, percentage-of-percentage calculations, and rate-time-distance questions are where CoT truly shines.

Building a Reusable CoT Solver

Let's combine everything into a class that picks the right CoT technique based on whether you provide examples or not. This is the pattern I use in production — a single interface that handles manual CoT, zero-shot CoT, and Auto-CoT behind the scenes.

CoTSolver class — core and zero-shot strategy
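One way the core might look (a sketch; method names other than `_call_llm` are my own, and `client` is assumed to be an `AsyncOpenAI()` instance):

```python
# CoTSolver core: shared LLM call plus the zero-shot strategy (sketch)
class CoTSolver:
    """Solves problems with zero-shot CoT; other strategies plug in alongside."""

    def __init__(self, client, model: str = "gpt-4o-mini"):
        self.client = client
        self.model = model

    async def _call_llm(self, prompt: str) -> str:
        # Single place for API parameters keeps every strategy DRY
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content

    async def solve_zero_shot(self, question: str) -> dict:
        # Two-step pattern: reason first, then extract a clean answer
        reasoning = await self._call_llm(question + " Let's think step by step.")
        answer = await self._call_llm(
            f"{question}\n{reasoning}\n"
            "Therefore, the final answer (just the value) is:"
        )
        return {"reasoning": reasoning, "answer": answer.strip()}
```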

The _call_llm helper keeps API calls DRY. The zero-shot strategy uses the same two-step approach we built earlier — generate reasoning, then extract a clean answer. Adding the manual and auto strategies follows the same pattern.

Adding manual and auto strategies
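Sketches of the two remaining strategies, written here as free functions so they stand alone; `call_llm` is any async prompt-to-text callable, like the `_call_llm` helper described above. Names and signatures are my own.

```python
# Manual CoT: prepend worked (question, reasoning) demonstrations (sketch)
async def solve_manual(call_llm, question, examples):
    demos = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in examples)
    return await call_llm(f"{demos}\n\nQ: {question}\nA:")

# Auto-CoT: zero-shot-solve the samples, then reuse them as demonstrations
async def solve_auto(call_llm, question, sample_questions):
    examples = []
    for q in sample_questions:
        examples.append((q, await call_llm(q + " Let's think step by step.")))
    return await solve_manual(call_llm, question, examples)
```

Inside a solver class these would become `_solve_manual` and `_solve_auto` methods dispatched on whether `examples` or `sample_questions` was passed.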
Testing the CoTSolver with different strategies

The correct answer: 85 * $500 = $42,500 total bonuses, taxed at 22% means each person keeps 78%, so $42,500 * 0.78 = $33,150. Both strategies should arrive at this. The class gives you a clean API — pass examples for manual CoT, pass sample_questions for Auto-CoT, or pass neither for zero-shot.

When Chain-of-Thought Doesn't Help (and Hurts)

CoT isn't a universal improvement. I've seen cases where it actually degrades performance, and understanding when that happens is as important as knowing how to use it.

Simple factual recall doesn't benefit from CoT. If you ask "What is the capital of France?" and add "Let's think step by step," the model might overthink: "Well, France is a country in Europe. Its major cities include Paris, Lyon, Marseille..." and sometimes talk itself into a wrong answer by considering too many alternatives. For direct knowledge retrieval, skip CoT entirely.

Classification tasks with clear categories are another case. Sentiment analysis ("Is this review positive or negative?") works better with direct prompting. CoT adds reasoning tokens that just restate the obvious — "The review mentions 'terrible' and 'disappointed,' so the sentiment is negative." That reasoning doesn't improve accuracy; it just costs more tokens.

CoT overkill for simple tasks

# Unnecessary CoT for factual recall
# (top-level await works in notebooks; elsewhere wrap the call in asyncio.run)
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content":
        "What is the boiling point of water in "
        "Celsius? Let's think step by step."
    }],
    temperature=0,
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.total_tokens}")

Direct prompting for simple tasks

# Direct prompting — faster, cheaper, same accuracy
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content":
        "What is the boiling point of water in "
        "Celsius? Answer with just the number."
    }],
    temperature=0,
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.total_tokens}")

Common Mistakes and How to Fix Them

After building CoT prompts for dozens of production applications, I see the same mistakes repeated. Here are the top four.

Mistake 1: Worked examples that don't match the reasoning pattern. You provide an addition example for a percentage problem. The model follows the addition structure and gets confused when percentages appear. Fix: your examples must share the same reasoning structure as the target problem, not just the same topic.

Mistake 2: Asking for reasoning but ignoring it. You add "think step by step" but then parse only the last line. If the model's reasoning contains an error, you miss it. Fix: log the full reasoning for debugging and use the two-step extraction approach to get a clean answer separately.

Mistake 3: Using CoT at temperature > 0 for deterministic problems. If you need a specific correct answer to a math problem, use temperature=0. Higher temperatures introduce randomness in the reasoning chain, which can cascade — one wrong intermediate step leads to a wrong final answer. Save temperature > 0 for creative tasks.

Mistake 4: Too many steps in the reasoning chain. More steps aren't always better. If your worked example has 12 steps for a 3-step problem, the model might invent unnecessary intermediate calculations. Keep your examples concise — match the number of reasoning steps to the actual complexity of the problem.

Practice: Accuracy Comparison Function

In this exercise, you'll build a function that compares accuracy between two sets of results — the kind of analysis you'd need when benchmarking CoT against direct prompting.

Build an Accuracy Comparison Reporter
Write Code

Write a function compare_accuracy(direct_results, cot_results) that takes two lists of booleans (True = correct, False = wrong) and returns a dictionary with the comparison.

Rules:

  • direct_results and cot_results are lists of booleans of the same length
  • Return a dict with keys: "direct_accuracy" (float, 0 to 1), "cot_accuracy" (float, 0 to 1), "improvement" (float, cot - direct), and "winner" (string: "cot", "direct", or "tie")
  • Accuracy is calculated as: count of True / total count
  • "winner" is "cot" if cot accuracy > direct accuracy, "direct" if direct > cot, "tie" if equal

Frequently Asked Questions

Does CoT work with all LLMs?

CoT is most effective on larger models (GPT-4o, Claude 3.5, Gemini 1.5 Pro). Smaller models with fewer than ~10B parameters often don't benefit — they struggle to generate coherent reasoning chains. If you're using a small model, test before assuming CoT will help.

How many few-shot examples should I use for manual CoT?

For most tasks, 2-3 examples hit the sweet spot. One example teaches the format but gives the model only one reasoning pattern to mimic. Two to three examples show variation and make the pattern more robust. Beyond 4-5 examples, you hit diminishing returns and start consuming significant context window space.

Can I combine CoT with system prompts?

Yes, and it's a powerful combination. Set the system prompt to define the role and constraints ("You are a math tutor. Always show your work."), then use CoT in the user message for the specific problem. The system prompt ensures consistent formatting across all responses.

What's the difference between CoT and "scratchpad" prompting?

Scratchpad prompting asks the model to use a designated area (e.g., <scratchpad>...</scratchpad> tags) for intermediate work. It's CoT with explicit structure markers. This is useful when you need to parse the reasoning separately from the answer — the tags make extraction trivial.

References

  • Wei, J., Wang, X., Schuurmans, D., et al. — "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022.
  • Kojima, T., Gu, S. S., Reid, M., et al. — "Large Language Models are Zero-Shot Reasoners." NeurIPS 2022.
  • Zhang, Z., Zhang, A., Li, M., Smola, A. — "Automatic Chain of Thought Prompting in Large Language Models." ICLR 2023.
  • OpenAI API Documentation — Chat Completions.
  • Wang, X., Wei, J., Schuurmans, D., et al. — "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR 2023.
  • Yao, S., Yu, D., Zhao, J., et al. — "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." NeurIPS 2023.