Chain-of-Thought Prompting: Build a Math and Logic Problem Solver in Python
Ask GPT-4o "What is 17 * 28 + 53?" and it sometimes gets the wrong answer. Add five words to that same prompt — "Let's think step by step" — and the accuracy jumps dramatically. Those five words trigger a technique called chain-of-thought prompting, and it is one of the most important tools in your prompt engineering toolkit.
The research behind this is striking — Google Brain researchers showed that chain-of-thought prompting more than tripled PaLM's accuracy on the GSM8K grade-school math benchmark. In this tutorial, you'll implement all three variants (manual, zero-shot, and Auto-CoT), measure the accuracy difference yourself, and walk away with a solver class you can drop into any project.
What Is Chain-of-Thought Prompting?
Chain-of-thought prompting asks the LLM to show its reasoning before giving a final answer. Instead of jumping straight to "the answer is 529," the model writes out intermediate steps: "First, 17 times 28 is 476. Then 476 plus 53 is 529." That explicit reasoning path makes the model far less likely to produce wrong answers on tasks that require multi-step logic.
I think of it this way: when you solve a math problem in your head, you sometimes get it wrong. When you write out each step on paper, you catch mistakes. CoT does the same thing for LLMs — it forces the model to "show its work," and the act of generating intermediate tokens actually shapes the final answer.
Let's see CoT in action. We'll ask the same math question with and without chain-of-thought and compare the results.
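As a sketch, the two prompt variants might look like this. The helper names and exact wording are illustrative choices, and the API call is shown commented out as a reminder of where the prompts plug in:

```python
QUESTION = (
    "A store sells notebooks for $4 each. I buy 7 notebooks and get "
    "a 15% discount on the total. How much change do I get from a "
    "$50 bill?"
)

def direct_prompt(question):
    """No reasoning instructions: the model answers in one jump."""
    return f"{question}\nAnswer with just the final amount."

def cot_prompt(question):
    """Zero-shot CoT: the trigger phrase elicits step-by-step work."""
    return f"{question}\nLet's think step by step."

# Send each variant with your client of choice, e.g.:
# response = await client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": cot_prompt(QUESTION)}],
#     temperature=0,
# )
print(direct_prompt(QUESTION))
print(cot_prompt(QUESTION))
```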
The correct answer: 7 notebooks at $4 = $28, with 15% discount that's $28 * 0.85 = $23.80, so change = $50 - $23.80 = $26.20. The CoT version walks through these exact steps. The direct version sometimes collapses them into one jump and gets the discount math wrong.
Manual Chain-of-Thought — Writing the Reasoning Template
Manual CoT means you provide one or more worked examples in the prompt, showing the model exactly how to reason. This is the original technique from the 2022 Google Brain paper that kicked off the CoT revolution. It's the most reliable approach when you know the reasoning pattern your problem requires.
The idea is simple: you give the model a solved example with explicit reasoning steps, then ask it to solve a new problem in the same style. The model mimics your reasoning structure.
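A sketch of what such a prompt can look like. The bakery worked example here is my own invented demonstration; the target question is the widget problem discussed next:

```python
# One worked example with explicit reasoning steps, followed by the
# new problem in the same Q/A format. The model mimics the structure.
MANUAL_COT_PROMPT = """\
Q: A bakery makes 120 muffins per tray across 3 trays. If 10% of the
muffins are burned, how many can be sold?
A: Step 1: Total muffins = 120 * 3 = 360.
Step 2: 10% burned means 90% are sellable: 360 * 0.90 = 324.
The answer is 324.

Q: A factory runs 4 production lines, each producing 250 widgets per
shift. If 12% of widgets are rejected by quality control, how many
widgets pass inspection per shift?
A:"""

print(MANUAL_COT_PROMPT)
```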
The correct calculation: 250 * 4 = 1,000 widgets total, and 12% rejected means 1,000 * 0.88 = 880 widgets pass. The model follows the same step-by-step structure from our example — calculate the total first, then apply the percentage.
I use manual CoT whenever the reasoning pattern isn't obvious from the question alone. Multi-step percentage calculations, unit conversions, and problems where the order of operations matters — these all benefit from a worked example that demonstrates the exact reasoning flow.
What happens when we use multiple worked examples? More examples generally improve accuracy, but with diminishing returns. Let's build a function that makes this easy to test.
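Here's one way to write such a function; the function name and instruction wording are illustrative choices:

```python
def few_shot_cot_prompt(question, examples):
    """Build a manual CoT prompt from (question, reasoning) pairs.

    More examples generally make the reasoning pattern more robust,
    with diminishing returns after the first few.
    """
    parts = [
        "Solve the following problem. Show your reasoning step by "
        "step before giving the final answer."
    ]
    for ex_question, ex_reasoning in examples:
        parts.append(f"\nExample:\nQ: {ex_question}\nA: {ex_reasoning}")
    parts.append(
        f"\nNow solve this:\nQ: {question}\n"
        "A: Let me work through this step by step."
    )
    return "\n".join(parts)
```

Passing one, two, or three examples now only changes the list you hand in, so testing the effect of example count is a one-line change.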
The correct answer: full year at the monthly rate = 12 * $15 = $180, deal price = 8 * $15 = $120, savings = $180 - $120 = $60. The model follows the same step-by-step decomposition pattern from our example.
Zero-Shot CoT — "Let's Think Step by Step"
Zero-shot CoT requires no worked examples at all. You append "Let's think step by step" (or a similar phrase) to the end of your question, and the model generates its own reasoning chain. This was discovered by Kojima et al. in 2022 — a surprisingly simple trick that produces large accuracy gains on arithmetic, symbolic reasoning, and logic problems.
The phrase itself isn't magic. "Think through this carefully," "Reason about this step by step," and "Show your work" all trigger similar reasoning behavior. What matters is that the prompt explicitly tells the model to generate intermediate reasoning tokens before the final answer.
There is a catch. The raw CoT output mixes reasoning and answer into one block of text. For a pipeline that needs a clean answer, that's a problem. Let's fix that with a two-step approach.
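A minimal sketch of the two-step pattern. The call_llm function is a placeholder with invented canned replies so the flow runs offline; swap in your real client call:

```python
def call_llm(prompt):
    """Placeholder for a real model call (e.g. an OpenAI chat
    completion). The canned replies are invented for illustration."""
    if "Extract only the final answer" in prompt:
        return "$60"
    return ("With the deal, Sam pays for 5 of the 7 books: "
            "5 * $12 = $60. The answer is $60.")

def solve_with_cot(question):
    # Step 1: generate the full reasoning chain.
    reasoning = call_llm(f"{question}\nLet's think step by step.")
    # Step 2: ask the model to pull out just the answer.
    answer = call_llm(
        f"Reasoning: {reasoning}\n"
        "Extract only the final answer. Reply with nothing else."
    )
    return reasoning, answer

reasoning, answer = solve_with_cot(
    "Books cost $12 each with a 'buy 3, get 1 free' deal. "
    "How much does Sam pay for 7 books?"
)
print(answer)  # → $60 (canned by the stub)
```

Log the reasoning for debugging; parse only the answer in your pipeline.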
The correct answer: with "buy 3, get 1 free," Sam gets 1 free book in every group of 3 he takes. 7 books contain two full groups of 3, so 2 books are free and he pays for 5: 5 * $12 = $60. The two-step approach gives you clean output your code can parse, while still getting the accuracy benefit of reasoning.
Practice: Build a CoT Prompt for Logic Problems
Time to practice. This exercise tests whether you understand how to structure a manual chain-of-thought prompt with worked examples.
Write a function build_cot_prompt(question, examples) that builds a chain-of-thought prompt string.
Rules:
- examples is a list of tuples: [(question_str, reasoning_str), ...]
- Start the prompt with: "Solve the following problem. Show your reasoning step by step before giving the final answer."
- For each example, append: "\nExample:\nQ: {question}\nA: {reasoning}"
- End the prompt with: "\nNow solve this:\nQ: {question}\nA: Let me work through this step by step."

Auto-CoT — Generating Examples Automatically
Manual CoT works well, but writing good worked examples takes effort. What if you could have the LLM generate the examples for you? That's the idea behind Auto-CoT, introduced by Zhang et al. in 2022. The process has two phases: first, cluster your questions by type; second, ask the model to generate a reasoning chain for one representative question from each cluster.
In practice, I've found that a simpler version works for most use cases: take a few sample questions, let the model solve them with zero-shot CoT, and use those generated solutions as few-shot examples for harder problems. The key insight is that even imperfect auto-generated examples improve accuracy over zero-shot alone.
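That simplified pipeline can be sketched like this. The call_llm stub returns invented canned text so the code runs offline; replace it with your actual model call:

```python
def call_llm(prompt):
    """Placeholder for a real model call; the canned reply is
    invented so the pipeline runs without an API key."""
    return "Step 1: read the problem. Step 2: compute. The answer is 42."

def auto_generate_examples(sample_questions):
    """Phase 1: solve each sample question with zero-shot CoT and
    keep the (question, generated_reasoning) pairs."""
    return [
        (q, call_llm(f"{q}\nLet's think step by step."))
        for q in sample_questions
    ]

def auto_cot_prompt(question, sample_questions):
    """Phase 2: reuse the generated chains as few-shot examples."""
    examples = auto_generate_examples(sample_questions)
    parts = [f"Q: {q}\nA: {r}\n" for q, r in examples]
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n".join(parts)

print(auto_cot_prompt("What is 40% of 120?", ["What is 10% of 50?"]))
```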
Those auto-generated examples now serve as few-shot demonstrations for harder problems. The model has already produced step-by-step reasoning — we just reuse it.
The correct answer: 40% of 120 = 48 engineers, 10% bonus on $80,000 = $8,000 per engineer, total = 48 * $8,000 = $384,000. Auto-CoT saved us from writing the worked examples by hand, and the model still follows a clean step-by-step reasoning pattern.
Measuring the Accuracy Impact of CoT
I used to take "CoT improves accuracy" at face value until I ran my own benchmarks and discovered it varied wildly by problem type. Some categories saw a 40% jump; others saw zero improvement. The only way to know is to measure. Let's build a simple benchmark that compares direct prompting and zero-shot CoT on problems with known answers.
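Here's a minimal harness in that spirit. The call_llm stub's replies are invented so the script runs offline (it deliberately answers correctly only with the CoT trigger, to make the flow visible); swap in your client and extend PROBLEMS to benchmark for real:

```python
import re

def call_llm(prompt):
    """Placeholder for a real model call with invented canned replies."""
    if "step by step" in prompt:
        return "17 * 28 = 476, and 476 + 53 = 529. The answer is 529."
    return "The answer is 519."

# Problems with known answers, kept as strings for easy comparison.
PROBLEMS = [("What is 17 * 28 + 53?", "529")]

def last_number(text):
    """Pull the final number out of a model response."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def accuracy(use_cot):
    correct = 0
    for question, expected in PROBLEMS:
        suffix = "\nLet's think step by step." if use_cot else ""
        if last_number(call_llm(question + suffix)) == expected:
            correct += 1
    return correct / len(PROBLEMS)

print(f"direct: {accuracy(False):.0%}  cot: {accuracy(True):.0%}")
```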
Your exact results will vary by run, but CoT consistently scores equal or higher than direct prompting on these arithmetic problems. The gap widens on harder problems — multi-step word problems, percentage-of-percentage calculations, and rate-time-distance questions are where CoT truly shines.
Building a Reusable CoT Solver
Let's combine everything into a class that picks the right CoT technique based on whether you provide examples or not. This is the pattern I use in production — a single interface that handles manual CoT, zero-shot CoT, and Auto-CoT behind the scenes.
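A sketch of that class, under my own naming choices. The _call_llm stub returns invented text so the class is runnable as-is; wire it up to your real client before use:

```python
class CoTSolver:
    """One interface for manual, Auto-, and zero-shot CoT.
    The strategy is chosen from the arguments you pass."""

    def _call_llm(self, prompt):
        # Placeholder: replace with your actual API call.
        return "Step 1: ... The answer is 42."

    def solve(self, question, examples=None, sample_questions=None):
        if examples:            # manual CoT: caller-written examples
            prompt = self._few_shot_prompt(question, examples)
        elif sample_questions:  # Auto-CoT: generate examples first
            generated = [
                (q, self._call_llm(f"{q}\nLet's think step by step."))
                for q in sample_questions
            ]
            prompt = self._few_shot_prompt(question, generated)
        else:                   # zero-shot CoT
            prompt = f"{question}\nLet's think step by step."
        # Two-step: full reasoning first, then a clean answer.
        reasoning = self._call_llm(prompt)
        answer = self._call_llm(
            f"Reasoning: {reasoning}\nExtract only the final answer."
        )
        return {"reasoning": reasoning, "answer": answer}

    def _few_shot_prompt(self, question, examples):
        parts = [f"Q: {q}\nA: {r}\n" for q, r in examples]
        parts.append(f"Q: {question}\nA: Let's think step by step.")
        return "\n".join(parts)
```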
The _call_llm helper keeps API calls DRY. The zero-shot strategy uses the same two-step approach we built earlier — generate reasoning, then extract a clean answer — and the manual and auto strategies follow the same pattern.
The correct answer: 85 * $500 = $42,500 total bonuses, taxed at 22% means each person keeps 78%, so $42,500 * 0.78 = $33,150. Both strategies should arrive at this. The class gives you a clean API — pass examples for manual CoT, pass sample_questions for Auto-CoT, or pass neither for zero-shot.
When Chain-of-Thought Doesn't Help (and Hurts)
CoT isn't a universal improvement. I've seen cases where it actually degrades performance, and understanding when is as important as knowing how to use it.
Simple factual recall doesn't benefit from CoT. If you ask "What is the capital of France?" and add "Let's think step by step," the model might overthink: "Well, France is a country in Europe. Its major cities include Paris, Lyon, Marseille..." and sometimes talk itself into a wrong answer by considering too many alternatives. For direct knowledge retrieval, skip CoT entirely.
Classification tasks with clear categories are another case. Sentiment analysis ("Is this review positive or negative?") works better with direct prompting. CoT adds reasoning tokens that just restate the obvious — "The review mentions 'terrible' and 'disappointed,' so the sentiment is negative." That reasoning doesn't improve accuracy; it just costs more tokens.
# Unnecessary CoT for factual recall
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content":
        "What is the boiling point of water in "
        "Celsius? Let's think step by step."
    }],
    temperature=0,
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.total_tokens}")

# Direct prompting — faster, cheaper, same accuracy
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content":
        "What is the boiling point of water in "
        "Celsius? Answer with just the number."
    }],
    temperature=0,
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.total_tokens}")

Common Mistakes and How to Fix Them
After building CoT prompts for dozens of production applications, I see the same mistakes repeated. Here are the top four.
Mistake 1: Worked examples that don't match the reasoning pattern. You provide an addition example for a percentage problem. The model follows the addition structure and gets confused when percentages appear. Fix: your examples must share the same reasoning structure as the target problem, not just the same topic.
Mistake 2: Asking for reasoning but ignoring it. You add "think step by step" but then parse only the last line. If the model's reasoning contains an error, you miss it. Fix: log the full reasoning for debugging and use the two-step extraction approach to get a clean answer separately.
Mistake 3: Using CoT at temperature > 0 for deterministic problems. If you need a specific correct answer to a math problem, use temperature=0. Higher temperatures introduce randomness in the reasoning chain, which can cascade — one wrong intermediate step leads to a wrong final answer. Save temperature > 0 for creative tasks.
Mistake 4: Too many steps in the reasoning chain. More steps aren't always better. If your worked example has 12 steps for a 3-step problem, the model might invent unnecessary intermediate calculations. Keep your examples concise — match the number of reasoning steps to the actual complexity of the problem.
Practice: Accuracy Comparison Function
In this exercise, you'll build a function that compares accuracy between two sets of results — the kind of analysis you'd need when benchmarking CoT against direct prompting.
Write a function compare_accuracy(direct_results, cot_results) that takes two lists of booleans (True = correct, False = wrong) and returns a dictionary with the comparison.
Rules:
- direct_results and cot_results are lists of booleans of the same length
- Return a dict with keys "direct_accuracy" (float, 0 to 1), "cot_accuracy" (float, 0 to 1), "improvement" (float, cot - direct), and "winner" (string: "cot", "direct", or "tie")
- "winner" is "cot" if cot accuracy > direct accuracy, "direct" if direct > cot, "tie" if equal

Frequently Asked Questions
Does CoT work with all LLMs?
CoT is most effective on larger models (GPT-4o, Claude 3.5, Gemini 1.5 Pro). Smaller models with fewer than ~10B parameters often don't benefit — they struggle to generate coherent reasoning chains. If you're using a small model, test before assuming CoT will help.
How many few-shot examples should I use for manual CoT?
For most tasks, 2-3 examples hit the sweet spot. One example teaches the format but gives the model only one reasoning pattern to mimic. Two to three examples show variation and make the pattern more robust. Beyond 4-5 examples, you hit diminishing returns and start consuming significant context window space.
Can I combine CoT with system prompts?
Yes, and it's a powerful combination. Set the system prompt to define the role and constraints ("You are a math tutor. Always show your work."), then use CoT in the user message for the specific problem. The system prompt ensures consistent formatting across all responses.
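A sketch of the shape, with illustrative wording:

```python
# System prompt sets the persistent role; CoT goes in the user turn.
messages = [
    {"role": "system",
     "content": "You are a math tutor. Always show your work."},
    {"role": "user",
     "content": "What is 17 * 28 + 53? Let's think step by step."},
]
# response = await client.chat.completions.create(
#     model="gpt-4o-mini", messages=messages, temperature=0)
print(messages)
```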
What's the difference between CoT and "scratchpad" prompting?
Scratchpad prompting asks the model to use a designated area (e.g., <scratchpad>...</scratchpad> tags) for intermediate work. It's CoT with explicit structure markers. This is useful when you need to parse the reasoning separately from the answer — the tags make extraction trivial.
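For instance, a small parser for tagged output might look like this (the helper name is my own):

```python
import re

def split_scratchpad(response):
    """Separate <scratchpad> reasoning from the final answer.
    Assumes the prompt told the model to put its intermediate work
    inside <scratchpad>...</scratchpad> tags."""
    match = re.search(r"<scratchpad>(.*?)</scratchpad>", response,
                      re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<scratchpad>.*?</scratchpad>", "", response,
                    flags=re.DOTALL).strip()
    return reasoning, answer

reasoning, answer = split_scratchpad(
    "<scratchpad>17 * 28 = 476; 476 + 53 = 529</scratchpad>\n"
    "The answer is 529."
)
print(answer)  # → The answer is 529.
```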