Advanced Reasoning: Tree-of-Thought, Self-Consistency, and Skeleton-of-Thought
Chain-of-thought prompting gives your LLM a single reasoning path. But what if that path is wrong? A student who solves a math problem once and hands it in is far less reliable than one who solves it three different ways and checks whether the answers agree. That is the core insight behind the techniques in this tutorial.
You will build three advanced reasoning systems in Python: Tree-of-Thought (explore multiple reasoning branches and pick the best), Self-Consistency (sample many answers and let majority voting surface the correct one), and Skeleton-of-Thought (generate an outline first, then flesh it out in parallel for speed). By the end, you will have a decision framework for choosing the right technique for any task.
Why Chain-of-Thought Is Not Enough
I remember the first time chain-of-thought prompting failed me spectacularly. I had a planning problem — scheduling five meetings across three rooms with constraints — and CoT produced a confident, step-by-step solution that violated two constraints. The reasoning looked right at every step, but the model never backtracked to reconsider an early decision that turned out to be wrong.
This is the fundamental limitation: chain-of-thought is a single pass. The model commits to its first interpretation and plows forward. For problems that require exploring alternatives, comparing candidate solutions, or recovering from wrong turns, one reasoning chain is not enough.
Three techniques address this, each from a different angle: self-consistency samples many complete reasoning chains and votes on the answer, tree-of-thought branches and prunes at intermediate steps, and skeleton-of-thought restructures generation around an outline for speed.
Self-Consistency — Majority Voting for Correct Answers
I am starting with Self-Consistency because it is the simplest to implement and the most broadly useful. The idea, introduced by Wang et al. (2022), is almost embarrassingly straightforward: ask the LLM the same question multiple times with temperature turned up, extract the final answer from each response, and pick the answer that appears most often.
Why does this work? Because with temperature > 0 the model samples different reasoning paths each time. Some paths lead to the right answer, some lead to wrong ones. But correct reasoning paths tend to converge on the same answer, while wrong paths scatter across different incorrect answers. Majority voting filters out the noise.
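A quick toy model makes this concrete. Under the simplifying assumptions that each sample is independently correct with probability p and that wrong answers scatter rather than agree, the chance that a majority of n samples is correct is a binomial tail:

```python
from math import comb

def majority_correct(p, n=5):
    """P(a strict majority of n independent samples is correct), per-sample accuracy p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

# With 70%-accurate individual samples, five votes lift accuracy to ~84%.
print(round(majority_correct(0.7, 5), 3))  # 0.837
```

The real gain is usually larger than this model suggests, precisely because wrong paths tend to disagree with each other while correct paths converge.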
Here is the full implementation. The self_consistency function takes a prompt, samples it n times at a given temperature, extracts the final numerical answer from each response, and returns the majority answer along with all the sampled reasoning paths.
The classic test for self-consistency is a word problem where a single CoT attempt sometimes gets the arithmetic wrong. This problem has a definite answer, making it easy to verify whether majority voting helps.
The correct answer is $28 (3 x $4 + 5 x $2 = $22, so $50 - $22 = $28). With five samples, you will typically see 4 or 5 votes for "$28" and maybe one outlier that made an arithmetic mistake. That is the power of self-consistency: even when individual reasoning paths occasionally fail, the majority converges on the correct answer.
Write a function check_agreement(answers) that takes a list of string answers (e.g., ["28", "28", "22", "28", "30"]) and returns a dictionary with three keys:
- "winner": the most common answer (string)
- "votes": how many times the winner appeared (int)
- "total": total number of answers (int)

If the list is empty, return {"winner": None, "votes": 0, "total": 0}.
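One possible reference solution, using `collections.Counter` (ties go to whichever answer `most_common` lists first):

```python
from collections import Counter

def check_agreement(answers):
    """Majority-vote over a list of string answers."""
    if not answers:
        return {"winner": None, "votes": 0, "total": 0}
    winner, votes = Counter(answers).most_common(1)[0]
    return {"winner": winner, "votes": votes, "total": len(answers)}
```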
Tree-of-Thought — Exploring Multiple Reasoning Branches
Self-consistency samples complete reasoning paths independently. Tree-of-Thought (ToT), introduced by Yao et al. (2023), goes further: it branches at intermediate steps, evaluates partial solutions, and prunes bad branches before they waste tokens. Think of it as the difference between having five students solve a problem independently (self-consistency) versus having one student sketch three different opening moves, evaluate which looks most promising, and only continue with the best one (ToT).
The ToT loop has three phases that repeat at each step: generate several candidate next steps, evaluate how promising each candidate is, and prune everything except the best branch before continuing.
Here is a practical ToT implementation. I have kept it deliberately readable rather than optimizing for generality — you can see exactly what happens at each step.
The evaluation step is where ToT gets its power, and honestly, it is the part I find most elegant. Instead of blindly continuing every branch, we ask the LLM to rate each candidate. This is the pruning mechanism — bad branches get cut early, saving tokens and improving final answer quality.
With generation and evaluation in place, the full ToT loop ties them together. At each depth level, we generate candidates, score them, keep only the best, and use it as the starting point for the next round.
Let us test it on a classic planning problem. This type of problem trips up standard CoT because an early wrong assignment cascades through the rest of the solution.
The valid solution is: Alice in the morning, Bob and Carol in the afternoon. This satisfies all four constraints. Notice how the ToT process generates multiple candidate assignments at the first step, scores them against the constraints, and only carries forward the one that does not violate any rule.
Skeleton-of-Thought — Speed Through Parallel Expansion
ToT and self-consistency both trade speed for accuracy — you make more API calls to get better answers. Skeleton-of-Thought (Ning et al., 2023) goes in the opposite direction: it aims to produce long-form content faster than a single monolithic prompt, while maintaining or improving quality.
The trick is decomposition. Instead of asking the LLM to write a full essay in one pass, you ask it to first produce a skeleton — just the key points — and then expand each point independently. Because the expansion calls do not depend on each other, they can run in parallel.
I find SoT particularly useful for generating documentation, blog post drafts, and comprehensive explanations where the structure is clear but the content is long. It regularly cuts my wall-clock generation time by 40-60% for content longer than 500 words.
Each skeleton point gets expanded independently. Since these expansions do not share state, we can fire them all at once with asyncio.gather — this is where the speed gain comes from.
With six skeleton points, the parallel version makes one skeleton call plus six expansion calls — but the six expansion calls run concurrently. In wall-clock time, this is roughly equivalent to two sequential calls instead of seven. The larger your skeleton, the bigger the speed advantage.
Write a function parse_skeleton(text) that takes a multi-line string containing a numbered list and returns a Python list of the point texts (without numbers or leading whitespace).
For example, given:
```
1. First point here
2. Second point
Some other line
3. Third point
```

It should return ["First point here", "Second point", "Third point"].
Only lines that start with a digit followed by a period and a space should be captured.
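One possible reference solution, using `re.match` to enforce the digit-period-space rule:

```python
import re

def parse_skeleton(text):
    """Return point texts from lines like '3. Third point', dropping everything else."""
    points = []
    for line in text.splitlines():
        match = re.match(r"\d+\. (.*)", line)
        if match:
            points.append(match.group(1).strip())
    return points
```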
Comparing the Three Techniques Side by Side
Each technique solves a different problem. Using the wrong one wastes API calls without improving results. This comparison table is the reference I keep coming back to when deciding which technique to reach for.
| Dimension | Self-Consistency | Tree-of-Thought | Skeleton-of-Thought |
|---|---|---|---|
| Core idea | Sample many, vote on answer | Branch, evaluate, prune | Outline first, expand in parallel |
| Best for | Single-answer questions | Planning, puzzles, constraint problems | Long-form content generation |
| API calls | n (typically 5) | branches x depth (typically 9-12) | 1 + skeleton points (typically 5-7) |
| Latency | Parallelizable (n independent calls) | Sequential across depth levels, parallel within each | Mostly parallel after skeleton |
| Accuracy gain | Moderate (+5-15% on math/logic) | High for suitable problems (+20-30%) | Not about accuracy — about speed |
| When to skip | Deterministic tasks (temp=0 is fine) | Simple Q&A, classification | Short answers, dependent sections |
Decision Framework — Which Technique to Use
This decision framework is deliberately simple. In practice, I start with standard chain-of-thought for every new task. If the results are inconsistent across runs, I switch to self-consistency. If the model keeps making early mistakes that cascade, I try tree-of-thought. And if latency is the bottleneck on long-form content, I reach for skeleton-of-thought.
Real-World Example — Multi-Step Problem with Verification
This is the pattern I use most often in production: combining techniques for different stages of a problem. Imagine you are building an AI assistant that answers customer questions about pricing. The answer must be correct — wrong pricing is a liability — so we will use ToT to reason through the calculation and self-consistency to verify the final number.
The correct prorated charge is $25. The unused Basic time is worth $29/30 per day x 15 days = $14.50, and Pro for those same 15 days costs $79/30 per day x 15 days = $39.50. The difference is $39.50 - $14.50 = $25.00. When both ToT and self-consistency converge on the same number, you can trust it. When they disagree, route the question to a human reviewer.
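The arithmetic is easy to sanity-check outside the LLM (the plan prices and the 15 remaining days come from the example above):

```python
# Upgrade mid-cycle from Basic ($29/mo) to Pro ($79/mo) with 15 of 30 days left.
days_left, days_in_month = 15, 30

basic_unused = 29 * days_left / days_in_month   # value of unused Basic time: $14.50
pro_remaining = 79 * days_left / days_in_month  # Pro for the same period: $39.50
charge = pro_remaining - basic_unused           # prorated charge: $25.00

print(f"${charge:.2f}")  # $25.00
```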
Common Mistakes and How to Fix Them
After building these systems for several projects, I have seen the same mistakes repeatedly. Here are the ones that cost the most time.
The most common mistake is running self-consistency with temperature set to zero — every sample comes back identical, so there is nothing to vote on.

```python
# WRONG: temperature=0 gives identical responses
result = await self_consistency(prompt, n=5, temperature=0.0)
# All 5 answers will be the same — voting is pointless

# RIGHT: moderate temperature creates diverse reasoning paths
result = await self_consistency(prompt, n=5, temperature=0.7)
# Each sample explores a different reasoning path
```

The second most common mistake is applying ToT to problems where a single CoT pass works fine. Every ToT step multiplies your API cost. If standard CoT already gets the right answer 95% of the time, adding ToT burns tokens without meaningful improvement.
```python
# OVERKILL: ToT for a simple lookup question
result = await tree_of_thought(
    "What is the capital of France?",
    depth=3, branches=3  # 9+ API calls for a trivial question
)

# RIGHT: use ToT only when exploration helps
result = await tree_of_thought(
    "Plan a 3-day conference schedule with 20 talks, "
    "5 rooms, and no speaker conflicts.",
    depth=3, branches=3  # exploration is worth it here
)
```

Write a function recommend(task_description) that takes a string describing a task and returns one of these strings: "CoT", "Self-Consistency", "Tree-of-Thought", or "Skeleton-of-Thought".
Use these rules:
- If the task involves planning, scheduling, puzzles, or constraints, return "Tree-of-Thought"
- If it asks for long-form content such as articles, documentation, or essays, return "Skeleton-of-Thought"
- If it asks for a single verifiable answer (math, logic, reasoning questions), return "Self-Consistency"
- Otherwise, return "CoT"

Performance and Cost Considerations
These techniques are powerful but expensive. Every branch in ToT and every sample in self-consistency is an API call you pay for. Here is a realistic cost breakdown based on gpt-4o-mini pricing.
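A back-of-the-envelope model makes the comparison concrete. The token counts below are illustrative assumptions, and the gpt-4o-mini rates ($0.15 and $0.60 per million input/output tokens) may have changed — verify against the current pricing page:

```python
# Assumed gpt-4o-mini rates in dollars per token (check current pricing).
PRICE_IN = 0.15 / 1_000_000
PRICE_OUT = 0.60 / 1_000_000

def estimate_cost(calls, tokens_in=400, tokens_out=300):
    """Dollar cost of `calls` API calls at the assumed per-token rates."""
    return calls * (tokens_in * PRICE_IN + tokens_out * PRICE_OUT)

for name, calls in [("CoT", 1), ("Self-Consistency", 5),
                    ("Tree-of-Thought", 13), ("Skeleton-of-Thought", 7)]:
    print(f"{name:>20}: {calls:>2} calls ~ ${estimate_cost(calls):.5f}")
```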
At gpt-4o-mini prices, even 13 calls for a full ToT run costs fractions of a cent. The real concern is latency — 13 sequential API calls add up to several seconds. For production systems, my approach is: use self-consistency (5 parallel calls) by default, reserve ToT for problems where it demonstrably outperforms, and use SoT to keep latency under your SLA for long-form generation.
Frequently Asked Questions
Can I use self-consistency with non-numerical answers?
Yes — replace the numerical extraction with a string normalization step. For classification tasks, extract the class label; for yes/no questions, extract "yes" or "no". The key is that your extraction function maps semantically identical answers to the same string before voting.
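For the yes/no case, a minimal normalizer might look like this (the label set and the last-mention heuristic are illustrative choices, not part of the original recipe):

```python
def normalize_answer(text, labels=("yes", "no")):
    """Map a free-form response onto one expected label, or None if none appears."""
    lowered = text.lower()
    best, best_pos = None, -1
    for label in labels:
        # Prefer the label mentioned last -- models usually restate the final answer.
        pos = lowered.rfind(label)
        if pos > best_pos:
            best, best_pos = label, pos
    return best
```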
How is Tree-of-Thought different from running CoT multiple times?
CoT with multiple samples (self-consistency) generates complete, independent reasoning chains. ToT generates partial chains, evaluates them, and only continues the best ones. This means ToT can catch errors mid-reasoning and avoid wasting tokens on doomed paths. The tradeoff is that ToT requires more structured prompting and more API calls per step.
Does Skeleton-of-Thought work with all models?
SoT works with any model that can follow structured instructions. The quality depends on the model producing a genuine skeleton rather than a full essay in the skeleton step. I have found that adding "Do not expand on the points — just list them" to the skeleton prompt is critical for smaller models that tend to be verbose.
Can these techniques work with open-source models via Ollama?
Absolutely. Replace the OpenAI client with an Ollama-compatible client and the techniques work identically. The accuracy improvements from self-consistency are often even larger with open-source models, because they have higher variance in individual responses.
Summary
Three techniques, three problems they solve. Self-Consistency samples multiple reasoning paths and lets majority voting select the most reliable answer — use it when you need accuracy on questions with a single correct answer. Tree-of-Thought generates, evaluates, and prunes reasoning branches at each step — use it for planning, puzzles, and problems where early mistakes cascade. Skeleton-of-Thought decomposes generation into an outline plus parallel expansion — use it when latency matters more than reasoning depth.
Start with standard chain-of-thought. Upgrade to self-consistency when results are inconsistent. Switch to tree-of-thought when the model keeps making early errors. Reach for skeleton-of-thought when generation is too slow. And when the stakes are high, combine them.
Complete Code
Here is every code block from this tutorial combined into a single runnable script. Replace the API key and run top-to-bottom.