Advanced Reasoning: Tree-of-Thought, Self-Consistency, and Skeleton-of-Thought
Chain-of-thought prompting gives your LLM a single reasoning path. But what if that path is wrong? A student who solves a math problem once and hands it in is far less reliable than one who solves it three different ways and checks whether the answers agree. That is the core insight behind the techniques in this tutorial.
You will build three advanced reasoning systems in Python: Tree-of-Thought (explore multiple reasoning branches and pick the best), Self-Consistency (sample many answers and let majority voting surface the correct one), and Skeleton-of-Thought (generate an outline first, then flesh it out in parallel for speed). By the end, you will have a decision framework for choosing the right technique for any task.
Why Chain-of-Thought Is Not Enough
I remember the first time chain-of-thought prompting failed me spectacularly. I had a planning problem — scheduling five meetings across three rooms with constraints — and CoT produced a confident, step-by-step solution that violated two constraints. The reasoning looked right at every step, but the model never backtracked to reconsider an early decision that turned out to be wrong.
This is the fundamental limitation: chain-of-thought is a single pass. The model commits to its first interpretation and plows forward. For problems that require exploring alternatives, comparing candidate solutions, or recovering from wrong turns, one reasoning chain is not enough.
Three techniques address this, each from a different angle: self-consistency samples many complete reasoning chains and votes on the answer, tree-of-thought branches and prunes at intermediate steps, and skeleton-of-thought restructures generation around an outline for speed.
Self-Consistency — Majority Voting for Correct Answers
I am starting with Self-Consistency because it is the simplest to implement and the most broadly useful. The idea, introduced by Wang et al. (2022), is almost embarrassingly straightforward: ask the LLM the same question multiple times with temperature turned up, extract the final answer from each response, and pick the answer that appears most often.
Why does this work? Because with temperature > 0 the model samples different reasoning paths each time. Some paths lead to the right answer, some lead to wrong ones. But correct reasoning paths tend to converge on the same answer, while wrong paths scatter across different incorrect answers. Majority voting filters out the noise.
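A quick toy model makes this concrete. Under the simplifying assumptions that each sample is independently correct with probability p and that wrong answers scatter rather than agree, the chance that a majority of n samples is correct is a binomial tail:

```python
from math import comb

def majority_correct(p, n=5):
    """P(a strict majority of n independent samples is correct), per-sample accuracy p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

# With 70%-accurate individual samples, five votes lift accuracy to ~84%.
print(round(majority_correct(0.7, 5), 3))  # 0.837
```

The real gain is usually larger than this model suggests, precisely because wrong paths tend to disagree with each other while correct paths converge.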
Here is the full implementation. The self_consistency function takes a prompt, samples it n times at a given temperature, extracts the final numerical answer from each response, and returns the majority answer along with all the sampled reasoning paths.
The classic test for self-consistency is a word problem where a single CoT attempt sometimes gets the arithmetic wrong. This problem has a definite answer, making it easy to verify whether majority voting helps.
The correct answer is $28 (3 x $4 + 5 x $2 = $22, so $50 - $22 = $28). With five samples, you will typically see 4 or 5 votes for "$28" and maybe one outlier that made an arithmetic mistake. That is the power of self-consistency: even when individual reasoning paths occasionally fail, the majority converges on the correct answer.
Write a function check_agreement(answers) that takes a list of string answers (e.g., ["28", "28", "22", "28", "30"]) and returns a dictionary with three keys:
- "winner": the most common answer (string)
- "votes": how many times the winner appeared (int)
- "total": total number of answers (int)

If the list is empty, return {"winner": None, "votes": 0, "total": 0}.
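One possible reference solution, using `collections.Counter` (ties go to whichever answer `most_common` lists first):

```python
from collections import Counter

def check_agreement(answers):
    """Majority-vote over a list of string answers."""
    if not answers:
        return {"winner": None, "votes": 0, "total": 0}
    winner, votes = Counter(answers).most_common(1)[0]
    return {"winner": winner, "votes": votes, "total": len(answers)}
```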
Tree-of-Thought — Exploring Multiple Reasoning Branches
Self-consistency samples complete reasoning paths independently. Tree-of-Thought (ToT), introduced by Yao et al. (2023), goes further: it branches at intermediate steps, evaluates partial solutions, and prunes bad branches before they waste tokens. Think of it as the difference between having five students solve a problem independently (self-consistency) versus having one student sketch three different opening moves, evaluate which looks most promising, and only continue with the best one (ToT).
The ToT loop has three phases that repeat at each step: generate several candidate next steps, evaluate how promising each candidate is, and prune everything except the best branch before continuing.
Here is a practical ToT implementation. I have kept it deliberately readable rather than optimizing for generality — you can see exactly what happens at each step.
The evaluation step is where ToT gets its power, and honestly, it is the part I find most elegant. Instead of blindly continuing every branch, we ask the LLM to rate each candidate. This is the pruning mechanism — bad branches get cut early, saving tokens and improving final answer quality.
With generation and evaluation in place, the full ToT loop ties them together. At each depth level, we generate candidates, score them, keep only the best, and use it as the starting point for the next round.
Let us test it on a classic planning problem. This type of problem trips up standard CoT because an early wrong assignment cascades through the rest of the solution.
The valid solution is: Alice in the morning, Bob and Carol in the afternoon. This satisfies all four constraints. Notice how the ToT process generates multiple candidate assignments at the first step, scores them against the constraints, and only carries forward the one that does not violate any rule.
Skeleton-of-Thought — Speed Through Parallel Expansion
ToT and self-consistency both trade speed for accuracy — you make more API calls to get better answers. Skeleton-of-Thought (Ning et al., 2023) goes in the opposite direction: it aims to produce long-form content faster than a single monolithic prompt, while maintaining or improving quality.
The trick is decomposition. Instead of asking the LLM to write a full essay in one pass, you ask it to first produce a skeleton — just the key points — and then expand each point independently. Because the expansion calls do not depend on each other, they can run in parallel.
I find SoT particularly useful for generating documentation, blog post drafts, and comprehensive explanations where the structure is clear but the content is long. It regularly cuts my wall-clock generation time by 40-60% for content longer than 500 words.
Each skeleton point gets expanded independently. Since these expansions do not share state, we can fire them all at once with asyncio.gather — this is where the speed gain comes from.
With six skeleton points, the parallel version makes one skeleton call plus six expansion calls — but the six expansion calls run concurrently. In wall-clock time, this is roughly equivalent to two sequential calls instead of seven. The larger your skeleton, the bigger the speed advantage.
Write a function parse_skeleton(text) that takes a multi-line string containing a numbered list and returns a Python list of the point texts (without numbers or leading whitespace).
For example, given:
```
1. First point here
2. Second point
Some other line
3. Third point
```

It should return ["First point here", "Second point", "Third point"].
Only lines that start with a digit followed by a period and a space should be captured.
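One possible reference solution, using `re.match` to enforce the digit-period-space rule:

```python
import re

def parse_skeleton(text):
    """Return point texts from lines like '3. Third point', dropping everything else."""
    points = []
    for line in text.splitlines():
        match = re.match(r"\d+\. (.*)", line)
        if match:
            points.append(match.group(1).strip())
    return points
```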
Comparing the Three Techniques Side by Side
Each technique solves a different problem. Using the wrong one wastes API calls without improving results. This comparison table is the reference I keep coming back to when deciding which technique to reach for.
| Dimension | Self-Consistency | Tree-of-Thought | Skeleton-of-Thought |
|---|---|---|---|
| Core idea | Sample many, vote on answer | Branch, evaluate, prune | Outline first, expand in parallel |
| Best for | Single-answer questions | Planning, puzzles, constraint problems | Long-form content generation |
| API calls | n (typically 5) | branches x depth (typically 9-12) | 1 + skeleton points (typically 5-7) |
| Latency | Parallelizable (n independent calls) | Sequential across depth levels, parallel within each | Mostly parallel after skeleton |
| Accuracy gain | Moderate (+5-15% on math/logic) | High for suitable problems (+20-30%) | Not about accuracy — about speed |
| When to skip | Deterministic tasks (temp=0 is fine) | Simple Q&A, classification | Short answers, dependent sections |
Decision Framework — Which Technique to Use
This decision framework is deliberately simple. In practice, I start with standard chain-of-thought for every new task. If the results are inconsistent across runs, I switch to self-consistency. If the model keeps making early mistakes that cascade, I try tree-of-thought. And if latency is the bottleneck on long-form content, I reach for skeleton-of-thought.
Real-World Example — Multi-Step Problem with Verification
This is the pattern I use most often in production: combining techniques for different stages of a problem. Imagine you are building an AI assistant that answers customer questions about pricing. The answer must be correct — wrong pricing is a liability — so we will use ToT to reason through the calculation and self-consistency to verify the final number.
The correct prorated charge is $25. The unused Basic time is worth $29/30 per day x 15 days = $14.50, and Pro for those same 15 days costs $79/30 per day x 15 days = $39.50. The difference is $39.50 - $14.50 = $25.00. When both ToT and self-consistency converge on the same number, you can trust it. When they disagree, route the question to a human reviewer.
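The arithmetic is easy to sanity-check outside the LLM (the plan prices and the 15 remaining days come from the example above):

```python
# Upgrade mid-cycle from Basic ($29/mo) to Pro ($79/mo) with 15 of 30 days left.
days_left, days_in_month = 15, 30

basic_unused = 29 * days_left / days_in_month   # value of unused Basic time: $14.50
pro_remaining = 79 * days_left / days_in_month  # Pro for the same period: $39.50
charge = pro_remaining - basic_unused           # prorated charge: $25.00

print(f"${charge:.2f}")  # $25.00
```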
Common Mistakes and How to Fix Them
After building these systems for several projects, I have seen the same mistakes repeatedly. Here are the ones that cost the most time.
The most common mistake is running self-consistency with temperature set to zero — every sample comes back identical, so there is nothing to vote on.

```python
# WRONG: temperature=0 gives identical responses
result = await self_consistency(prompt, n=5, temperature=0.0)
# All 5 answers will be the same — voting is pointless

# RIGHT: moderate temperature creates diverse reasoning paths
result = await self_consistency(prompt, n=5, temperature=0.7)
# Each sample explores a different reasoning path
```

The second most common mistake is applying ToT to problems where a single CoT pass works fine. Every ToT step multiplies your API cost. If standard CoT already gets the right answer 95% of the time, adding ToT burns tokens without meaningful improvement.
```python
# OVERKILL: ToT for a simple lookup question
result = await tree_of_thought(
    "What is the capital of France?",
    depth=3, branches=3  # 9+ API calls for a trivial question
)

# RIGHT: use ToT only when exploration helps
result = await tree_of_thought(
    "Plan a 3-day conference schedule with 20 talks, "
    "5 rooms, and no speaker conflicts.",
    depth=3, branches=3  # exploration is worth it here
)
```

Write a function recommend(task_description) that takes a string describing a task and returns one of these strings: "CoT", "Self-Consistency", "Tree-of-Thought", or "Skeleton-of-Thought".
Use these rules:
- If the task involves planning, scheduling, puzzles, or constraints, return "Tree-of-Thought"
- If it asks for long-form content such as articles, documentation, or essays, return "Skeleton-of-Thought"
- If it asks for a single verifiable answer (math, logic, reasoning questions), return "Self-Consistency"
- Otherwise, return "CoT"

Performance and Cost Considerations
These techniques are powerful but expensive. Every branch in ToT and every sample in self-consistency is an API call you pay for. Here is a realistic cost breakdown based on gpt-4o-mini pricing.
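A back-of-the-envelope model makes the comparison concrete. The token counts below are illustrative assumptions, and the gpt-4o-mini rates ($0.15 and $0.60 per million input/output tokens) may have changed — verify against the current pricing page:

```python
# Assumed gpt-4o-mini rates in dollars per token (check current pricing).
PRICE_IN = 0.15 / 1_000_000
PRICE_OUT = 0.60 / 1_000_000

def estimate_cost(calls, tokens_in=400, tokens_out=300):
    """Dollar cost of `calls` API calls at the assumed per-token rates."""
    return calls * (tokens_in * PRICE_IN + tokens_out * PRICE_OUT)

for name, calls in [("CoT", 1), ("Self-Consistency", 5),
                    ("Tree-of-Thought", 13), ("Skeleton-of-Thought", 7)]:
    print(f"{name:>20}: {calls:>2} calls ~ ${estimate_cost(calls):.5f}")
```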
At gpt-4o-mini prices, even 13 calls for a full ToT run costs fractions of a cent. The real concern is latency — 13 sequential API calls add up to several seconds. For production systems, my approach is: use self-consistency (5 parallel calls) by default, reserve ToT for problems where it demonstrably outperforms, and use SoT to keep latency under your SLA for long-form generation.
Frequently Asked Questions
Can I use self-consistency with non-numerical answers?
Yes — replace the numerical extraction with a string normalization step. For classification tasks, extract the class label; for yes/no questions, extract "yes" or "no". The key is that your extraction function maps semantically identical answers to the same string before voting.
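For the yes/no case, a minimal normalizer might look like this (the label set and the last-mention heuristic are illustrative choices, not part of the original recipe):

```python
def normalize_answer(text, labels=("yes", "no")):
    """Map a free-form response onto one expected label, or None if none appears."""
    lowered = text.lower()
    best, best_pos = None, -1
    for label in labels:
        # Prefer the label mentioned last -- models usually restate the final answer.
        pos = lowered.rfind(label)
        if pos > best_pos:
            best, best_pos = label, pos
    return best
```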
How is Tree-of-Thought different from running CoT multiple times?
CoT with multiple samples (self-consistency) generates complete, independent reasoning chains. ToT generates partial chains, evaluates them, and only continues the best ones. This means ToT can catch errors mid-reasoning and avoid wasting tokens on doomed paths. The tradeoff is that ToT requires more structured prompting and more API calls per step.
Does Skeleton-of-Thought work with all models?
SoT works with any model that can follow structured instructions. The quality depends on the model producing a genuine skeleton rather than a full essay in the skeleton step. I have found that adding "Do not expand on the points — just list them" to the skeleton prompt is critical for smaller models that tend to be verbose.
Can these techniques work with open-source models via Ollama?
Absolutely. Replace the OpenAI client with an Ollama-compatible client and the techniques work identically. The accuracy improvements from self-consistency are often even larger with open-source models, because they have higher variance in individual responses.
Summary
Three techniques, three problems they solve. Self-Consistency samples multiple reasoning paths and lets majority voting select the most reliable answer — use it when you need accuracy on questions with a single correct answer. Tree-of-Thought generates, evaluates, and prunes reasoning branches at each step — use it for planning, puzzles, and problems where early mistakes cascade. Skeleton-of-Thought decomposes generation into an outline plus parallel expansion — use it when latency matters more than reasoning depth.
Start with standard chain-of-thought. Upgrade to self-consistency when results are inconsistent. Switch to tree-of-thought when the model keeps making early errors. Reach for skeleton-of-thought when generation is too slow. And when the stakes are high, combine them.
Complete Code
Here is every code block from this tutorial combined into a single runnable script. Replace the API key and run top-to-bottom.