
Advanced Reasoning: Tree-of-Thought, Self-Consistency, and Skeleton-of-Thought

Intermediate · 90 min · 3 exercises · 45 XP

Chain-of-thought prompting gives your LLM a single reasoning path. But what if that path is wrong? A student who solves a math problem once and hands it in is far less reliable than one who solves it three different ways and checks whether the answers agree. That is the core insight behind the techniques in this tutorial.

You will build three advanced reasoning systems in Python: Tree-of-Thought (explore multiple reasoning branches and pick the best), Self-Consistency (sample many answers and let majority voting surface the correct one), and Skeleton-of-Thought (generate an outline first, then flesh it out in parallel for speed). By the end, you will have a decision framework for choosing the right technique for any task.

Why Chain-of-Thought Is Not Enough

I remember the first time chain-of-thought prompting failed me spectacularly. I had a planning problem — scheduling five meetings across three rooms with constraints — and CoT produced a confident, step-by-step solution that violated two constraints. The reasoning looked right at every step, but the model never backtracked to reconsider an early decision that turned out to be wrong.

This is the fundamental limitation: chain-of-thought is a single pass. The model commits to its first interpretation and plows forward. For problems that require exploring alternatives, comparing candidate solutions, or recovering from wrong turns, one reasoning chain is not enough.

Three techniques address this, each from a different angle:

  • Tree-of-Thought (ToT): Generate multiple reasoning branches at each step, evaluate them, and pursue only the most promising ones. Best for planning and multi-step puzzles.
  • Self-Consistency (SC): Sample the same question multiple times with temperature > 0, then take the majority answer. Best for questions with a single correct answer.
  • Skeleton-of-Thought (SoT): Generate a skeleton outline first, then expand each point independently (even in parallel). Best for long-form generation where speed matters.

    Setup — install openai and create a reusable helper
    Loading editor...

    Self-Consistency — Majority Voting for Correct Answers

    I am starting with Self-Consistency because it is the simplest to implement and the most broadly useful. The idea, introduced by Wang et al. (2022), is almost embarrassingly straightforward: ask the LLM the same question multiple times with temperature turned up, extract the final answer from each response, and pick the answer that appears most often.

    Why does this work? Because with temperature > 0 the model samples different reasoning paths each time. Some paths lead to the right answer, some lead to wrong ones. But correct reasoning paths tend to converge on the same answer, while wrong paths scatter across different incorrect answers. Majority voting filters out the noise.

    Here is the full implementation. The self_consistency function takes a prompt, samples it n times at a given temperature, extracts the final numerical answer from each response, and returns the majority answer along with all the sampled reasoning paths.

    Self-Consistency implementation with majority voting
    Loading editor...
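Here is a sketch of what that implementation can look like. One liberty I am taking for testability: the model call is injected as an async `sample(prompt, temperature)` callable (for example, a thin wrapper around your chat-completion client), so the extraction and voting logic stand on their own.

```python
import asyncio
import re
from collections import Counter

def extract_final_number(text: str):
    """Treat the last number in a response as its final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None

async def self_consistency(prompt, sample, n=5, temperature=0.7):
    """Sample the prompt n times and majority-vote on the extracted answers.

    `sample` is any async callable (prompt, temperature) -> str.
    """
    paths = await asyncio.gather(*(sample(prompt, temperature) for _ in range(n)))
    answers = [a for a in (extract_final_number(p) for p in paths) if a is not None]
    votes = Counter(answers)
    winner, count = votes.most_common(1)[0] if votes else (None, 0)
    return {"answer": winner, "votes": count, "total": len(answers), "paths": paths}
```

Because correct reasoning paths converge and wrong ones scatter, `most_common(1)` is usually a landslide rather than a close vote.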

    The classic test for self-consistency is a word problem where a single CoT attempt sometimes gets the arithmetic wrong. This problem has a definite answer, making it easy to verify whether majority voting helps.

    Testing self-consistency on an arithmetic word problem
    Loading editor...

    The correct answer is $28 (3 x $4 + 5 x $2 = $22, so $50 - $22 = $28). With five samples, you will typically see 4 or 5 votes for "$28" and maybe one outlier that made an arithmetic mistake. That is the power of self-consistency: even when individual reasoning paths occasionally fail, the majority converges on the correct answer.

    Exercise 1: Build a Self-Consistency Answer Checker
    Write Code

    Write a function check_agreement(answers) that takes a list of string answers (e.g., ["28", "28", "22", "28", "30"]) and returns a dictionary with three keys:

  • "winner": the most common answer (string)
  • "votes": how many times the winner appeared (int)
  • "total": total number of answers (int)

    If the list is empty, return {"winner": None, "votes": 0, "total": 0}.

    Loading editor...

    Tree-of-Thought — Exploring Multiple Reasoning Branches

    Self-consistency samples complete reasoning paths independently. Tree-of-Thought (ToT), introduced by Yao et al. (2023), goes further: it branches at intermediate steps, evaluates partial solutions, and prunes bad branches before they waste tokens. Think of it as the difference between having five students solve a problem independently (self-consistency) versus having one student sketch three different opening moves, evaluate which looks most promising, and only continue with the best one (ToT).

    The ToT loop has three phases that repeat at each step:

  • Generate — produce multiple candidate "thoughts" (partial solutions) for the current step
  • Evaluate — score each candidate using the LLM as a judge
  • Select — keep only the top-scoring candidates and move to the next step

    Here is a practical ToT implementation. I have kept it deliberately readable rather than optimizing for generality — you can see exactly what happens at each step.

    Step 1 — Generate candidate thoughts
    Loading editor...
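A sketch of the generation phase, again with the model call injected as an async `sample(prompt, temperature)` callable; the prompt wording here is illustrative, not the lesson's exact code.

```python
import asyncio

async def generate_thoughts(problem, partial_solution, sample, branches=3):
    """Propose `branches` alternative next steps for the current partial solution."""
    prompt = (
        f"Problem: {problem}\n"
        f"Partial solution so far: {partial_solution or '(none yet)'}\n"
        "Propose ONE concrete next step that extends this solution. "
        "Be brief and specific."
    )
    # Temperature > 0 so the branches actually differ from each other.
    thoughts = await asyncio.gather(*(sample(prompt, 0.8) for _ in range(branches)))
    return [t.strip() for t in thoughts]
```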

    The evaluation step is where ToT gets its power, and honestly, it is the part I find most elegant. Instead of blindly continuing every branch, we ask the LLM to rate each candidate. This is the pruning mechanism — bad branches get cut early, saving tokens and improving final answer quality.

    Step 2 — Evaluate and score each candidate
    Loading editor...
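A sketch of the judge, same injected-`sample` convention. It asks for a 1-10 rating and pulls the first number out of the reply; anything unparseable scores zero, so garbled judgments prune themselves.

```python
import re

async def evaluate_thought(problem, thought, sample):
    """LLM-as-judge: rate a candidate thought from 1 (dead end) to 10 (promising)."""
    prompt = (
        f"Problem: {problem}\n"
        f"Candidate next step: {thought}\n"
        "Rate how promising this step is on a scale of 1 to 10, "
        "checking it against every constraint. Reply with the number only."
    )
    reply = await sample(prompt, 0.0)  # judging should be deterministic
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 0
```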

    With generation and evaluation in place, the full ToT loop ties them together. At each depth level, we generate candidates, score them, keep only the best, and use it as the starting point for the next round.

    The complete Tree-of-Thought loop
    Loading editor...
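The loop itself can be written generically. In this sketch, `generate(problem, partial)` returns a list of candidate extended solutions and `evaluate(problem, thought)` returns a numeric score; injecting both keeps the loop independent of any particular client or prompt wording, and beam width is fixed at 1 for readability.

```python
import asyncio

async def tree_of_thought(problem, generate, evaluate, depth=3):
    """Generic ToT loop: generate candidates, score them, keep the best.

    `generate` and `evaluate` are async callables (see the docstring above);
    each round extends the single best partial solution found so far.
    """
    best = ""
    for _ in range(depth):
        candidates = await generate(problem, best)
        scores = await asyncio.gather(*(evaluate(problem, c) for c in candidates))
        # Keep only the top-scoring branch (beam width 1 for simplicity).
        _, best = max(zip(scores, candidates), key=lambda pair: pair[0])
    return best
```

A wider beam (keeping the top-k candidates per level) trades more API calls for more robustness to a misjudged score.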

    Let us test it on a classic planning problem. This type of problem trips up standard CoT because an early wrong assignment cascades through the rest of the solution.

    Testing ToT on a constraint satisfaction problem
    Loading editor...

    The valid solution is: Alice in the morning, Bob and Carol in the afternoon. This satisfies all four constraints. Notice how the ToT process generates multiple candidate assignments at the first step, scores them against the constraints, and only carries forward the one that does not violate any rule.


    Skeleton-of-Thought — Speed Through Parallel Expansion

    ToT and self-consistency both trade speed for accuracy — you make more API calls to get better answers. Skeleton-of-Thought (Ning et al., 2023) goes in the opposite direction: it aims to produce long-form content faster than a single monolithic prompt, while maintaining or improving quality.

    The trick is decomposition. Instead of asking the LLM to write a full essay in one pass, you ask it to first produce a skeleton — just the key points — and then expand each point independently. Because the expansion calls do not depend on each other, they can run in parallel.

    I find SoT particularly useful for generating documentation, blog post drafts, and comprehensive explanations where the structure is clear but the content is long. It regularly cuts my wall-clock generation time by 40-60% for content longer than 500 words.

    Generating the skeleton outline
    Loading editor...
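A sketch of the skeleton step, with the model call injected as an async `sample(prompt, temperature)` callable; the prompt wording is illustrative.

```python
async def generate_skeleton(topic, sample):
    """Ask for the outline only.

    The explicit 'do not expand' instruction keeps verbose models from
    writing the whole essay in this step.
    """
    prompt = (
        f"Write a skeleton outline for: {topic}\n"
        "Give 5-7 key points as a numbered list (1., 2., ...), a few words each.\n"
        "Do not expand on the points - just list them."
    )
    return await sample(prompt, 0.7)
```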

    Each skeleton point gets expanded independently. Since these expansions do not share state, we can fire them all at once with asyncio.gather — this is where the speed gain comes from.

    Parallel expansion with asyncio.gather
    Loading editor...
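A sketch of the expansion step, written around an injected async `sample(prompt, temperature)` callable so the concurrency is visible on its own; names and prompt wording are illustrative.

```python
import asyncio

async def expand_points(topic, points, sample):
    """Expand every skeleton point concurrently.

    The expansions share no state, so asyncio.gather fires all requests
    at once - wall-clock time is roughly one call, not len(points) calls.
    """
    tasks = (
        sample(f"Topic: {topic}\nExpand this point in 2-3 sentences:\n{p}", 0.7)
        for p in points
    )
    expansions = await asyncio.gather(*tasks)
    return "\n\n".join(f"{p}\n{e}" for p, e in zip(points, expansions))
```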

    With six skeleton points, the parallel version makes one skeleton call plus six expansion calls — but the six expansion calls run concurrently. In wall-clock time, this is roughly equivalent to two sequential calls instead of seven. The larger your skeleton, the bigger the speed advantage.

    Exercise 2: Parse a Skeleton Outline
    Write Code

    Write a function parse_skeleton(text) that takes a multi-line string containing a numbered list and returns a Python list of the point texts (without numbers or leading whitespace).

    For example, given:

    1. First point here
    2. Second point
    Some other line
    3. Third point

    It should return ["First point here", "Second point", "Third point"].

    Only lines that start with a digit followed by a period and a space should be captured.

    Loading editor...

    Comparing the Three Techniques Side by Side

    Each technique solves a different problem. Using the wrong one wastes API calls without improving results. This comparison table is the reference I keep coming back to when deciding which technique to reach for.

    | Dimension | Self-Consistency | Tree-of-Thought | Skeleton-of-Thought |
    | --- | --- | --- | --- |
    | Core idea | Sample many, vote on answer | Branch, evaluate, prune | Outline first, expand in parallel |
    | Best for | Single-answer questions | Planning, puzzles, constraint problems | Long-form content generation |
    | API calls | n (typically 5) | branches × depth (typically 9-12) | 1 + skeleton points (typically 5-7) |
    | Latency | Sequential (n calls) | Sequential (depth × branches calls) | Mostly parallel after skeleton |
    | Accuracy gain | Moderate (+5-15% on math/logic) | High for suitable problems (+20-30%) | Not about accuracy — about speed |
    | When to skip | Deterministic tasks (temp=0 is fine) | Simple Q&A, classification | Short answers, dependent sections |

    Decision Framework — Which Technique to Use

    A programmatic decision framework
    Loading editor...

    This decision framework is deliberately simple. In practice, I start with standard chain-of-thought for every new task. If the results are inconsistent across runs, I switch to self-consistency. If the model keeps making early mistakes that cascade, I try tree-of-thought. And if latency is the bottleneck on long-form content, I reach for skeleton-of-thought.
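That escalation heuristic fits in a few lines. The function name and flags here are mine, purely illustrative; the heaviest failure mode is checked first, since cascading early errors also cause run-to-run inconsistency.

```python
def choose_technique(inconsistent_runs=False, early_errors_cascade=False,
                     slow_long_form=False):
    """Start with plain CoT; upgrade only when a failure mode shows up."""
    if early_errors_cascade:
        return "Tree-of-Thought"      # needs branching and backtracking
    if inconsistent_runs:
        return "Self-Consistency"     # needs voting across samples
    if slow_long_form:
        return "Skeleton-of-Thought"  # needs parallel generation
    return "CoT"
```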

    Real-World Example — Multi-Step Problem with Verification

    This is the pattern I use most often in production: combining techniques for different stages of a problem. Imagine you are building an AI assistant that answers customer questions about pricing. The answer must be correct — wrong pricing is a liability — so we will use ToT to reason through the calculation and self-consistency to verify the final number.

    Combining ToT and self-consistency for verified pricing answers
    Loading editor...
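The verification half of that pattern can be sketched independently of the reasoning pipeline: take any async question-to-answer callable (a ToT chain, or plain CoT), run it several times, vote, and escalate on disagreement. Names and the threshold are illustrative.

```python
import asyncio
from collections import Counter

async def answer_with_verification(question, reason_once, n=5, threshold=0.8):
    """Run a reasoning pipeline n times, vote, and flag disagreement.

    `reason_once` is an async callable: question -> final answer string.
    """
    answers = await asyncio.gather(*(reason_once(question) for _ in range(n)))
    winner, count = Counter(answers).most_common(1)[0]
    confidence = count / n
    return {
        "answer": winner,
        "confidence": confidence,
        "needs_human_review": confidence < threshold,  # route to a human
    }
```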

    The correct prorated charge is $25. Basic for the remaining 15 days costs $29/30 x 15 = $14.50, and Pro for those 15 days costs $79/30 x 15 = $39.50. The difference is $39.50 - $14.50 = $25.00. When both ToT and self-consistency converge on the same number, you can trust it. When they disagree, route the question to a human reviewer.
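The proration arithmetic is easy to sanity-check directly (plan prices and the 30-day cycle are the figures from the example):

```python
# Sanity check of the prorated upgrade charge.
remaining_days = 15
basic_remaining = round(29 / 30 * remaining_days, 2)  # Basic for 15 days
pro_remaining = round(79 / 30 * remaining_days, 2)    # Pro for 15 days
charge = round(pro_remaining - basic_remaining, 2)
print(basic_remaining, pro_remaining, charge)  # 14.5 39.5 25.0
```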

    Common Mistakes and How to Fix Them

    After building these systems for several projects, I have seen the same mistakes repeatedly. Here are the ones that cost the most time.

    Using temperature=0 for self-consistency
    # WRONG: temperature=0 gives identical responses
    result = await self_consistency(prompt, n=5, temperature=0.0)
    # All 5 answers will be the same — voting is pointless
    Use temperature 0.5-0.8 for diversity
    # RIGHT: moderate temperature creates diverse reasoning paths
    result = await self_consistency(prompt, n=5, temperature=0.7)
    # Each sample explores a different reasoning path

    The second most common mistake is applying ToT to problems where a single CoT pass works fine. Every ToT step multiplies your API cost. If standard CoT already gets the right answer 95% of the time, adding ToT burns tokens without meaningful improvement.

    ToT for simple factual questions
    # OVERKILL: ToT for a simple lookup question
    result = await tree_of_thought(
        "What is the capital of France?",
        depth=3, branches=3  # 9+ API calls for a trivial question
    )
    Save ToT for problems that need exploration
    # RIGHT: use ToT only when exploration helps
    result = await tree_of_thought(
        "Plan a 3-day conference schedule with 20 talks, "
        "5 rooms, and no speaker conflicts.",
        depth=3, branches=3  # exploration is worth it here
    )
    Exercise 3: Match Tasks to Techniques
    Write Code

    Write a function recommend(task_description) that takes a string describing a task and returns one of these strings: "CoT", "Self-Consistency", "Tree-of-Thought", or "Skeleton-of-Thought".

    Use these rules:

  • If the description contains "puzzle" or "planning" or "schedule", return "Tree-of-Thought"
  • If the description contains "write" or "document" or "article", return "Skeleton-of-Thought"
  • If the description contains "calculate" or "solve" or "answer", return "Self-Consistency"
  • Otherwise, return "CoT"

    Loading editor...

    Performance and Cost Considerations

    These techniques are powerful but expensive. Every branch in ToT and every sample in self-consistency is an API call you pay for. Here is a realistic cost breakdown based on gpt-4o-mini pricing.

    Cost comparison across techniques
    Loading editor...
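The call counts from the comparison table, as a quick calculator. Note that ToT implementations which score candidates with separate judge calls will add roughly branches × depth more; the function name and defaults are illustrative.

```python
def api_calls(n=5, branches=3, depth=3, skeleton_points=6):
    """Number of LLM calls per technique, per the comparison table."""
    return {
        "Self-Consistency": n,                    # n independent samples
        "Tree-of-Thought": branches * depth,      # generation calls only
        "Skeleton-of-Thought": 1 + skeleton_points,  # skeleton + expansions
    }
```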

    At gpt-4o-mini prices, even 13 calls for a full ToT run costs fractions of a cent. The real concern is latency — 13 sequential API calls add up to several seconds. For production systems, my approach is: use self-consistency (5 parallel calls) by default, reserve ToT for problems where it demonstrably outperforms, and use SoT to keep latency under your SLA for long-form generation.

    Frequently Asked Questions

    Can I use self-consistency with non-numerical answers?

    Yes — replace the numerical extraction with a string normalization step. For classification tasks, extract the class label; for yes/no questions, extract "yes" or "no". The key is that your extraction function maps semantically identical answers to the same string before voting.
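For example, a tiny normalizer for yes/no style answers (the alias table is illustrative; extend it for your domain):

```python
def normalize_answer(answer: str) -> str:
    """Map semantically identical answers to one canonical string before voting."""
    canonical = answer.strip().lower().rstrip(".!")
    aliases = {"y": "yes", "yeah": "yes", "yep": "yes", "n": "no", "nope": "no"}
    return aliases.get(canonical, canonical)
```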

    How is Tree-of-Thought different from running CoT multiple times?

    CoT with multiple samples (self-consistency) generates complete, independent reasoning chains. ToT generates partial chains, evaluates them, and only continues the best ones. This means ToT can catch errors mid-reasoning and avoid wasting tokens on doomed paths. The tradeoff is that ToT requires more structured prompting and more API calls per step.

    Does Skeleton-of-Thought work with all models?

    SoT works with any model that can follow structured instructions. The quality depends on the model producing a genuine skeleton rather than a full essay in the skeleton step. I have found that adding "Do not expand on the points — just list them" to the skeleton prompt is critical for smaller models that tend to be verbose.

    Can these techniques work with open-source models via Ollama?

    Absolutely. Replace the OpenAI client with an Ollama-compatible client and the techniques work identically. The accuracy improvements from self-consistency are often even larger with open-source models, because they have higher variance in individual responses.

    Summary

    Three techniques, three problems they solve. Self-Consistency samples multiple reasoning paths and lets majority voting select the most reliable answer — use it when you need accuracy on questions with a single correct answer. Tree-of-Thought generates, evaluates, and prunes reasoning branches at each step — use it for planning, puzzles, and problems where early mistakes cascade. Skeleton-of-Thought decomposes generation into an outline plus parallel expansion — use it when latency matters more than reasoning depth.

    Start with standard chain-of-thought. Upgrade to self-consistency when results are inconsistent. Switch to tree-of-thought when the model keeps making early errors. Reach for skeleton-of-thought when generation is too slow. And when the stakes are high, combine them.

    Complete Code

    Here is every code block from this tutorial combined into a single runnable script. Replace the API key and run top-to-bottom.

    Complete code — all techniques in one script
    Loading editor...

    References

  • Wang, X. et al. — "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR 2023. arXiv:2203.11171
  • Yao, S. et al. — "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." NeurIPS 2023. arXiv:2305.10601
  • Ning, X. et al. — "Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation." ICLR 2024. arXiv:2307.15337
  • Wei, J. et al. — "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. arXiv:2201.11903
  • OpenAI — Chat Completions API Reference.
  • Yao, S. et al. — "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. arXiv:2210.03629
  • Long, J. — "Large Language Model Guided Tree-of-Thought." 2023. arXiv:2305.08291