
Hallucination Detection Pipeline: Catch LLM Lies Before Your Users Do

Intermediate · 90 min · 3 exercises · 55 XP

Your LLM just told a customer that your product has a feature it does not have. Or it cited a research paper that does not exist. Or it confidently stated a legal regulation that was repealed three years ago. These are hallucinations, and they are the single biggest reason production AI systems lose trust. In this tutorial, we build a complete detection pipeline — from cheap string-level checks to LLM-powered claim verification — so you catch the lies before your users do.

What Are Hallucinations and Why Should You Care?

An LLM hallucination is any generated output that looks plausible but is factually wrong, unsupported by the source material, or entirely fabricated. The model does not "know" it is lying — it is simply predicting the most likely next token, and sometimes the most likely continuation is wrong.

I have seen hallucinations range from subtle (wrong version numbers in documentation) to catastrophic (fabricated legal citations in a compliance tool). The tricky part is that hallucinated text reads exactly like correct text — same confidence, same formatting, same authoritative tone. You cannot eyeball it at scale. You need automated detection. There are three broad categories worth understanding before we start building:

The three types of hallucination
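One common taxonomy (the labels vary across the literature) splits hallucinations into factuality errors, faithfulness errors, and outright fabrications. A small illustrative sketch, with hypothetical model outputs:

```python
# Illustrative examples of the three hallucination types (hypothetical outputs).
EXAMPLES = {
    # Factual: contradicts real-world knowledge.
    "factual": "The Eiffel Tower was completed in 1923.",  # it opened in 1889
    # Faithfulness: contradicts the source the model was given.
    "faithfulness": "The report states that revenue doubled.",  # the source says it fell
    # Fabrication: invents entities, features, or citations that do not exist.
    "fabrication": "As shown in Smith et al. (2021), 'Neural Widget Theory'...",
}

for kind, example in EXAMPLES.items():
    print(f"{kind:>12}: {example}")
```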

Our detection pipeline will tackle all three. We start with the cheapest, fastest checks and escalate to more expensive LLM-powered verification only when needed.

Setting Up the OpenAI Client

Every code block in this tutorial runs directly in your browser. We will use the OpenAI API for the LLM-powered verification steps. The pure Python checks (string matching, claim extraction logic) need no API key at all.

Install and configure the OpenAI client
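A minimal setup sketch, assuming the `openai` Python package (v1+) and an `OPENAI_API_KEY` environment variable; the `ask` helper is my own convenience wrapper, not part of the library:

```python
# pip install openai
import os
from openai import AsyncOpenAI

# The client reads OPENAI_API_KEY from the environment by default;
# passing it explicitly just makes the dependency visible.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def ask(prompt: str, temperature: float = 0.0) -> str:
    """Single chat completion, for one-off prompts."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```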

Level 1: Cheap String-Level Checks

Before burning API credits on fancy verification, catch the obvious problems with plain Python. These checks cost zero tokens and run in microseconds. I always run these first in production — they catch more problems than you would expect.

Contradiction Detector

The simplest hallucination signal: the LLM contradicts itself within the same response. Look for numerical inconsistencies and conflicting statements.

Detect self-contradicting numbers
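A sketch of such a detector; the regex patterns and the `sample` text are illustrative assumptions:

```python
import re
from collections import defaultdict

def find_number_contradictions(text):
    """Flag labels that appear with conflicting numeric values in one response."""
    findings = defaultdict(set)
    # Founding-year claims: "<Entity> was founded in 2015"
    for m in re.finditer(r"([A-Z][A-Za-z ]*?) was founded in (\d{4})", text):
        findings[(m.group(1).strip(), "founded")].add(m.group(2))
    # Revenue claims: "revenue of $5.2 million" / "revenue was $8.1 million"
    for m in re.finditer(r"[Rr]evenue (?:of|was|is) \$?([\d.]+) million", text):
        findings[("Revenue", "amount")].add(m.group(1))
    # Keep only labels that were given more than one distinct value.
    return {k: sorted(v) for k, v in findings.items() if len(v) > 1}

sample = ("Acme Corp was founded in 2015 and reported revenue of $5.2 million. "
          "Acme Corp was founded in 2013 by two engineers, and revenue was "
          "$8.1 million last year.")
print(find_number_contradictions(sample))
```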

The detector catches two contradictions: "Acme Corp" has conflicting founding years (2015 and 2013), and "Revenue" has conflicting values (5.2 and 8.1 million). Simple regex, zero API cost, immediate signal that this response needs re-generation.

Hedging and Confidence Signals

LLMs sometimes hedge when they are uncertain — phrases like "I believe," "approximately," or "it is possible that." These hedge phrases are not proof of hallucination, but they correlate with lower factual accuracy. When an LLM is confident and correct, it rarely hedges.

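A sketch of a hedge scorer; the phrase list is an illustrative starting point, not exhaustive:

```python
import re

# Phrases that correlate with uncertainty (an assumed, non-exhaustive list).
HEDGE_PHRASES = [
    "i believe", "i think", "approximately", "possibly", "perhaps",
    "it is possible", "might", "may be", "roughly", "probably",
]

def hedge_score(text: str) -> float:
    """Average number of hedge phrases per sentence; 0.0 means no hedging found."""
    lowered = text.lower()
    hits = sum(len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
               for p in HEDGE_PHRASES)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return round(hits / sentences, 2)
```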

Quick check: What hedge score would you expect for the text "The answer is definitely 42. I am sure of it."? Answer: 0.0 — "definitely" and "I am sure" signal certainty, not uncertainty. Our detector only flags hedging phrases like "I believe" or "possibly." Confident-sounding text gets a clean bill of health, which is exactly what we want.

Claim Extraction — Breaking Output into Verifiable Statements

A paragraph of LLM output might contain five distinct factual claims. To verify each one, we first need to extract them. This is where most people skip a step: they try to verify the entire paragraph as one blob. That misses individual false claims buried in otherwise correct text.

We can extract claims two ways: with regex patterns for structured claims (numbers, dates, names) and with an LLM for nuanced factual statements. Let us build both.

Pattern-Based Claim Extraction (No API Needed)

We scan for three patterns: date claims (years mentioned with context words like "in" or "since"), numerical claims (amounts with units like "million" or "%"), and entity claims ("X is/was Y" patterns). The function returns each claim with its type and the full sentence it appeared in.

Extract verifiable claims with regex
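A sketch of the three extractors described above; the regexes and the deliberately wrong `sample` text (TensorFlow was actually released by Google) are illustrative assumptions:

```python
import re

def extract_claims(text):
    """Return structured claims: date, number, and entity patterns per sentence."""
    claims = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Date claims: a year preceded by "in" or "since"
        for m in re.finditer(r"\b(?:in|since)\s+(?:19|20)\d{2}\b", sentence):
            claims.append({"type": "date", "claim": m.group(0), "sentence": sentence})
        # Numerical claims: amounts with units
        for m in re.finditer(r"\b[\d,.]+\s*(?:million|billion|%|percent)", sentence):
            claims.append({"type": "number", "claim": m.group(0), "sentence": sentence})
        # Entity claims: "X is/was Y"
        for m in re.finditer(r"\b([A-Z]\w+)\s+(?:is|was)\s+([^,.]+)", sentence):
            claims.append({"type": "entity", "claim": m.group(0), "sentence": sentence})
    return claims

sample = ("Python became the most popular data science language, surpassing R in 2018. "
          "TensorFlow was released by Microsoft in 2015 and now holds over 70% "
          "of framework mindshare.")
for claim in extract_claims(sample):
    print(claim["type"], "->", claim["claim"])
```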

Notice that our test text contains a deliberate error: TensorFlow was released by Google, not Microsoft. The claim extractor does not judge accuracy — it just identifies the claims. Verification comes next.

LLM-Powered Claim Extraction

Pattern-based extraction misses claims that do not follow neat "X is Y" patterns. For thorough extraction, we ask the LLM itself to decompose a paragraph into individual claims. This costs a few cents but catches everything.

LLM-powered claim decomposition
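A sketch assuming the async OpenAI client from the setup step; `extract_claims_llm`, its prompt wording, and the `parse_claim_lines` helper are my own choices, not a library API:

```python
EXTRACTION_PROMPT = (
    "Break the following text into individual atomic factual claims. "
    "Output one claim per line with no numbering or bullets.\n\nText:\n{text}"
)

def parse_claim_lines(raw: str) -> list[str]:
    """Turn the model's line-per-claim output into a clean list of claims."""
    return [line.strip("-• \t") for line in raw.splitlines() if line.strip()]

async def extract_claims_llm(client, text: str, model: str = "gpt-4o-mini") -> list[str]:
    response = await client.chat.completions.create(
        model=model,
        temperature=0.0,  # deterministic extraction
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(text=text)}],
    )
    return parse_claim_lines(response.choices[0].message.content)
```

Usage would look like `claims = await extract_claims_llm(client, paragraph)`, after which each claim becomes an independent verification target.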

The LLM identifies claims that regex would miss — like "surpassing R in 2018" (a comparison claim) and the attribution of TensorFlow to Microsoft (which is wrong — it should be Google). Each claim now becomes an independent verification target.

Self-Consistency Checks — Ask the Same Question Multiple Times

This is one of my favorite detection techniques because the intuition is dead simple: if you ask the same question five times and get five different answers, the model is probably guessing. If all five agree, the answer is more likely correct.

Self-consistency was introduced in the "Self-Consistency Improves Chain of Thought Reasoning" paper (Wang et al., 2022). The idea: sample multiple responses at higher temperature, then check whether they converge on the same answer. Disagreement is a hallucination signal.

Self-consistency check across multiple samples
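A sketch of the sampling loop plus a simple agreement metric; the answer normalization is deliberately crude and the prompt suffix is an assumption:

```python
import re
from collections import Counter

def normalize_answer(text: str) -> str:
    """Crude normalization so trivially different phrasings compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def agreement_ratio(answers: list[str]) -> float:
    """Fraction of samples that agree with the most common answer."""
    if not answers:
        return 0.0
    counts = Counter(normalize_answer(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

async def self_consistency(client, question: str, n: int = 5,
                           model: str = "gpt-4o-mini") -> float:
    answers = []
    for _ in range(n):
        r = await client.chat.completions.create(
            model=model,
            temperature=0.7,  # higher temperature so uncertainty shows as disagreement
            messages=[{"role": "user",
                       "content": question + " Answer in one short sentence."}],
        )
        answers.append(r.choices[0].message.content)
    return agreement_ratio(answers)
```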

The agreement ratio alone is useful, but we can do better. Let us extract the key factual claims from each response and compare them directly.

Claim-level consistency analysis
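A sketch that compares the numeric facts extracted from each sample; `key_facts` is a simplistic assumption (it looks at numbers only):

```python
import re

def key_facts(text: str) -> set[str]:
    """Extract the numeric claims (years, quantities) from one response."""
    return set(re.findall(r"\b\d[\d,.]*\b", text))

def fact_consistency(responses: list[str]) -> dict:
    """Split numeric facts into those every sample states vs. only some samples."""
    fact_sets = [key_facts(r) for r in responses]
    stable = set.intersection(*fact_sets) if fact_sets else set()
    unstable = (set.union(*fact_sets) - stable) if fact_sets else set()
    return {"stable": stable, "unstable": unstable}
```

Facts in `unstable` appeared in some samples but not others, which is exactly the guessing signal we are after.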

Source-Grounded Verification — Checking Claims Against Reference Text

Self-consistency tells you whether the model is sure of itself. Source-grounded verification tells you whether the model is correct. The idea: given a reference document (your knowledge base, a retrieved passage, a product spec), check whether each claim in the LLM output is actually supported by that source.

This is the core technique behind RAG evaluation and is essential if your LLM is supposed to answer questions about specific documents. Without it, the model can confidently fabricate details that sound like they came from the document but did not.

Verify claims against a source document
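A sketch of the verification call; the three-label scheme follows the section above, while the prompt wording and the conservative parsing in `parse_verdict` are my own assumptions:

```python
VERIFY_PROMPT = (
    "Source document:\n{source}\n\n"
    "Claim: {claim}\n\n"
    "Is the claim SUPPORTED, CONTRADICTED, or NOT_FOUND in the source? "
    "Reply with exactly one of those three words."
)

def parse_verdict(raw: str) -> str:
    """Map a raw model reply onto one of the three verdict labels."""
    token = raw.strip().upper().replace(" ", "_")
    if "NOT_FOUND" in token:
        return "NOT_FOUND"
    if "CONTRADICTED" in token:
        return "CONTRADICTED"
    if "SUPPORTED" in token:
        return "SUPPORTED"
    return "NOT_FOUND"  # conservative default for unparseable replies

async def verify_claim(client, claim: str, source: str,
                       model: str = "gpt-4o-mini") -> dict:
    r = await client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{"role": "user",
                   "content": VERIFY_PROMPT.format(source=source, claim=claim)}],
    )
    return {"claim": claim, "verdict": parse_verdict(r.choices[0].message.content)}
```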

Before reading the results, try this: look at the six claims above and the source document. Which ones would you flag? The verification should catch three problems: the widget count is 50, not 100 (CONTRADICTED); there is no Python notebook feature (NOT_FOUND — the LLM fabricated it); and data retention is 12 months on Standard, not 24 (CONTRADICTED). The other three claims check out.

Build a Claim Extractor Function
Write Code

Write a function extract_year_claims(text) that extracts all year-based claims from a text. The function should find patterns where a year (4-digit number between 1900 and 2030) appears with surrounding context, and return a list of dictionaries with keys "year" (int) and "context" (the sentence containing the year). Important: If the same sentence contains multiple years, return a separate entry for each year.

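One way to satisfy the spec (a reference sketch, not the only valid answer):

```python
import re

def extract_year_claims(text):
    claims = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Years 1900-2030: 19xx, 20[0-2]x, or exactly 2030.
        for match in re.finditer(r"\b(19\d{2}|20[0-2]\d|2030)\b", sentence):
            # One entry per year, even when a sentence contains several years.
            claims.append({"year": int(match.group(0)), "context": sentence})
    return claims
```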

The HallucinationDetector Class — Putting It All Together

Now we combine every technique into a single reusable class. The detector runs checks in order of cost: free string checks first, then LLM-powered verification only if needed. This layered approach keeps costs low while catching hallucinations at every level.

The complete HallucinationDetector class
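A compact sketch of such a class; the method names, thresholds, and escalation rule are my own choices:

```python
import re

class HallucinationDetector:
    """Layered detector: free string checks first, paid LLM verification if needed."""

    HEDGES = ["i believe", "i think", "possibly", "perhaps", "approximately", "might"]

    def __init__(self, client=None, model="gpt-4o-mini"):
        self.client = client  # optional AsyncOpenAI client for the LLM level
        self.model = model

    def hedge_score(self, text):
        """Hedge phrases per sentence (free, instant)."""
        lowered = text.lower()
        hits = sum(len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
                   for p in self.HEDGES)
        return hits / max(1, len(re.findall(r"[.!?]+", text)))

    def number_contradictions(self, text):
        """Same entity given two different founding years (free, instant)."""
        seen, conflicts = {}, []
        for m in re.finditer(r"([A-Z][A-Za-z ]*?) was founded in (\d{4})", text):
            entity, year = m.group(1).strip(), m.group(2)
            if seen.setdefault(entity, year) != year:
                conflicts.append((entity, seen[entity], year))
        return conflicts

    def cheap_checks(self, text):
        return {"hedge_score": self.hedge_score(text),
                "contradictions": self.number_contradictions(text)}

    async def verify(self, source, claims):
        """Paid level: one verification call per claim."""
        verdicts = []
        for claim in claims:
            r = await self.client.chat.completions.create(
                model=self.model, temperature=0.0,
                messages=[{"role": "user", "content":
                    f"Source:\n{source}\n\nClaim: {claim}\n\n"
                    "Answer SUPPORTED, CONTRADICTED, or NOT_FOUND."}])
            verdicts.append({"claim": claim,
                             "verdict": r.choices[0].message.content.strip().upper()})
        return verdicts

    async def run(self, text, source=None, claims=None, strict=False):
        report = {"cheap": self.cheap_checks(text)}
        suspicious = (bool(report["cheap"]["contradictions"])
                      or report["cheap"]["hedge_score"] > 0.5)
        if self.client and source and claims and (suspicious or strict):
            report["verdicts"] = await self.verify(source, claims)
        return report
```

Usage would look like `report = await detector.run(answer, source=spec_text, claims=claims, strict=True)`; in strict mode, every response with a source gets the full LLM verification pass.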

The class follows a simple pattern: cheap checks first, then escalate. Let us run it on a real example to see the full pipeline in action.


The pipeline catches multiple issues: Oracle is not a supported data source (contradicted), 100 nodes is wrong — it is 50 (contradicted), 25 users is wrong — it is 10 (contradicted), and the ML training capabilities and $50M valuation are entirely fabricated (not found in source). This is exactly the kind of output that would damage user trust if served unchecked.

Build a Confidence Scorer
Write Code

Write a function score_confidence(text) that analyzes a text for confidence signals and returns a dictionary with three keys:

  • "confident_phrases": count of confident phrases (e.g., "definitely", "certainly", "without a doubt", "it is clear", "undoubtedly")
  • "uncertain_phrases": count of uncertain phrases (e.g., "maybe", "perhaps", "possibly", "might", "could be", "I think", "I believe")
  • "confidence_ratio": a float from 0.0 to 1.0 calculated as confident / (confident + uncertain). If both counts are 0, return 0.5.
  • Use case-insensitive matching.

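One way to satisfy the spec (a reference sketch, not the only valid answer):

```python
import re

CONFIDENT = ["definitely", "certainly", "without a doubt", "it is clear", "undoubtedly"]
UNCERTAIN = ["maybe", "perhaps", "possibly", "might", "could be", "i think", "i believe"]

def score_confidence(text):
    lowered = text.lower()
    confident = sum(len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
                    for p in CONFIDENT)
    uncertain = sum(len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
                    for p in UNCERTAIN)
    total = confident + uncertain
    return {
        "confident_phrases": confident,
        "uncertain_phrases": uncertain,
        # Neutral 0.5 when the text contains no confidence signals at all.
        "confidence_ratio": confident / total if total else 0.5,
    }
```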

Real-World Pipeline Patterns

In production, you rarely run the full pipeline on every response. Here are the patterns I have seen work well across different use cases.

Pattern 1: Gate-and-Escalate

Gate-and-escalate pattern

Pattern 2: Batch Verification for RAG

When your LLM generates answers based on retrieved documents, verify the answer against the retrieved chunks in batch. This is the pattern used in production RAG evaluation.

RAG response verification
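A sketch using `asyncio.gather` to verify claims in parallel against the retrieved chunks; the prompt wording is an assumption:

```python
import asyncio

def groundedness(verdicts: list[dict]) -> float:
    """Fraction of claims the source actually supports."""
    if not verdicts:
        return 0.0
    return sum(v["verdict"] == "SUPPORTED" for v in verdicts) / len(verdicts)

async def verify_rag_answer(client, claims, retrieved_chunks, model="gpt-4o-mini"):
    """Verify every extracted claim against the retrieved context, in parallel."""
    source = "\n\n".join(retrieved_chunks)

    async def check(claim):
        r = await client.chat.completions.create(
            model=model,
            temperature=0.0,
            messages=[{"role": "user", "content":
                f"Source:\n{source}\n\nClaim: {claim}\n\n"
                "Answer with exactly one word: SUPPORTED, CONTRADICTED, or NOT_FOUND."}],
        )
        return {"claim": claim, "verdict": r.choices[0].message.content.strip().upper()}

    verdicts = list(await asyncio.gather(*(check(c) for c in claims)))
    return {"verdicts": verdicts, "groundedness": groundedness(verdicts)}
```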

Common Mistakes and How to Avoid Them

Verifying the whole paragraph as one unit
# BAD: treats entire response as single claim
response = "Python was made in 1991 by Guido. It has 50M users."
is_correct = verify(response)  # Misses individual false claims

Breaking into individual claims first
# GOOD: extract then verify each claim
claims = extract_claims(response)
for claim in claims:
    result = verify(claim, source)
    # Catches "50M users" as unverifiable

Another mistake I see frequently: running self-consistency checks at temperature 0. At zero temperature, you get nearly identical responses every time, so the agreement ratio is always high. That tells you nothing about whether the answer is actually correct. Use temperature 0.7 or higher for self-consistency.

Self-consistency at temperature 0
# BAD: identical responses = false confidence
results = []
for _ in range(5):
    r = await client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,  # Always same answer!
        messages=[{"role": "user", "content": q}]
    )
    results.append(r)
# Agreement = 100% but answer may be wrong

Self-consistency at temperature 0.7
# GOOD: diverse sampling reveals uncertainty
results = []
for _ in range(5):
    r = await client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.7,  # Allows variation
        messages=[{"role": "user", "content": q}]
    )
    results.append(r)
# Low agreement = model is uncertain

A third trap: treating NOT_FOUND as acceptable. When a claim is not found in the source, that often means the LLM fabricated it. In a customer-facing product, fabricated claims are just as damaging as contradicted ones. My rule: in strict mode, NOT_FOUND should be treated the same as CONTRADICTED.

Build a Verdict Summarizer
Write Code

Write a function summarize_verdicts(verdicts) that takes a list of verdict dictionaries (each with a "verdict" key whose value is "SUPPORTED", "CONTRADICTED", or "NOT_FOUND") and returns a summary dictionary with:

  • "total": total number of verdicts
  • "supported": count of SUPPORTED verdicts
  • "contradicted": count of CONTRADICTED verdicts
  • "not_found": count of NOT_FOUND verdicts
  • "groundedness": float, supported / total (0.0 if total is 0)
  • "risk_level": "LOW" if groundedness >= 0.8, "MEDIUM" if >= 0.5, "HIGH" otherwise
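One way to satisfy the spec (a reference sketch, not the only valid answer):

```python
def summarize_verdicts(verdicts):
    total = len(verdicts)
    supported = sum(v["verdict"] == "SUPPORTED" for v in verdicts)
    contradicted = sum(v["verdict"] == "CONTRADICTED" for v in verdicts)
    not_found = sum(v["verdict"] == "NOT_FOUND" for v in verdicts)
    groundedness = supported / total if total else 0.0
    if groundedness >= 0.8:
        risk = "LOW"
    elif groundedness >= 0.5:
        risk = "MEDIUM"
    else:
        risk = "HIGH"
    return {"total": total, "supported": supported, "contradicted": contradicted,
            "not_found": not_found, "groundedness": groundedness, "risk_level": risk}
```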

Performance and Cost Considerations

Hallucination detection adds latency and cost to every LLM response. Here is how the three levels compare:

Cost and latency comparison
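Level 1 string checks cost zero tokens and run in microseconds; the LLM levels scale with claim count and sample count. A back-of-envelope helper; the default token count and per-million-token price are assumptions, so plug in your provider's current rates:

```python
def estimate_cost(n_calls, tokens_per_call=400, price_per_million_tokens=0.15):
    """Rough dollar cost of one verification pass.

    tokens_per_call and price_per_million_tokens are illustrative assumptions;
    self-consistency multiplies n_calls by the number of samples.
    """
    return n_calls * tokens_per_call / 1_000_000 * price_per_million_tokens

# Example: verifying 10 claims vs. a 5-sample self-consistency check
print(estimate_cost(10))      # source-grounded verification, 10 claims
print(estimate_cost(5))       # self-consistency, 5 samples
print(estimate_cost(0))       # level 1 string checks: zero API cost
```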

Frequently Asked Questions

Can I use a different LLM for verification than the one that generated the response?

Yes, and in many cases you should. Using the same model to verify its own output is like asking a student to grade their own exam. A different model (or even a smaller, cheaper model focused on fact-checking) can catch errors the original model is blind to. In practice, gpt-4o-mini works well as a verifier even for responses generated by larger models.

How do I handle hallucinations in streaming responses?

You have two options. First, buffer the complete response and verify before displaying — this adds latency but catches everything. Second, stream the response to the user and run verification in parallel, then display a warning badge if issues are found after the fact. The second approach is better for UX but requires a mechanism to update or annotate the response retroactively.

What groundedness score should I target for production?

It depends on the stakes. For a casual chatbot, a groundedness score of 0.7 (70% of claims supported by source) is often acceptable. For medical, legal, or financial applications, target 0.95 or higher. Below that threshold, either re-generate the response with more explicit grounding instructions or fall back to a canned response that you know is correct.
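Those thresholds can be encoded as a tiny policy function; the domain names and action labels are my own assumptions:

```python
def choose_action(groundedness: float, domain: str = "general") -> str:
    """Pick a response strategy from the groundedness score and the stakes."""
    threshold = 0.95 if domain in {"medical", "legal", "financial"} else 0.7
    if groundedness >= threshold:
        return "serve"
    return "regenerate_with_grounding_or_fallback"
```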

References

  • Wang, X. et al. — "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171 (2022).
  • Min, S. et al. — "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation." arXiv:2305.14251 (2023).
  • Manakul, P. et al. — "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." arXiv:2303.08896 (2023).
  • OpenAI — Chat Completions API documentation.
  • Ji, Z. et al. — "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys (2023).
  • Huang, L. et al. — "A Survey on Hallucination in Large Language Models." arXiv:2311.05232 (2023).