
Hallucination Detection Pipeline: Catch LLM Lies Before Your Users Do

Intermediate · 90 min · 3 exercises · 55 XP

Your LLM just told a customer that your product has a feature it does not have. Or it cited a research paper that does not exist. Or it confidently stated a legal regulation that was repealed three years ago. These are hallucinations, and they are the single biggest reason production AI systems lose trust.

In this tutorial, we build a complete detection pipeline — from cheap string-level checks to LLM-powered claim verification — so you catch the lies before your users do. If you have completed our Your First AI App tutorial and are comfortable with the OpenAI API basics, you have everything you need.

What Are Hallucinations and Why Should You Care?

An LLM hallucination is any generated output that looks plausible but is factually wrong, unsupported by the source material, or entirely fabricated. The model does not "know" it is lying — it is simply predicting the most likely next token, and sometimes the most likely continuation is wrong.

I have seen hallucinations range from subtle (wrong version numbers in documentation) to catastrophic (fabricated legal citations in a compliance tool). The tricky part is that hallucinated text reads exactly like correct text — same confidence, same formatting, same authoritative tone. You cannot eyeball it at scale.

Three broad categories of hallucination matter for detection. Intrinsic hallucinations contradict the provided source, extrinsic ones add claims no source mentions, and factual ones state things that are objectively wrong about the world. The code below defines each with an example and a risk rating.

The three types of hallucination
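One way to pin the taxonomy down is as a small data structure. The definitions come from the text above; the examples and risk ratings are illustrative, not the course's exact values:

```python
# Illustrative catalog of the three hallucination types
HALLUCINATION_TYPES = {
    "intrinsic": {
        "definition": "Contradicts the provided source material",
        "example": "Source says revenue was $5M; the output says $8M",
        "risk": "high",
    },
    "extrinsic": {
        "definition": "Adds claims that no source mentions",
        "example": "Output cites a case study that appears in no source",
        "risk": "medium",
    },
    "factual": {
        "definition": "Objectively wrong about the world",
        "example": "Claims TensorFlow was released by Microsoft",
        "risk": "high",
    },
}

for name, info in HALLUCINATION_TYPES.items():
    print(f"{name}: {info['definition']} (risk: {info['risk']})")
```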

Our detection pipeline will tackle all three. We start with the cheapest, fastest checks and escalate to more expensive LLM-powered verification only when needed.


Setting Up the OpenAI Client

Every code block in this tutorial runs directly in your browser. We will use the OpenAI API for the LLM-powered verification steps. The pure Python checks (string matching, claim extraction logic) need no API key at all.

Install and configure the OpenAI client
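A minimal setup sketch, assuming the `openai` package is installed and your key is exported as `OPENAI_API_KEY`:

```python
# Configuration sketch: assumes `pip install openai` and OPENAI_API_KEY set
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# One cheap round-trip to confirm the key works
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```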

Level 1: Cheap String-Level Checks

Before burning API credits on fancy verification, catch the obvious problems with plain Python. These checks cost zero tokens and run in microseconds. I always run these first in production — they catch more problems than you would expect.

Contradiction Detector

The simplest hallucination signal: the LLM contradicts itself within the same response. The find_number_contradictions function below uses a regex pattern to extract every phrase matching "Entity is/was/has Number" (e.g., "Revenue was 5.2 million"). It groups extracted numbers by entity name, then checks whether any entity appears with two different values. If "Acme Corp" is paired with both 2015 and 2013, that is an immediate red flag. The function returns a list of dictionaries, each with the entity, its conflicting values, and a severity rating.

Detect self-contradicting numbers
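Here is a sketch of what find_number_contradictions can look like. The exact regex and severity scheme in the interactive editor may differ; this version always rates contradictions "high", while a real version might grade by magnitude:

```python
import re
from collections import defaultdict

def find_number_contradictions(text):
    """Flag entities that appear with two or more different numeric values."""
    # Matches "Entity is/was/has ... Number", e.g. "Revenue was 5.2 million",
    # allowing up to three lowercase words between the verb and the number
    pattern = re.compile(
        r"([A-Z][A-Za-z ]*?)\s+(?:is|was|has)\s+(?:[a-z]+\s+){0,3}?\$?(\d+(?:\.\d+)?)"
    )
    values = defaultdict(set)
    for entity, number in pattern.findall(text):
        values[entity.strip()].add(float(number))
    return [
        {"entity": entity, "values": sorted(vals), "severity": "high"}
        for entity, vals in values.items()
        if len(vals) > 1
    ]
```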

The detector catches two contradictions: "Acme Corp" has conflicting founding years (2015 and 2013), and "Revenue" has conflicting values (5.2 and 8.1 million). Simple regex, zero API cost, immediate signal that this response needs re-generation.


Hedging and Confidence Signals

LLMs sometimes hedge when they are uncertain — phrases like "I believe," "approximately," or "it is possible that." These phrases are not proof of hallucination, but they correlate with lower factual accuracy. When an LLM is confident and correct, it rarely hedges.

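A minimal hedging detector might look like the following. The phrase list and the 0.25-per-hit weighting are assumptions for this sketch; tune both against your own traffic:

```python
# Phrases that correlate with model uncertainty
HEDGE_PHRASES = [
    "i believe", "i think", "possibly", "perhaps", "approximately",
    "it is possible that", "might be", "could be", "as far as i know",
]

def hedge_score(text):
    """Return a 0.0-1.0 score: more hedging phrases -> higher score."""
    lowered = text.lower()
    hits = sum(lowered.count(phrase) for phrase in HEDGE_PHRASES)
    return min(1.0, hits * 0.25)  # saturates at four or more hedges
```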

Quick check: What hedge score would you expect for the text "The answer is definitely 42. I am sure of it."? Answer: 0.0 — "definitely" and "I am sure" signal certainty, not uncertainty. Our detector only flags hedging phrases like "I believe" or "possibly." Confident-sounding text gets a clean bill of health, which is exactly what we want.

Claim Extraction — Breaking Output into Verifiable Statements

A paragraph of LLM output might contain five distinct factual claims. To verify each one, we first need to extract them. This is where most people skip a step: they try to verify the entire paragraph as one blob, which misses individual false claims buried in otherwise correct text.

Pattern-Based Claim Extraction (No API Needed)

The extract_factual_claims function below scans text for three distinct regex patterns. First, it finds date claims — years that appear after context words like "in" or "since." Second, it catches numerical claims — amounts followed by units like "million" or "%." Third, it extracts entity claims using "X is/was Y" patterns. A helper function, get_containing_sentence, grabs the full sentence around each match so we have context for downstream verification.

Extract verifiable claims with regex
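A sketch of extract_factual_claims and its get_containing_sentence helper, matching the three patterns described above (the exact regexes in the editor may differ):

```python
import re

def get_containing_sentence(text, start):
    """Return the full sentence that contains character position `start`."""
    left = text.rfind(".", 0, start) + 1
    right = text.find(".", start)
    right = len(text) if right == -1 else right + 1
    return text[left:right].strip()

def extract_factual_claims(text):
    """Extract date, number, and entity claims via regex."""
    patterns = {
        # Years after context words like "in" or "since"
        "date": r"\b(?:in|since)\s+(?:19|20)\d{2}\b",
        # Amounts followed by units
        "number": r"\b\d[\d,.]*\s*(?:million|billion|percent|%)",
        # "X is/was Y" statements
        "entity": r"\b[A-Z]\w+\s+(?:is|was)\s+\w+",
    }
    claims = []
    for claim_type, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            claims.append({
                "type": claim_type,
                "match": m.group(0),
                "sentence": get_containing_sentence(text, m.start()),
            })
    return claims
```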

Notice that our test text contains a deliberate error: TensorFlow was released by Google, not Microsoft. The claim extractor does not judge accuracy — it just identifies the claims. Verification comes next.


LLM-Powered Claim Extraction

Pattern-based extraction misses claims that do not follow neat "X is Y" patterns — comparison claims, causal statements, and implicit assertions all slip through. The function below sends the text to gpt-4o-mini with a system prompt that asks for a JSON array of objects, each containing the claim text, its type (date, number, attribution, definition, or comparison), and whether it is objectively verifiable. We strip any markdown code fences the model might wrap around its JSON before parsing.

LLM-powered claim decomposition
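A sketch of the extractor. To keep it testable without an API key, it takes a `complete(prompt) -> str` callable that wraps your Chat Completions call (e.g. gpt-4o-mini at temperature 0); the fence-stripping step mirrors the cleanup described above:

```python
import json
import re

EXTRACT_PROMPT = """Extract every factual claim from the text below.
Return ONLY a JSON array of objects with keys:
  "claim" (string), "type" (date|number|attribution|definition|comparison),
  "verifiable" (boolean).

Text:
{text}"""

def strip_code_fences(raw):
    """Remove ```json ... ``` wrappers the model sometimes adds."""
    return re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", raw.strip())

def llm_extract_claims(text, complete):
    """`complete(prompt) -> str` should call your LLM and return its reply."""
    raw = complete(EXTRACT_PROMPT.format(text=text))
    return json.loads(strip_code_fences(raw))
```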

The LLM identifies claims that regex would miss — like "surpassing R in 2018" (a comparison claim) and the attribution of TensorFlow to Microsoft (which is wrong — it should be Google). Each claim now becomes an independent verification target.

Self-Consistency Checks — Ask the Same Question Multiple Times

This is one of my favorite detection techniques because the intuition is dead simple: if you ask the same question five times and get five different answers, the model is probably guessing. If all five agree, the answer is more likely correct.

The self_consistency_check function below sends the same question to gpt-4o-mini N times at temperature 0.7, collects all responses, counts unique answers, and computes an agreement ratio between 0.0 (total disagreement) and 1.0 (perfect agreement). When the ratio drops below 0.5, we flag the answer as a likely hallucination.

Self-consistency check across multiple samples
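A sketch of self_consistency_check. The sampling itself is injected as an `ask(question) -> str` callable (which should call the model at temperature ~0.7), so the agreement logic stays testable on its own:

```python
from collections import Counter

def self_consistency_check(question, ask, n=5, threshold=0.5):
    """Sample the same question n times and measure answer agreement.

    `ask(question) -> str` wraps your LLM call at temperature ~0.7.
    """
    answers = [ask(question).strip().lower() for _ in range(n)]
    counts = Counter(answers)
    # Agreement = share of samples that gave the most common answer
    agreement = counts.most_common(1)[0][1] / n
    return {
        "answers": answers,
        "agreement": agreement,
        "likely_hallucination": agreement < threshold,
    }
```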

The agreement ratio alone is useful, but we can sharpen it. The function below goes deeper: it runs self-consistency, then extracts individual claims from each response with our LLM extractor, and compares those claims across responses. Claims that appear in some responses but not others are flagged as potentially hallucinated — a much more granular signal than comparing whole answers.

Claim-level consistency analysis
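A sketch of the claim-level comparison. It takes the sampled responses plus an `extract(response) -> list[str]` claim extractor; exact string matching across responses is a simplification, and a production version would cluster semantically equivalent claims:

```python
def claim_consistency(responses, extract):
    """Flag claims that appear in some sampled responses but not others."""
    per_response = [set(extract(r)) for r in responses]
    n = len(responses)
    report = []
    for claim in set().union(*per_response):
        support = sum(claim in claims for claims in per_response)
        report.append({
            "claim": claim,
            "support": support / n,
            "flagged": support < n,  # missing from at least one response
        })
    # Least-supported (most suspicious) claims first
    return sorted(report, key=lambda c: c["support"])
```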

Source-Grounded Verification — Checking Claims Against Reference Text

Self-consistency tells you whether the model is sure of itself. Source-grounded verification tells you whether the model is correct. The idea: given a reference document (your knowledge base, a retrieved passage, a product spec), check whether each claim in the LLM output is actually supported by that source.

This is the core technique behind RAG evaluation and is essential if your LLM answers questions about specific documents. Without it, the model can confidently fabricate details that sound like they came from the document but did not.

Verify claims against a source document
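A sketch of source-grounded verification with the LLM call injected as `complete(prompt) -> str`. The prompt wording and the conservative fallback for malformed replies are assumptions of this sketch:

```python
VERDICTS = {"SUPPORTED", "CONTRADICTED", "NOT_FOUND"}

VERIFY_PROMPT = """You are a strict fact checker.

Source document:
{source}

Claim: {claim}

Reply with exactly one word: SUPPORTED if the source supports the claim,
CONTRADICTED if the source contradicts it, or NOT_FOUND if the source
does not mention it."""

def verify_claim(claim, source, complete):
    """`complete(prompt) -> str` wraps your LLM call (e.g. gpt-4o-mini)."""
    reply = complete(VERIFY_PROMPT.format(source=source, claim=claim))
    verdict = reply.strip().upper()
    if verdict not in VERDICTS:
        verdict = "NOT_FOUND"  # conservative fallback for malformed output
    return {"claim": claim, "verdict": verdict}

def verify_claims(claims, source, complete):
    return [verify_claim(c, source, complete) for c in claims]
```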

Before reading the results, try this: look at the six claims above and the source document. Which ones would you flag? The verification should catch three problems: the widget count is 50, not 100 (CONTRADICTED); there is no Python notebook feature (NOT_FOUND — the LLM fabricated it); and data retention is 12 months on Standard, not 24 (CONTRADICTED). The other three claims check out.

Build a Claim Extractor Function
Write Code

Write a function extract_year_claims(text) that extracts all year-based claims from a text. The function should find patterns where a year (4-digit number between 1900 and 2030) appears with surrounding context, and return a list of dictionaries with keys "year" (int) and "context" (the sentence containing the year). Important: If the same sentence contains multiple years, return a separate entry for each year.


The HallucinationDetector Class — Putting It All Together

We have built the individual pieces. The class below wires them into a single reusable pipeline that runs checks in strict cost order. Three methods handle the three levels: check_contradictions and check_hedging are free string checks (Level 1), extract_claims sends text to the LLM for claim decomposition (Level 2), and verify_claims fact-checks each claim against a source document (Level 3). The run_pipeline method orchestrates all three, accumulating a risk score as issues surface.

The complete HallucinationDetector class
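A condensed sketch of the class. The LLM-backed levels take a `complete(prompt) -> str` callable so the free checks run without an API key, and the risk weights (0.4 per contradiction, 0.3 per contradicted claim, 0.2 per unfounded claim) are illustrative:

```python
import re
from collections import defaultdict

class HallucinationDetector:
    """Three-level pipeline: free checks first, LLM checks only when wired up."""

    HEDGES = ["i believe", "i think", "possibly", "perhaps", "approximately"]
    NUM_PATTERN = re.compile(
        r"([A-Z][A-Za-z ]*?)\s+(?:is|was|has)\s+(?:[a-z]+\s+){0,3}?(\d+(?:\.\d+)?)"
    )

    def __init__(self, complete=None):
        self.complete = complete  # complete(prompt) -> str, wraps your LLM call

    # Level 1: free string checks
    def check_contradictions(self, text):
        seen = defaultdict(set)
        for entity, num in self.NUM_PATTERN.findall(text):
            seen[entity.strip()].add(float(num))
        return [{"entity": e, "values": sorted(v)}
                for e, v in seen.items() if len(v) > 1]

    def check_hedging(self, text):
        lowered = text.lower()
        return min(1.0, 0.25 * sum(lowered.count(h) for h in self.HEDGES))

    # Level 2: LLM claim extraction (one claim per output line)
    def extract_claims(self, text):
        raw = self.complete(
            "List each factual claim in the text below, one per line.\n\n" + text
        )
        return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

    # Level 3: verify each claim against a source document
    def verify_claims(self, claims, source):
        verdicts = []
        for claim in claims:
            reply = self.complete(
                f"Source:\n{source}\n\nClaim: {claim}\n\n"
                "Answer SUPPORTED, CONTRADICTED, or NOT_FOUND."
            )
            verdicts.append({"claim": claim, "verdict": reply.strip().upper()})
        return verdicts

    def run_pipeline(self, text, source=None):
        report = {
            "contradictions": self.check_contradictions(text),
            "hedge_score": self.check_hedging(text),
            "verdicts": [],
            "risk": 0.0,
        }
        report["risk"] += 0.4 * len(report["contradictions"])
        report["risk"] += 0.2 * report["hedge_score"]
        if self.complete and source:
            claims = self.extract_claims(text)
            report["verdicts"] = self.verify_claims(claims, source)
            for v in report["verdicts"]:
                report["risk"] += {"CONTRADICTED": 0.3,
                                   "NOT_FOUND": 0.2}.get(v["verdict"], 0.0)
        report["risk"] = min(1.0, report["risk"])
        return report
```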

Time to see it in action. The test below feeds the detector a product description that contains several fabrications: Oracle is not listed as a supported data source, the node limit is 50 not 100, the user limit is 10 not 25, and "ML model training" and "$50M valuation" are entirely invented. Watch how the pipeline surfaces each one.

Run the full pipeline

The pipeline catches multiple issues: Oracle is not a supported data source (contradicted), 100 nodes is wrong — it is 50 (contradicted), 25 users is wrong — it is 10 (contradicted), and the ML training capabilities and $50M valuation are entirely fabricated (not found in source). This is exactly the kind of output that would damage user trust if served unchecked.

Build a Confidence Scorer
Write Code

Write a function score_confidence(text) that analyzes a text for confidence signals and returns a dictionary with three keys:

  • "confident_phrases": count of confident phrases (e.g., "definitely", "certainly", "without a doubt", "it is clear", "undoubtedly")
  • "uncertain_phrases": count of uncertain phrases (e.g., "maybe", "perhaps", "possibly", "might", "could be", "I think", "I believe")
  • "confidence_ratio": a float from 0.0 to 1.0 calculated as confident / (confident + uncertain). If both counts are 0, return 0.5.
Use case-insensitive matching for all phrase counts.


Real-World Pipeline Patterns

In production, you rarely run the full pipeline on every response. Here are the patterns I have seen work well across different use cases.

Pattern 1: Gate-and-Escalate

The gate-and-escalate function below runs free string checks on every response. If those checks pass cleanly (no contradictions, low hedging), it returns immediately without spending any API tokens. Only when a red flag appears — or when strict mode is enabled for high-stakes outputs — does it invoke the full LLM-powered pipeline. Two test cases demonstrate both paths.

Gate-and-escalate pattern
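A minimal sketch of the pattern, with the cheap and expensive stages injected as callables so both paths are easy to exercise:

```python
def gate_and_escalate(text, cheap_checks, full_pipeline, strict=False):
    """Run free checks first; escalate to the LLM pipeline only on red flags.

    `cheap_checks(text)` returns a list of issues (empty list = clean);
    `full_pipeline(text)` is the expensive LLM-backed verification.
    """
    issues = cheap_checks(text)
    if not issues and not strict:
        # Clean response, zero API tokens spent
        return {"escalated": False, "issues": [], "report": None}
    return {"escalated": True, "issues": issues, "report": full_pipeline(text)}
```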

Pattern 2: Batch Verification for RAG

When your LLM generates answers based on retrieved documents, you need to verify the answer against the retrieved chunks. The verify_rag_response function below concatenates all retrieved chunks into a single source document, runs the full detection pipeline against it, and computes a groundedness score — the fraction of claims that the source material actually supports. If you are building a RAG pipeline from scratch or using LangChain for RAG, this verification step slots in right after generation.

RAG response verification
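A sketch of verify_rag_response, with the extraction and verification steps injected as callables (`extract(text) -> list[str]`; `verify(claim, source) -> str`, returning SUPPORTED, CONTRADICTED, or NOT_FOUND):

```python
def verify_rag_response(answer, retrieved_chunks, extract, verify):
    """Groundedness = fraction of claims the retrieved chunks support."""
    source = "\n\n".join(retrieved_chunks)  # one combined source document
    claims = extract(answer)
    verdicts = [{"claim": c, "verdict": verify(c, source)} for c in claims]
    supported = sum(v["verdict"] == "SUPPORTED" for v in verdicts)
    return {
        "groundedness": supported / len(claims) if claims else 1.0,
        "verdicts": verdicts,
    }
```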

Common Mistakes and How to Avoid Them

Verifying the whole paragraph as one unit
# BAD: treats entire response as single claim
response = "Python was made in 1991 by Guido. It has 50M users."
is_correct = verify(response)  # Misses individual false claims
Breaking into individual claims first
# GOOD: extract then verify each claim
claims = extract_claims(response)
for claim in claims:
    result = verify(claim, source)
    # Catches "50M users" as unverifiable

Another mistake I see frequently: running self-consistency checks at temperature 0. At zero temperature, you get nearly identical responses every time, so the agreement ratio is always high. That tells you nothing about whether the answer is actually correct. Use temperature 0.7 or higher for self-consistency.

Self-consistency at temperature 0
# BAD: identical responses = false confidence
results = []
for _ in range(5):
    r = await client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,  # Always same answer!
        messages=[{"role": "user", "content": q}]
    )
    results.append(r)
# Agreement = 100% but answer may be wrong
Self-consistency at temperature 0.7
# GOOD: diverse sampling reveals uncertainty
results = []
for _ in range(5):
    r = await client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.7,  # Allows variation
        messages=[{"role": "user", "content": q}]
    )
    results.append(r)
# Low agreement = model is uncertain

A third trap: treating NOT_FOUND as acceptable. When a claim is not found in the source, that often means the LLM fabricated it. In a customer-facing product, fabricated claims are just as damaging as contradicted ones. My rule: in strict mode, NOT_FOUND should be treated the same as CONTRADICTED.

Build a Verdict Summarizer
Write Code

Write a function summarize_verdicts(verdicts) that takes a list of verdict dictionaries (each with a "verdict" key whose value is "SUPPORTED", "CONTRADICTED", or "NOT_FOUND") and returns a summary dictionary with:

  • "total": total number of verdicts
  • "supported": count of SUPPORTED verdicts
  • "contradicted": count of CONTRADICTED verdicts
  • "not_found": count of NOT_FOUND verdicts
  • "groundedness": float, supported / total (0.0 if total is 0)
  • "risk_level": "LOW" if groundedness >= 0.8, "MEDIUM" if >= 0.5, "HIGH" otherwise

Performance and Cost Considerations

Hallucination detection adds latency and cost to every LLM response. The table below compares the three pipeline levels so you can make informed trade-offs for your specific use case.

Cost and latency comparison
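As a rough qualitative guide until you measure your own traffic (the relative costs follow directly from how many model calls each level makes):

  • Level 1 (string checks): no API calls, effectively free, runs in microseconds.
  • Level 2 (LLM claim extraction): one model call per response, so cost and latency grow with your request volume.
  • Level 3 (claim verification / self-consistency): one call per claim, or N samples per question, so cost scales with claim count and dominates the pipeline.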

Frequently Asked Questions

Can I use a different LLM for verification than the one that generated the response?

Yes, and in many cases you should. Using the same model to verify its own output is like asking a student to grade their own exam. A different model (or even a smaller, cheaper model focused on fact-checking) can catch errors the original model is blind to. In practice, gpt-4o-mini works well as a verifier even for responses generated by larger models.

How do I handle hallucinations in streaming responses?

You have two options. First, buffer the complete response and verify before displaying — this adds latency but catches everything. Second, stream the response to the user and run verification in parallel, then display a warning badge if issues are found after the fact. The second approach is better for UX but requires a mechanism to update or annotate the response retroactively.

What groundedness score should I target for production?

It depends on the stakes. For a casual chatbot, a groundedness score of 0.7 (70% of claims supported by source) is often acceptable. For medical, legal, or financial applications, target 0.95 or higher. Below that threshold, either re-generate the response with more explicit grounding instructions or fall back to a canned response.
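The threshold logic above can be sketched as a small router; the 0.95 default comes from the text, while the regenerate band and fallback message are illustrative:

```python
def route_response(response, groundedness, threshold=0.95,
                   fallback="I'm not fully confident in this answer."):
    """Serve, regenerate, or fall back based on groundedness vs threshold."""
    if groundedness >= threshold:
        return {"action": "serve", "text": response}
    if groundedness >= threshold - 0.2:
        # Close to the bar: retry with stricter grounding instructions
        return {"action": "regenerate", "text": None}
    return {"action": "fallback", "text": fallback}
```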

How does NLI-based detection differ from LLM-as-judge?

NLI (Natural Language Inference) uses a smaller, specialized model like DeBERTa to classify whether a claim is entailed, contradicted, or neutral relative to a source passage. It is fast and free to run locally, but less flexible than an LLM-as-judge for complex or nuanced claims. LLM-as-judge (what we built in this tutorial) handles more varied claim types but costs API tokens and is slower. I prefer NLI for high-volume pre-filtering and LLM-as-judge for the final verdict on flagged claims.


Beyond Our Pipeline — Other Detection Approaches

The pipeline we built covers the most practical techniques, but the research landscape is broader. I want to mention the major alternatives so you know what exists when you need to scale beyond string checks and LLM-as-judge verification.

Detection approach comparison
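A qualitative summary of the trade-offs discussed in this tutorial and the FAQ above:

  • String checks: run locally, free, instant; only catch surface-level problems like self-contradictions and hedging.
  • Self-consistency sampling (SelfCheckGPT-style): needs N extra model calls per question; requires no source document, but misses errors the model makes consistently.
  • NLI models (e.g. DeBERTa): run locally, cheap and fast per claim; less flexible for complex or nuanced claims.
  • LLM-as-judge (what we built here): handles the most varied claim types; costs API tokens and adds latency.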

What to Learn Next

You now have a working hallucination detection pipeline that scales from free regex checks to LLM-powered claim verification. Here is where to go next on PythonBook.

References

  • Wang, X. et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171 (2022).
  • Min, S. et al. "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation." arXiv:2305.14251 (2023).
  • Manakul, P. et al. "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." arXiv:2303.08896 (2023).
  • OpenAI. Chat Completions API documentation.
  • Ji, Z. et al. "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys (2023).
  • Huang, L. et al. "A Survey on Hallucination in Large Language Models." arXiv:2311.05232 (2023).
