Hallucination Detection Pipeline: Catch LLM Lies Before Your Users Do
Your LLM just told a customer that your product has a feature it does not have. Or it cited a research paper that does not exist. Or it confidently stated a legal regulation that was repealed three years ago. These are hallucinations, and they are the single biggest reason production AI systems lose trust. In this tutorial, we build a complete detection pipeline — from cheap string-level checks to LLM-powered claim verification — so you catch the lies before your users do.
What Are Hallucinations and Why Should You Care?
An LLM hallucination is any generated output that looks plausible but is factually wrong, unsupported by the source material, or entirely fabricated. The model does not "know" it is lying — it is simply predicting the most likely next token, and sometimes the most likely continuation is wrong.
I have seen hallucinations range from subtle (wrong version numbers in documentation) to catastrophic (fabricated legal citations in a compliance tool). The tricky part is that hallucinated text reads exactly like correct text: same confidence, same formatting, same authoritative tone. You cannot eyeball it at scale. You need automated detection. There are three broad categories worth understanding before we start building: output that is factually wrong, output that is not supported by the source material, and output that is fabricated outright.
Our detection pipeline will tackle all three. We start with the cheapest, fastest checks and escalate to more expensive LLM-powered verification only when needed.
Setting Up the OpenAI Client
We will use the OpenAI API for the LLM-powered verification steps. The pure Python checks (string matching, claim extraction logic) need no API key at all.
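A minimal setup sketch, assuming the official `openai` Python package (version 1 or later) and an API key stored in the `OPENAI_API_KEY` environment variable. The helper name `make_client` is hypothetical; the import is deferred so the pure Python checks still run when the package is not installed.

```python
import os


def make_client(api_key=None):
    """Return an OpenAI client, or raise if no API key is configured."""
    key = api_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before running the LLM-powered steps.")
    # Imported lazily so the string-level checks work without the package.
    from openai import OpenAI  # pip install openai
    return OpenAI(api_key=key)
```

Only the LLM-powered steps need to call this helper; everything at Level 1 can run without it.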
Level 1: Cheap String-Level Checks
Before burning API credits on fancy verification, catch the obvious problems with plain Python. These checks cost zero tokens and run in microseconds. I always run these first in production — they catch more problems than you would expect.
Contradiction Detector
The simplest hallucination signal: the LLM contradicts itself within the same response. Look for numerical inconsistencies and conflicting statements.
The detector catches two contradictions: "Acme Corp" has conflicting founding years (2015 and 2013), and "Revenue" has conflicting values (5.2 and 8.1 million). Simple regex, zero API cost, immediate signal that this response needs re-generation.
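The kind of check described above might look like the following sketch. The two regex patterns and the sample text are assumptions made for illustration, chosen to reproduce the two contradictions mentioned; a real detector would need patterns tuned to its own domain.

```python
import re
from collections import defaultdict

# Each pattern captures (subject, value). Both patterns are illustrative assumptions.
PATTERNS = [
    re.compile(r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*) was (?:founded|established) in (\d{4})"),
    re.compile(r"(Revenue) (?:was|reached) \$(\d+(?:\.\d+)?) million"),
]


def find_contradictions(text):
    """Return {subject: [values]} for subjects stated with more than one distinct value."""
    seen = defaultdict(list)
    for pattern in PATTERNS:
        for subject, value in pattern.findall(text):
            if value not in seen[subject]:
                seen[subject].append(value)
    return {subject: values for subject, values in seen.items() if len(values) > 1}


# Hypothetical sample text containing two self-contradictions.
sample = (
    "Acme Corp was founded in 2015. Revenue reached $5.2 million last year. "
    "Acme Corp was established in 2013 by two engineers. Revenue was $8.1 million."
)
print(find_contradictions(sample))
# {'Acme Corp': ['2015', '2013'], 'Revenue': ['5.2', '8.1']}
```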
Hedging and Confidence Signals
LLMs sometimes hedge when they are uncertain — phrases like "I believe," "approximately," or "it is possible that." These hedge phrases are not proof of hallucination, but they correlate with lower factual accuracy. When an LLM is confident and correct, it rarely hedges.
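A hedge detector along these lines can be sketched in a few lines of Python. The phrase list beyond the three examples above and the scoring rule (hedge hits per sentence, capped at 1.0) are assumptions, not a standard metric.

```python
import re

# "I believe", "approximately" and "it is possible that" come from the text above;
# the remaining phrases are assumed additions.
HEDGE_PHRASES = [
    "i believe", "i think", "approximately", "it is possible that",
    "possibly", "perhaps", "might", "may be",
]
HEDGE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in HEDGE_PHRASES) + r")\b",
    re.IGNORECASE,
)


def hedge_score(text):
    """Hedge phrases per sentence, capped at 1.0. Returns 0.0 when nothing hedges."""
    # Rough sentence split: decimals such as "5.2" will be over-counted.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    hits = len(HEDGE_RE.findall(text))
    return min(1.0, hits / len(sentences))
```

A high score is a reason to look closer, not a verdict: it only says the text sounds unsure.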
Quick check: What hedge score would you expect for the text "The answer is definitely 42. I am sure of it."? Answer: 0.0 — "definitely" and "I am sure" signal certainty, not uncertainty. Our detector only flags hedging phrases like "I believe" or "possibly." Confident-sounding text gets a clean bill of health, which is exactly what we want.
Claim Extraction — Breaking Output into Verifiable Statements
A paragraph of LLM output might contain five distinct factual claims. To verify each one, we first need to extract them. This is where most people skip a step: they try to verify the entire paragraph as one blob. That misses individual false claims buried in otherwise correct text.
We can extract claims two ways: with regex patterns for structured claims (numbers, dates, names) and with an LLM for nuanced factual statements. Let us build both.
Pattern-Based Claim Extraction (No API Needed)
We scan for three patterns: date claims (years mentioned with context words like "in" or "since"), numerical claims (amounts with units like "million" or "%"), and entity claims ("X is/was Y" patterns). The function returns each claim with its type and the full sentence it appeared in.
Notice that our test text contains a deliberate error: TensorFlow was released by Google, not Microsoft. The claim extractor does not judge accuracy — it just identifies the claims. Verification comes next.
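The three patterns described above could be sketched as follows. The regexes, the function name `extract_claims`, and the sample text are assumptions for illustration; the sample keeps the deliberate TensorFlow error and adds an invented percentage.

```python
import re

DATE_RE = re.compile(r"\b(?:in|since)\s+((?:19|20)\d{2})\b", re.IGNORECASE)
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\s*(?:%|million|billion|thousand)", re.IGNORECASE)
ENTITY_RE = re.compile(r"\b([A-Z][A-Za-z]+)\s+(?:is|was)\s+([^.,;]+)")


def split_sentences(text):
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_claims(text):
    """Return a list of {"type", "claim", "sentence"} dicts. No accuracy judgment is made."""
    claims = []
    for sentence in split_sentences(text):
        for kind, pattern in (("date", DATE_RE), ("number", NUMBER_RE), ("entity", ENTITY_RE)):
            for match in pattern.finditer(sentence):
                claims.append({"type": kind, "claim": match.group(0), "sentence": sentence})
    return claims


# Hypothetical sample; the Microsoft attribution is deliberately wrong.
sample = "TensorFlow was released by Microsoft in 2015. Its docs claim a 40% speedup."
for claim in extract_claims(sample):
    print(claim["type"], "|", claim["claim"])
```

The same sentence can yield several claims of different types, which is intended: each one becomes its own verification target.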
LLM-Powered Claim Extraction
Pattern-based extraction misses claims that do not follow neat "X is Y" patterns. For thorough extraction, we ask the LLM itself to decompose a paragraph into individual claims. This costs a few cents per call but catches far more than regex can, including comparisons and attributions.
The LLM identifies claims that regex would miss — like "surpassing R in 2018" (a comparison claim) and the attribution of TensorFlow to Microsoft (which is wrong — it should be Google). Each claim now becomes an independent verification target.
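One way to wire this up, sketched under assumptions: the prompt wording, the function name `extract_claims_llm`, and the JSON-array output format are all illustrative choices. The `client` argument is expected to be an OpenAI client (or anything exposing the same `chat.completions.create` call), which also makes the function easy to test with a stub.

```python
import json


def extract_claims_llm(client, text, model="gpt-4o-mini"):
    """Ask a chat model to split text into atomic factual claims."""
    prompt = (
        "Split the text below into individual factual claims. "
        "Return only a JSON array of strings, one claim per element.\n\n"
        f"Text: {text}"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    raw = response.choices[0].message.content.strip()
    try:
        claims = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: treat each non-empty line as one claim.
        claims = [line.strip("-• ").strip() for line in raw.splitlines()]
    if not isinstance(claims, list):
        return []
    return [c for c in claims if isinstance(c, str) and c]
```

Models sometimes wrap JSON in a code fence or add commentary, so the line-based fallback is deliberately crude; stricter output parsing is worth adding before relying on this in production.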
Self-Consistency Checks — Ask the Same Question Multiple Times
This is one of my favourite detection techniques because the intuition is dead simple: if you ask the same question five times and get five different answers, the model is probably guessing. If all five agree, the answer is more likely correct.
Self-consistency was introduced in the paper "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Wang et al., 2022). The idea: sample multiple responses at a higher temperature, then check whether they converge on the same answer. Disagreement is a hallucination signal.
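A rough sketch of the sampling-and-agreement idea follows. The normalization rule (lowercase, trimmed punctuation) and the function names are assumptions, and exact-match agreement only makes sense for short answers, which is why the sampling helper asks for a few words.

```python
from collections import Counter


def normalize(answer):
    """Lowercase, trim surrounding punctuation, and collapse whitespace."""
    return " ".join(answer.lower().strip(" .!?\n").split())


def agreement_ratio(answers):
    """Share of samples that match the most common normalized answer."""
    if not answers:
        return 0.0
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def sample_answers(client, question, n=5, temperature=0.7, model="gpt-4o-mini"):
    """Ask the same question n times; `client` is assumed to be an OpenAI client."""
    answers = []
    for _ in range(n):
        r = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": question + " Answer in a few words."}],
        )
        answers.append(r.choices[0].message.content)
    return answers


print(agreement_ratio(["Paris", "paris.", "Paris", "Lyon", "Paris!"]))  # 0.8
```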
The agreement ratio alone is useful, but we can do better. Let us extract the key factual claims from each response and compare them directly.
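As a simplified stand-in for claim-level comparison, the sketch below treats every number in a response as a "key claim" and measures how many of the sampled responses mention it. Restricting claims to numbers and the 0.6 threshold are both assumptions; a fuller version would compare extracted claims of every type.

```python
import re
from collections import Counter

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def number_support(responses):
    """Map each number to the fraction of responses that mention it."""
    counts = Counter()
    for response in responses:
        counts.update(set(NUMBER_RE.findall(response)))  # count once per response
    return {num: counts[num] / len(responses) for num in counts}


def unstable_numbers(responses, threshold=0.6):
    """Numbers that appear in fewer than `threshold` of the responses."""
    support = number_support(responses)
    return sorted(num for num, share in support.items() if share < threshold)


# Hypothetical samples for the same question.
samples = ["Founded in 1991 by Guido.", "It appeared in 1991.", "Released in 1989."]
print(unstable_numbers(samples))  # ['1989']
```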
Source-Grounded Verification — Checking Claims Against Reference Text
Self-consistency tells you whether the model is sure of itself. Source-grounded verification tells you whether the model is correct. The idea: given a reference document (your knowledge base, a retrieved passage, a product spec), check whether each claim in the LLM output is actually supported by that source.
This is the core technique behind RAG evaluation and is essential if your LLM is supposed to answer questions about specific documents. Without it, the model can confidently fabricate details that sound like they came from the document but did not.
Before reading the results, try this: look at the six claims above and the source document. Which ones would you flag? The verification should catch three problems: the widget count is 50, not 100 (CONTRADICTED); there is no Python notebook feature (NOT_FOUND — the LLM fabricated it); and data retention is 12 months on Standard, not 24 (CONTRADICTED). The other three claims check out.
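The verification step itself can be sketched as a single prompt per claim. This is not the exact example discussed above: the prompt wording and function names are assumptions, and `client` is assumed to expose the OpenAI `chat.completions.create` call.

```python
VERDICTS = ("SUPPORTED", "CONTRADICTED", "NOT_FOUND")


def verify_claim(client, claim, source, model="gpt-4o-mini"):
    """Return SUPPORTED, CONTRADICTED, or NOT_FOUND for one claim against a source text."""
    prompt = (
        "You are checking a claim against a source document.\n"
        "Answer with exactly one word: SUPPORTED, CONTRADICTED, or NOT_FOUND.\n\n"
        f"Source:\n{source}\n\nClaim: {claim}"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    words = response.choices[0].message.content.upper().split()
    first = words[0].strip(".,:;\"'") if words else ""
    # Anything that is not a clean label is treated as unsupported.
    return first if first in VERDICTS else "NOT_FOUND"


def verify_claims(client, claims, source):
    """Verify each claim independently and keep the claim next to its verdict."""
    return [{"claim": c, "verdict": verify_claim(client, c, source)} for c in claims]
```

Reading only the first word of the reply avoids misparsing answers such as "NOT SUPPORTED", at the cost of defaulting to NOT_FOUND whenever the model ignores the format.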
Write a function extract_year_claims(text) that extracts all year-based claims from a text. The function should find patterns where a year (4-digit number between 1900 and 2030) appears with surrounding context, and return a list of dictionaries with keys "year" (int) and "context" (the sentence containing the year). Important: If the same sentence contains multiple years, return a separate entry for each year.
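One possible solution, offered as a sketch rather than the reference answer. It assumes that a "sentence" ends at `.`, `!` or `?` followed by whitespace, which keeps decimals such as "2.0" intact.

```python
import re

# Four-digit years from 1900 through 2030, as the exercise specifies.
YEAR_RE = re.compile(r"\b(19\d{2}|20[0-2]\d|2030)\b")


def extract_year_claims(text):
    """Return one {"year", "context"} entry per year found, in reading order."""
    claims = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for match in YEAR_RE.finditer(sentence):
            claims.append({"year": int(match.group(1)), "context": sentence})
    return claims


result = extract_year_claims("Python appeared in 1991. Version 3 came in 2008, after 2.0 in 2000.")
print([c["year"] for c in result])  # [1991, 2008, 2000]
```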
The HallucinationDetector Class — Putting It All Together
Now we combine every technique into a single reusable class. The detector runs checks in order of cost: free string checks first, then LLM-powered verification only if needed. This layered approach keeps costs low while catching hallucinations at every level.
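A compact sketch of such a layered class is shown below. Several things here are assumptions: the verifier is injected as a plain callable `verify_fn(claim, source)` so the class works with any backend, claims are approximated by sentences, hedges are reported as a signal without flagging on their own, and "only if needed" is read as "only when a source and a verifier are supplied".

```python
import re

HEDGES = re.compile(r"\b(?:i believe|i think|possibly|perhaps|approximately)\b", re.IGNORECASE)


class HallucinationDetector:
    def __init__(self, verify_fn=None, strict=True):
        self.verify_fn = verify_fn  # callable(claim, source) -> verdict string
        self.strict = strict        # strict mode treats NOT_FOUND like CONTRADICTED

    def cheap_checks(self, text):
        """Free, regex-only signals."""
        return {"hedges": HEDGES.findall(text)}

    def split_claims(self, text):
        """Approximate claims by sentences."""
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def check(self, text, source=None):
        report = {"cheap": self.cheap_checks(text), "verdicts": [], "flagged": False}
        # Escalate to the paid step only when there is something to verify against.
        if self.verify_fn and source:
            bad = {"CONTRADICTED", "NOT_FOUND"} if self.strict else {"CONTRADICTED"}
            for claim in self.split_claims(text):
                verdict = self.verify_fn(claim, source)
                report["verdicts"].append({"claim": claim, "verdict": verdict})
                if verdict in bad:
                    report["flagged"] = True
        return report
```

Injecting the verifier keeps the class testable offline: a substring-matching function can stand in for the LLM call during development.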
The class follows a simple pattern: cheap checks first, then escalate. Let us run it on a real example to see the full pipeline in action.
The pipeline catches multiple issues: Oracle is not a supported data source (contradicted), 100 nodes is wrong — it is 50 (contradicted), 25 users is wrong — it is 10 (contradicted), and the ML training capabilities and $50M valuation are entirely fabricated (not found in source). This is exactly the kind of output that would damage user trust if served unchecked.
Write a function score_confidence(text) that analyzes a text for confidence signals and returns a dictionary with three keys:
- "confident_phrases": count of confident phrases (e.g., "definitely", "certainly", "without a doubt", "it is clear", "undoubtedly")
- "uncertain_phrases": count of uncertain phrases (e.g., "maybe", "perhaps", "possibly", "might", "could be", "I think", "I believe")
- "confidence_ratio": a float from 0.0 to 1.0 calculated as confident / (confident + uncertain). If both counts are 0, return 0.5.

Use case-insensitive matching.
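One possible solution, as a sketch. It assumes the example phrases in the exercise are the complete phrase lists and that matches should respect word boundaries.

```python
import re

CONFIDENT = ["definitely", "certainly", "without a doubt", "it is clear", "undoubtedly"]
UNCERTAIN = ["maybe", "perhaps", "possibly", "might", "could be", "i think", "i believe"]


def count_phrases(text, phrases):
    """Count case-insensitive, whole-word occurrences of any phrase."""
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def score_confidence(text):
    confident = count_phrases(text, CONFIDENT)
    uncertain = count_phrases(text, UNCERTAIN)
    total = confident + uncertain
    return {
        "confident_phrases": confident,
        "uncertain_phrases": uncertain,
        "confidence_ratio": confident / total if total else 0.5,
    }


print(score_confidence("It is definitely true. Maybe. I think so, perhaps."))
# {'confident_phrases': 1, 'uncertain_phrases': 3, 'confidence_ratio': 0.25}
```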
Real-World Pipeline Patterns
In production, you rarely run the full pipeline on every response. Here are the patterns I have seen work well across different use cases.
Pattern 1: Gate-and-Escalate
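One plausible reading of gate-and-escalate, sketched under assumptions: every response passes through free checks (the gate), and only responses that trip one of them are sent to the expensive verifier. The function name, the shape of `cheap_checks`, and the returned dictionary are all hypothetical.

```python
def gate_and_escalate(text, cheap_checks, expensive_check):
    """Run cheap checks; call the expensive check only if one of them fires.

    cheap_checks: dict of name -> callable(text) returning True when suspicious.
    expensive_check: callable(text) returning True when the text is acceptable.
    """
    signals = [name for name, check in cheap_checks.items() if check(text)]
    if not signals:
        return {"action": "serve", "signals": [], "escalated": False}
    passed = expensive_check(text)
    return {
        "action": "serve" if passed else "block",
        "signals": signals,
        "escalated": True,
    }
```

The trade-off is explicit: confident, self-consistent fabrications pass the gate unverified, so this pattern suits lower-stakes traffic where cost matters more than full coverage.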
Pattern 2: Batch Verification for RAG
When your LLM generates answers based on retrieved documents, verify the answer against the retrieved chunks in batch. This is the pattern used in production RAG evaluation.
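A sketch of batch verification against retrieved chunks follows. The combination rule is an assumption: a claim counts as SUPPORTED if any chunk supports it, CONTRADICTED if none supports it and at least one contradicts it, and NOT_FOUND otherwise. `verify_fn(claim, chunk)` is a hypothetical callable that would normally wrap an LLM call; a thread pool runs the pairs concurrently because such calls are I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def combine(verdicts):
    """Collapse per-chunk verdicts for one claim into a single verdict."""
    if "SUPPORTED" in verdicts:
        return "SUPPORTED"
    if "CONTRADICTED" in verdicts:
        return "CONTRADICTED"
    return "NOT_FOUND"


def verify_batch(claims, chunks, verify_fn, max_workers=4):
    """Check every claim against every retrieved chunk and combine the verdicts."""
    pairs = [(claim, chunk) for claim in claims for chunk in chunks]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flat = list(pool.map(lambda pair: verify_fn(*pair), pairs))  # order preserved
    results = {}
    for i, claim in enumerate(claims):
        per_chunk = flat[i * len(chunks):(i + 1) * len(chunks)]
        results[claim] = combine(per_chunk)
    return results
```

Checking each claim against each chunk costs claims × chunks calls, so in practice it is worth limiting verification to the top few retrieved chunks.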
Common Mistakes and How to Avoid Them
The first mistake is verifying the whole response as one blob instead of claim by claim.

    # BAD: treats the entire response as a single claim
    response = "Python was made in 1991 by Guido. It has 50M users."
    is_correct = verify(response, source)  # Misses individual false claims

    # GOOD: extract, then verify each claim
    claims = extract_claims(response)
    for claim in claims:
        result = verify(claim, source)
    # Catches "50M users" as unverifiable

Another mistake I see frequently: running self-consistency checks at temperature 0. At zero temperature, you get nearly identical responses every time, so the agreement ratio is always high. That tells you nothing about whether the answer is actually correct. Use temperature 0.7 or higher for self-consistency.
    # BAD: identical responses = false confidence
    # (assumes an AsyncOpenAI client and an enclosing async function)
    results = []
    for _ in range(5):
        r = await client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.0,  # Always same answer!
            messages=[{"role": "user", "content": q}],
        )
        results.append(r.choices[0].message.content)
    # Agreement = 100% but answer may be wrong

    # GOOD: diverse sampling reveals uncertainty
    results = []
    for _ in range(5):
        r = await client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.7,  # Allows variation
            messages=[{"role": "user", "content": q}],
        )
        results.append(r.choices[0].message.content)
    # Low agreement = model is uncertain

A third trap: treating NOT_FOUND as acceptable. When a claim is not found in the source, that often means the LLM fabricated it. In a customer-facing product, fabricated claims are just as damaging as contradicted ones. My rule: in strict mode, NOT_FOUND should be treated the same as CONTRADICTED.
Write a function summarize_verdicts(verdicts) that takes a list of verdict dictionaries (each with a "verdict" key whose value is "SUPPORTED", "CONTRADICTED", or "NOT_FOUND") and returns a summary dictionary with:
- "total": total number of verdicts
- "supported": count of SUPPORTED verdicts
- "contradicted": count of CONTRADICTED verdicts
- "not_found": count of NOT_FOUND verdicts
- "groundedness": float, supported / total (0.0 if total is 0)
- "risk_level": "LOW" if groundedness >= 0.8, "MEDIUM" if >= 0.5, "HIGH" otherwise

Performance and Cost Considerations
Hallucination detection adds latency and cost to every LLM response. Here is how the three levels compare:
Frequently Asked Questions
Can I use a different LLM for verification than the one that generated the response?
Yes, and in many cases you should. Using the same model to verify its own output is like asking a student to grade their own exam. A different model (or even a smaller, cheaper model focused on fact-checking) can catch errors the original model is blind to. In practice, gpt-4o-mini works well as a verifier even for responses generated by larger models.
How do I handle hallucinations in streaming responses?
You have two options. First, buffer the complete response and verify before displaying — this adds latency but catches everything. Second, stream the response to the user and run verification in parallel, then display a warning badge if issues are found after the fact. The second approach is better for UX but requires a mechanism to update or annotate the response retroactively.
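The second option can be sketched with `asyncio`. Everything here is an assumption about how the pieces fit together: `chunks` is any async iterator of text pieces, `verify` is an async callable that returns a list of issues, and `emit` and `warn` are callbacks into the UI. Verification in this sketch starts once the stream has finished, which is the simplest form of checking after the fact.

```python
import asyncio


async def stream_and_verify(chunks, verify, emit, warn):
    """Show chunks as they arrive, then verify the full text and warn if needed."""
    parts = []
    async for chunk in chunks:
        emit(chunk)          # the user sees text immediately
        parts.append(chunk)
    full_text = "".join(parts)
    issues = await verify(full_text)
    if issues:
        warn(issues)         # e.g. attach a warning badge to the message
    return full_text, issues


# Hypothetical stand-ins for a model stream and a verifier.
async def demo_stream():
    for piece in ["We support ", "Oracle."]:
        yield piece


async def demo_verify(text):
    return ["unsupported: Oracle"] if "Oracle" in text else []


asyncio.run(stream_and_verify(demo_stream(), demo_verify, print, print))
```

The first option (buffer, then verify) is the same function with `emit` called once on `full_text` after verification instead of inside the loop.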
What groundedness score should I target for production?
It depends on the stakes. For a casual chatbot, a groundedness score of 0.7 (70% of claims supported by source) is often acceptable. For medical, legal, or financial applications, target 0.95 or higher. Below that threshold, either re-generate the response with more explicit grounding instructions or fall back to a canned response that you know is correct.
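Applying such a threshold is a small amount of code. In this sketch the two thresholds come from the answer above, while the profile names and the returned action strings are assumptions.

```python
# Thresholds taken from the guidance above; profile names are illustrative.
THRESHOLDS = {"casual": 0.7, "high_stakes": 0.95}


def groundedness(verdicts):
    """Fraction of verdict strings equal to SUPPORTED (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    return sum(v == "SUPPORTED" for v in verdicts) / len(verdicts)


def decide(verdicts, profile="casual"):
    """Serve the response, or ask for a re-generation or canned fallback."""
    score = groundedness(verdicts)
    return "serve" if score >= THRESHOLDS[profile] else "regenerate_or_fallback"


verdicts = ["SUPPORTED", "SUPPORTED", "SUPPORTED", "NOT_FOUND"]
print(decide(verdicts, "casual"))       # serve (0.75 >= 0.7)
print(decide(verdicts, "high_stakes"))  # regenerate_or_fallback (0.75 < 0.95)
```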