Build an Automated Prompt Evaluation Pipeline with LLM-as-Judge
You rewrote your prompt three times this week. Each version "felt" better, but you have no data to prove it. Your team picks the prompt that the most senior person likes, and nobody revisits the decision. This is how most teams ship prompts today — on vibes.
By the end of this tutorial, you will have a reusable EvalPipeline class that scores prompt variants against a rubric, runs them across test cases, and tells you which variant is statistically better — not just anecdotally better.
Why Prompt Evaluation Matters
I have watched teams iterate on prompts for weeks, only to ship the version that happened to produce one good output during a live demo. The fundamental problem is straightforward: without a systematic way to measure prompt quality, every decision is subjective.
Consider a prompt that generates product descriptions. Version A is concise. Version B adds emotional appeal. Version C includes technical specs. Which is "best"? That depends entirely on what you are optimizing for — and you cannot optimize what you do not measure.
A prompt evaluation pipeline solves this by giving you three things: a rubric that defines what "good" means, a judge that scores outputs against that rubric, and a statistical test that tells you whether the score difference is real or noise.
Here is a minimal example. We will give the LLM a product description and ask it to score the description on a 1-5 scale for clarity.
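A sketch of that call, assuming the async OpenAI client with an `OPENAI_API_KEY` in the environment (the jargon-heavy description and the exact prompt wording are invented for illustration):

```python
import asyncio

# A deliberately jargon-heavy description for the judge to score
JARGON_DESCRIPTION = (
    "Our synergistic platform leverages next-generation paradigms "
    "to empower stakeholder-centric value creation."
)

CLARITY_JUDGE_PROMPT = f"""Score the CLARITY of this product description from 1 to 5:
1 = Cannot understand what the product does
3 = Understandable but vague
5 = Crystal clear to any reader

Description: "{JARGON_DESCRIPTION}"

Respond with only the integer score."""

async def judge_clarity(prompt: str) -> int:
    """Ask the judge model for a single 1-5 clarity score."""
    from openai import AsyncOpenAI  # assumes the openai package is installed
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic judging
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    print(asyncio.run(judge_clarity(CLARITY_JUDGE_PROMPT)))
```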
That jargon-heavy description should score low — probably a 1 or 2. The judge is not guessing. It is applying the rubric you defined. The quality of your rubric directly controls the quality of your evaluation.
Designing an Evaluation Rubric
A rubric is the contract between you and the judge. It defines every criterion, every score level, and every boundary. I have seen teams skip this step and go straight to "rate this output 1-10" — the scores end up meaningless because the judge has no anchor for what a 4 versus a 7 means.
A strong rubric has three properties. First, each criterion is independent — you can score clarity without thinking about tone. Second, each score level has a concrete description — not just "good" and "bad" but what specifically makes a 3 different from a 4. Third, the rubric includes examples at each level so the judge has reference points.
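Here is one way to encode such a rubric as a plain dictionary, using the dict-with-a-"criteria"-key format this tutorial relies on (criterion names and anchor wording are my own illustrative choices; per-level examples are omitted here for brevity):

```python
# A three-criterion rubric for product descriptions (wording is illustrative)
rubric = {
    "criteria": [
        {
            "name": "clarity",
            "description": "Can a reader tell what the product does?",
            "scale": {
                1: "Cannot understand what the product does",
                2: "Main idea present but buried in filler",
                3: "Understandable but vague on specifics",
                4: "Clear, with minor ambiguity",
                5: "Crystal clear to any reader",
            },
        },
        {
            "name": "completeness",
            "description": "Does it cover the product's key attributes and differentiator?",
            "scale": {
                1: "Mentions nothing concrete about the product",
                2: "Covers one attribute, omits the differentiator",
                3: "Covers most attributes, differentiator only implied",
                4: "Covers all attributes, differentiator stated weakly",
                5: "Covers all key attributes and states the differentiator",
            },
        },
        {
            "name": "tone",
            "description": "Is the voice appropriate for the target customer?",
            "scale": {
                1: "Off-putting or wildly inconsistent voice",
                2: "Generic marketing filler",
                3: "Neutral but flat",
                4: "Engaging with minor lapses",
                5: "Consistently engaging and on-brand",
            },
        },
    ]
}
```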
The rubric above evaluates three independent dimensions. Each dimension has a 1-5 scale with specific anchors. When we pass this to the LLM judge, it scores each criterion separately rather than producing a single ambiguous "quality" number.
The next step is turning this rubric into a judge prompt that the LLM can follow consistently. The prompt needs to include the full rubric text, the output to evaluate, and explicit instructions for the response format.
The Judge Function — Scoring a Single Output
What surprised me when I first started doing this: the judge often catches quality differences that I missed during manual review. Descriptions I thought were fine would score 2/5 on completeness because they omitted the product's key differentiator. The rubric forces a level of scrutiny that human reviewers skip when they are tired.
The good description should consistently outscore the bad one on clarity and completeness. Tone might be closer — the jargon-heavy version is not offensive, just empty. This is exactly the kind of nuance that per-criterion scoring reveals. A single "overall quality" score would hide it.
One call is useful for spot-checking. But to compare prompt variants, you need to run the judge across a batch of test cases and aggregate the scores.
Create a function create_email_rubric() that returns a rubric dictionary for evaluating customer support email replies. The rubric must have exactly 2 criteria: "empathy" (does the reply acknowledge the customer's frustration?) and "actionability" (does the reply give clear next steps?). Each criterion needs a "name", "description", and "scale" dict with keys 1 through 5. The function should return the rubric in the same format as our rubric variable above (a dict with a "criteria" key containing a list).
Building the EvalPipeline Class
Calling judge_output manually works for a quick check. But a real evaluation needs structure: multiple prompt variants, multiple test inputs, aggregated scores, and comparison logic. The EvalPipeline class wraps all of this into a single interface.
The design is intentionally simple. You register prompt variants (each is a function that takes an input and returns a prompt string), register test cases, and call run(). The pipeline evaluates every variant against every test case, stores the results, and can compare any two variants with a built-in significance test.
We define the permutation test first because the EvalPipeline.compare() method depends on it. A permutation test is a statistical technique that requires no assumptions about how scores are distributed — unlike a t-test, which assumes a bell curve. I prefer it for LLM evaluation because scores on a 1-5 scale are discrete and rarely follow a normal distribution.
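A sketch of that test using only the standard library (the exact signature is my choice; any implementation that returns a p-value works):

```python
import random

def permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean scores.

    Returns the estimated probability of seeing a mean gap at least as
    large as the observed one if variant labels were assigned at random.
    """
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly relabel which scores belong to A vs B
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            hits += 1
    # The +1 correction keeps the estimate strictly positive
    return (hits + 1) / (n_permutations + 1)
```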
The class separates four concerns. Variants define how you prompt the model. Test cases define what inputs to evaluate on. Results store the scores for downstream analysis. Comparison runs statistical tests between any two variants. This separation means you can swap rubrics, add test cases, or change the judge model without rewriting anything.
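A minimal sketch of such a class. To keep it self-contained, the generation call, the judge call, and the significance test are injected as callables rather than hard-coded (that dependency-injection choice is mine, not a requirement):

```python
from statistics import mean

class EvalPipeline:
    """Evaluate prompt variants against test cases with an LLM judge.

    generate_fn(prompt) -> str and judge_fn(rubric, output) -> dict are
    async callables you supply; significance_test(a, b) -> float is a
    plain callable such as a permutation test.
    """

    def __init__(self, rubric, generate_fn, judge_fn, significance_test):
        self.rubric = rubric
        self.generate = generate_fn
        self.judge = judge_fn
        self.significance_test = significance_test
        self.variants = {}    # name -> function: input -> prompt string
        self.test_cases = []  # raw inputs to evaluate on
        self.results = {}     # name -> list of {criterion: score} dicts

    def add_variant(self, name, prompt_fn):
        self.variants[name] = prompt_fn

    def add_test_case(self, case):
        self.test_cases.append(case)

    async def run(self):
        """Evaluate every variant on every test case, sequentially."""
        for name, prompt_fn in self.variants.items():
            scores = []
            for case in self.test_cases:
                output = await self.generate(prompt_fn(case))
                scores.append(await self.judge(self.rubric, output))
            self.results[name] = scores
        return self.results

    def compare(self, name_a, name_b, criterion):
        """Statistically compare two variants on a single criterion."""
        a = [s[criterion] for s in self.results[name_a]]
        b = [s[criterion] for s in self.results[name_b]]
        return {
            "mean_a": mean(a),
            "mean_b": mean(b),
            "p_value": self.significance_test(a, b),
        }
```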
Running a Prompt Comparison
This is where it gets interesting. I remember the first time I ran a pipeline like this on a real project — the "obviously better" prompt that the entire team preferred actually scored lower on completeness. The data contradicted our intuition, and the data was right.
We will define two prompt variants for generating product descriptions, run them against three test products, and compare their scores. One variant focuses on features, the other on benefits.
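The two variants might look like this (function names, wording, and the test products are illustrative choices):

```python
def feature_prompt(product: str) -> str:
    """Variant A: lead with concrete features and specs."""
    return (
        f"Write a 2-sentence product description for {product}. "
        "Focus on its concrete features and technical specifications."
    )

def benefit_prompt(product: str) -> str:
    """Variant B: lead with what the customer gains."""
    return (
        f"Write a 2-sentence product description for {product}. "
        "Focus on the benefits the customer gets from using it."
    )

test_products = [
    "noise-cancelling headphones",
    "an ergonomic office chair",
    "a smart thermostat",
]

# Registration with the pipeline might look like:
#   pipeline.add_variant("feature_focused", feature_prompt)
#   pipeline.add_variant("benefit_focused", benefit_prompt)
#   for product in test_products:
#       pipeline.add_test_case(product)
```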
Six evaluations total — two variants times three test cases. Each evaluation involves one generation call and one judge call, so twelve API calls in total. For larger pipelines, you would run these concurrently, but sequential execution is easier to debug.
Look at the per-criterion scores, not just the averages. A variant that scores 5 on clarity but 2 on completeness tells a different story than one that scores 3 across the board. The first has a fixable problem; the second might need a complete rewrite.
Write a function storytelling_prompt(product) that takes a product name string and returns a prompt string. The prompt should instruct the LLM to write a 2-sentence product description using a storytelling approach — describing a scenario where someone uses the product. The function must return a string that contains both the word "story" (or "storytelling") and the product name.
Analyzing Results with Statistical Significance
Variant A averaged 3.8 and variant B averaged 4.1. Is B actually better, or did it just get lucky on a few test cases? This is the same question that runs through every A/B test in the industry, and I have learned the hard way that eyeballing averages leads to wrong conclusions about half the time.
The compare() method we built into the pipeline runs a permutation test per criterion. A p-value below 0.05 means the score difference is unlikely to be random noise. But statistical significance alone is not enough — a difference of 0.1 on a 5-point scale might be "real" but too small to care about.
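To see the mechanics on known numbers, here are synthetic scores for ten test cases per variant, run through the same relabeling logic (the score lists are invented for illustration):

```python
import random

# Synthetic per-test-case scores for two variants (illustrative numbers)
scores_a = [3, 3, 4, 3, 3, 4, 3, 4, 3, 3]   # mean 3.3
scores_b = [4, 5, 4, 5, 4, 4, 5, 4, 5, 4]   # mean 4.4

rng = random.Random(0)
observed = abs(sum(scores_a) / 10 - sum(scores_b) / 10)   # 1.1
pooled = scores_a + scores_b

# Count how often random relabelings produce a gap at least this large
hits = 0
for _ in range(10_000):
    rng.shuffle(pooled)
    hits += abs(sum(pooled[:10]) / 10 - sum(pooled[10:]) / 10) >= observed
p_value = (hits + 1) / 10_001
print(p_value)  # well below 0.05
```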
With those synthetic scores (A averaging 3.3, B averaging 4.4), the difference of 1.1 points should be highly significant — a p-value well below 0.05. The permutation test confirms this gap is extremely unlikely to occur by random chance.
What does the p-value actually mean? A p-value of 0.03 means: "If there were truly no difference between these two prompts, we would see a gap this large only 3% of the time." The standard threshold is 0.05 — below that, the difference is statistically significant.
Here is the comparison applied to our actual pipeline results, using the pipeline.compare() method.
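Your scores will differ from run to run, so here is an illustrative stand-in for one criterion: three hand-picked clarity scores per variant, put through the same relabeling logic that compare() applies. With three scores per side there are only twenty possible relabelings, so we can enumerate them exactly instead of sampling:

```python
from itertools import combinations

# Three clarity scores per variant (numbers invented for illustration)
clarity_a = [3, 4, 3]   # mean 3.33
clarity_b = [4, 4, 5]   # mean 4.33

pooled = clarity_a + clarity_b
# Groups are the same size, so comparing sum gaps equals comparing mean gaps,
# and integer sums avoid floating-point equality problems.
observed_gap = abs(sum(clarity_a) - sum(clarity_b))

# Enumerate all C(6, 3) = 20 ways to relabel which scores belong to A
hits = 0
for idx in combinations(range(6), 3):
    group_a = sum(pooled[i] for i in idx)
    group_b = sum(pooled) - group_a
    hits += abs(group_a - group_b) >= observed_gap
p_value = hits / 20
print(p_value)  # 0.3, nowhere near significant
```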
With only three test cases, the p-values will likely be high — not significant. This is expected. Three data points give very low statistical power. In a real evaluation with 20+ test cases, these same score differences would likely cross the significance threshold.
Generating a Full Report
I always produce a report after each pipeline run — partly for my own decision-making, partly so I have a record to look back at when someone asks "why did we choose this prompt?" three months later. Here is the reporting function.
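A sketch of one such function, assuming results are stored as the pipeline stores them, variant name mapped to a list of per-criterion score dicts (the plain-text layout is my own choice):

```python
from statistics import mean

def generate_report(results: dict, criteria: list[str]) -> str:
    """Render pipeline results as a plain-text report.

    `results` maps variant name -> list of {criterion: score} dicts,
    one dict per test case.
    """
    lines = ["PROMPT EVALUATION REPORT", "=" * 40]
    for variant, scores in results.items():
        lines.append(f"\nVariant: {variant}  ({len(scores)} test cases)")
        for criterion in criteria:
            values = [s[criterion] for s in scores]
            lines.append(
                f"  {criterion:<14} mean {mean(values):.2f}  "
                f"min {min(values)}  max {max(values)}"
            )
    return "\n".join(lines)
```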
Common Mistakes and How to Fix Them
```python
# BAD: No anchor points, judge interprets freely
judge_prompt = """Rate this text quality from 1 to 10.
Text: "Our product is great."
Score:"""
```

```python
# GOOD: Each score level has a concrete meaning
judge_prompt = """Score CLARITY from 1 to 5:
1 = Cannot understand what the product does
3 = Understandable but vague
5 = Crystal clear to any reader
Text: "Our product is great."
Score:"""
```

Without anchors, one LLM call might rate "good" as 7 and another as 5. The rubric removes that ambiguity.
```python
# BAD: Random scoring
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,  # high randomness
)
```

```python
# GOOD: Deterministic scoring
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # consistent results
)
```

High temperature during judging introduces random noise into your scores. The whole point of an evaluation pipeline is to remove subjectivity — do not add randomness back in.
A third mistake I see regularly in code reviews: using the same three test cases every time. Your pipeline will overfit to those cases. Rotate test cases, include edge cases (very short inputs, ambiguous requests, adversarial inputs), and periodically add new ones from real user queries.
Write a function is_significant(scores_a, scores_b) that takes two lists of integer scores and returns a dictionary with three keys: "mean_a" (float, mean of scores_a), "mean_b" (float, mean of scores_b), and "significant" (bool, True if the absolute difference in means is >= 0.5 AND the permutation test p-value < 0.05). Use the permutation_test function already defined above. This combines both practical and statistical significance into one check.
When NOT to Use LLM-as-Judge
I learned this the hard way on a medical Q&A project. Our judge gave high clarity scores to answers that were factually wrong but beautifully written. LLM-as-Judge is powerful but not universal — here are the specific scenarios where it will mislead you.
Factual accuracy. LLMs are not reliable fact-checkers. If your prompt generates medical advice, legal information, or numerical calculations, you need ground-truth validation — not LLM scoring. A judge model can confidently rate a factually wrong answer as "clear and well-structured."
Code generation. For evaluating generated code, run the code against test cases. Execution-based evaluation is objective and precise. An LLM judge might rate broken code as "looks correct" because the syntax appears reasonable.
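As a minimal illustration of execution-based checking (the `solution` function name and the test-case format are assumptions for this sketch):

```python
def passes_tests(generated_code: str, cases: list[tuple]) -> bool:
    """Execute generated code and check it against ground-truth test cases."""
    namespace: dict = {}
    exec(generated_code, namespace)   # run the generated source
    fn = namespace["solution"]        # assume the code defines solution()
    return all(fn(*args) == expected for args, expected in cases)
```

For example, `passes_tests("def solution(x): return x * 2", [((3,), 6)])` passes, while broken code fails objectively, regardless of how plausible it looks.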
Highly subjective criteria. If your rubric criterion is "Is this creative?" even human evaluators disagree. LLM judges produce consistent but potentially arbitrary scores on subjective dimensions. Use LLM judges for criteria where humans agree at least 80% of the time.
Frequently Asked Questions
How many test cases do I need for reliable results?
At minimum 20-30 test cases per variant. With fewer than 20, the permutation test has low statistical power — it cannot detect real differences unless they are very large. For production evaluations, 50-100 test cases sampled from actual user inputs give the most trustworthy results.
Can I use the same model as both generator and judge?
You can, but there is a known bias: models tend to prefer their own outputs. If you are evaluating GPT-4o-mini outputs, judging with GPT-4o-mini will slightly inflate scores compared to judging with Claude or Gemini. For critical decisions, use a judge from a different model family or a stronger model in the same family.
How do I evaluate prompts that produce long outputs?
Long outputs strain the judge's context window and reduce scoring accuracy. Two solutions: chunk the output and score each chunk separately then average, or add a "conciseness" criterion to the rubric that penalizes unnecessary length. The second approach also incentivizes the generator to be succinct.
What if my judge keeps returning malformed JSON?
Three fixes in order of reliability. First, add a JSON example to the judge prompt (we did this). Second, set temperature=0.0 — higher temperatures cause more formatting errors. Third, use structured output if your provider supports it (response_format={"type": "json_object"} on OpenAI). If the problem persists, wrap the parse in a try/except and retry once.
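A sketch of that retry wrapper, assuming an AsyncOpenAI-style client is passed in (the function name and `client` parameter are my own):

```python
import json

async def judge_with_retry(client, judge_prompt: str, retries: int = 1) -> dict:
    """Call the judge model; retry once if the response is not valid JSON."""
    for attempt in range(retries + 1):
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": judge_prompt}],
            temperature=0.0,
            response_format={"type": "json_object"},  # structured output (OpenAI)
        )
        try:
            return json.loads(response.choices[0].message.content)
        except json.JSONDecodeError:
            if attempt == retries:
                raise  # give up after the final retry
```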