Build an Automated Prompt Evaluation Pipeline with LLM-as-Judge
You rewrote your prompt three times this week. Each version "felt" better, but you have no data to prove it. Your team picks the prompt that the most senior person likes, and nobody revisits the decision. This is how most teams ship prompts today — on vibes.
By the end of this tutorial, you will have a reusable EvalPipeline class that scores prompt variants against a rubric, runs them across test cases, and tells you which variant is statistically better — not just anecdotally better.
Why Prompt Evaluation Matters
I have watched teams iterate on prompts for weeks, only to ship the version that happened to produce one good output during a live demo. The fundamental problem is straightforward: without a systematic way to measure prompt quality, every decision is subjective.
Consider a prompt that generates product descriptions. Version A is concise. Version B adds emotional appeal. Version C includes technical specs. Which is "best"? That depends entirely on what you are optimizing for — and you cannot optimize what you do not measure.
Here is a minimal example using the OpenAI API. We will give the LLM a product description and ask it to score the description on a 1-5 scale for clarity.
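The sketch below shows the shape of that call. The prompt wording, the `score_clarity` name, and the jargon example in the usage comment are illustrative, not a fixed API — adapt them to your own domain.

```python
import json

# The judge prompt anchors each score level so the model is not guessing.
CLARITY_JUDGE_PROMPT = """Score the following product description for CLARITY on a 1-5 scale:
1 = cannot tell what the product does
3 = understandable but vague
5 = immediately clear to any reader

Description: {description}

Respond with JSON only: {{"score": <1-5>, "reasoning": "<one sentence>"}}"""

async def score_clarity(client, description: str, model: str = "gpt-4o-mini") -> dict:
    """Ask the judge model for a clarity score on one description."""
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": CLARITY_JUDGE_PROMPT.format(description=description)}],
        temperature=0.0,  # deterministic scoring
    )
    return json.loads(response.choices[0].message.content)

# Usage, assuming `client = AsyncOpenAI()` from the openai package:
# result = asyncio.run(score_clarity(client, "Our synergistic solution leverages best-in-class paradigms."))
```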
That jargon-heavy description should score low — probably a 1 or 2. The judge is not guessing. It is applying the rubric you defined. The quality of your rubric directly controls the quality of your evaluation.
Designing an Evaluation Rubric
A rubric is the contract between you and the judge. It defines every criterion, every score level, and every boundary. I have seen teams skip this step and go straight to "rate this output 1-10" — the scores end up meaningless because the judge has no anchor for what a 4 versus a 7 means.
A strong rubric has three properties. First, each criterion is independent — you can score clarity without thinking about tone. Second, each score level has a concrete description — not just "good" and "bad" but what specifically makes a 3 different from a 4. Third, the rubric includes examples at each level so the judge has reference points.
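Here is a rubric with those properties, covering the three dimensions used throughout this tutorial. The anchor wording is illustrative — tune it to your own product domain:

```python
# A three-criterion rubric: each criterion is independent and each score
# level has a concrete anchor.
rubric = {
    "criteria": [
        {
            "name": "clarity",
            "description": "How easy is it to understand what the product does?",
            "scale": {
                1: "Cannot tell what the product is or does",
                2: "Product type is clear but the description is confusing",
                3: "Understandable but vague on specifics",
                4: "Clear, with one or two fuzzy phrases",
                5: "Immediately clear to any reader",
            },
        },
        {
            "name": "completeness",
            "description": "Does the description cover what a buyer needs to know?",
            "scale": {
                1: "No concrete information at all",
                2: "Mentions one relevant detail",
                3: "Covers some key points, misses others",
                4: "Covers most key points",
                5: "Covers every key point a buyer would ask about",
            },
        },
        {
            "name": "tone",
            "description": "Is the tone appropriate for a product page?",
            "scale": {
                1: "Off-putting or wildly inappropriate",
                2: "Noticeably awkward or robotic",
                3: "Acceptable but flat",
                4: "Engaging with minor lapses",
                5: "Engaging and consistent throughout",
            },
        },
    ]
}
```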
The rubric above evaluates three independent dimensions. Each dimension has a 1-5 scale with specific anchors. When we pass this to the LLM judge, it scores each criterion separately rather than producing a single ambiguous "quality" number.
The next step is turning this rubric into a judge prompt that the LLM can follow consistently. The prompt needs to include the full rubric text, the output to evaluate, and explicit instructions for the response format. We ask the judge to respond in JSON — the same approach covered in the structured output tutorial.
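One way to build that prompt is sketched below. The exact phrasing is an assumption — what matters is that every criterion, every anchor, and the required JSON shape appear in the prompt:

```python
import json

def build_judge_prompt(rubric: dict, output_text: str) -> str:
    """Render every criterion, its scale anchors, and the text to score."""
    lines = ["You are an evaluation judge. Score the text on each criterion below."]
    for criterion in rubric["criteria"]:
        lines.append(f"\n{criterion['name'].upper()}: {criterion['description']}")
        for level in sorted(criterion["scale"]):
            lines.append(f"  {level} = {criterion['scale'][level]}")
    # Show the judge exactly the JSON shape we will parse later.
    names = [c["name"] for c in rubric["criteria"]]
    example = {"scores": {name: 3 for name in names},
               "reasoning": "one sentence explaining the scores"}
    lines.append("\nText to evaluate:\n" + output_text)
    lines.append("\nRespond with JSON only, in exactly this shape:\n" + json.dumps(example))
    return "\n".join(lines)
```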
The Judge Function — Scoring a Single Output
The judge_output function wraps the full scoring workflow into a single async call. It takes the OpenAI client, your rubric dictionary, and the text to evaluate. Internally it builds the judge prompt with build_judge_prompt(), sends it to the LLM at temperature=0.0 for deterministic scoring, strips any markdown code fences the model might add around its JSON, and parses the result into a Python dictionary with scores and reasoning keys.
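A sketch of that function, assuming `build_judge_prompt()` from the previous section and an AsyncOpenAI-compatible client. The good/bad descriptions in the usage comment are illustrative:

```python
import json

def strip_code_fences(text: str) -> str:
    """Remove the ```json ... ``` fences some models wrap around JSON."""
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return text.strip()

async def judge_output(client, rubric: dict, output_text: str,
                       model: str = "gpt-4o-mini") -> dict:
    """Score one output against the rubric.

    Returns a dict with "scores" (per-criterion ints) and "reasoning" keys.
    """
    prompt = build_judge_prompt(rubric, output_text)  # defined earlier
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic scoring
    )
    return json.loads(strip_code_fences(response.choices[0].message.content))

# Usage sketch:
# good = "Wireless earbuds with 30-hour battery life and active noise cancellation."
# bad = "Our synergistic solution leverages best-in-class paradigms."
# print(asyncio.run(judge_output(client, rubric, good)))
# print(asyncio.run(judge_output(client, rubric, bad)))
```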
The good description should consistently outscore the bad one on clarity and completeness. Tone might be closer — the jargon-heavy version is not offensive, just empty. This is exactly the kind of nuance that per-criterion scoring reveals. A single "overall quality" score would hide it.
One call is useful for spot-checking. But to compare prompt variants, you need to run the judge across a batch of test cases and aggregate the scores.
Create a function create_email_rubric() that returns a rubric dictionary for evaluating customer support email replies. The rubric must have exactly 2 criteria: "empathy" (does the reply acknowledge the customer's frustration?) and "actionability" (does the reply give clear next steps?). Each criterion needs a "name", "description", and "scale" dict with keys 1 through 5. The function should return the rubric in the same format as our rubric variable above (a dict with a "criteria" key containing a list).
Building the EvalPipeline Class
Calling judge_output manually works for a quick check. But a real evaluation needs structure: multiple prompt variants, multiple test inputs, aggregated scores, and comparison logic. The EvalPipeline class wraps all of this into a single interface.
Before building the class, we need a statistical helper. The permutation_test function tells us whether the score difference between two prompt variants is real or random noise. It pools all scores from both variants, randomly shuffles them into two groups 10,000 times, and counts how often the shuffled difference is as large as the observed one. That fraction is the p-value. I prefer permutation tests over t-tests for LLM evaluation because scores on a 1-5 scale are discrete and rarely follow a normal distribution.
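A minimal stdlib implementation of that helper (the fixed seed is an assumption, added here so runs are reproducible):

```python
import random

def permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns the fraction of random relabelings whose mean gap is at
    least as large as the observed one -- the p-value.
    """
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling of the pooled scores
        group_a, group_b = pooled[:n_a], pooled[n_a:]
        gap = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
        if gap >= observed:
            hits += 1
    return hits / n_permutations
```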
Now for the pipeline class itself. EvalPipeline orchestrates the entire evaluation workflow. You initialize it with a client, rubric, and judge model. Register prompt variants with add_variant() — each variant is a function that takes an input string and returns a prompt. Register test inputs with add_test_case(). When you call run(), it generates output from every variant for every test case, then judges each output against the rubric. summarize() computes mean scores, and compare() runs permutation tests between any two variants.
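The sketch below assumes `judge_output()` and `permutation_test()` from the earlier sections; method names match the workflow just described:

```python
from statistics import mean

class EvalPipeline:
    """Runs every variant against every test case and judges the outputs."""

    def __init__(self, client, rubric, judge_model="gpt-4o-mini"):
        self.client = client
        self.rubric = rubric
        self.judge_model = judge_model
        self.variants = {}    # name -> function(input) -> prompt string
        self.test_cases = []  # raw input strings
        self.results = {}     # name -> list of per-criterion score dicts

    def add_variant(self, name, prompt_fn):
        self.variants[name] = prompt_fn

    def add_test_case(self, input_text):
        self.test_cases.append(input_text)

    async def run(self, gen_model="gpt-4o-mini"):
        for name, prompt_fn in self.variants.items():
            self.results[name] = []
            for input_text in self.test_cases:
                # One generation call per (variant, test case) pair...
                gen = await self.client.chat.completions.create(
                    model=gen_model,
                    messages=[{"role": "user", "content": prompt_fn(input_text)}],
                )
                output = gen.choices[0].message.content
                # ...followed by one judge call on the generated output.
                judged = await judge_output(
                    self.client, self.rubric, output, model=self.judge_model
                )
                self.results[name].append(judged["scores"])
        return self.results

    def summarize(self):
        """Mean score per criterion for each variant."""
        return {
            name: {crit: mean(r[crit] for r in runs) for crit in runs[0]}
            for name, runs in self.results.items()
        }

    def compare(self, name_a, name_b):
        """Per-criterion p-values between two variants."""
        runs_a, runs_b = self.results[name_a], self.results[name_b]
        return {
            crit: permutation_test([r[crit] for r in runs_a],
                                   [r[crit] for r in runs_b])
            for crit in runs_a[0]
        }
```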
The class separates four concerns. Variants define how you prompt the model. Test cases define what inputs to evaluate on. Results store the scores for downstream analysis. Comparison runs statistical tests between any two variants. This separation means you can swap rubrics, add test cases, or change the judge model without rewriting anything.
Running a Prompt Comparison
This is where it gets interesting. The prompt your team "knows" is best might score lower on completeness than the version nobody championed. That is exactly why we measure instead of vote — data overrides intuition.
We will define two prompt variants for generating product descriptions, run them against three test products, and compare their scores. One variant focuses on features, the other on benefits.
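A sketch of the two variants and the wiring. The exact prompt wording and product names are illustrative choices, not requirements:

```python
def feature_prompt(product: str) -> str:
    return (
        f"Write a 2-sentence product description for {product}. "
        "Focus on its concrete features and specifications."
    )

def benefit_prompt(product: str) -> str:
    return (
        f"Write a 2-sentence product description for {product}. "
        "Focus on the benefits the buyer will feel, not the feature list."
    )

# Wiring it up, assuming EvalPipeline, rubric, and an AsyncOpenAI client
# from the earlier sections:
# pipeline = EvalPipeline(client, rubric)
# pipeline.add_variant("features", feature_prompt)
# pipeline.add_variant("benefits", benefit_prompt)
# for product in ["noise-cancelling headphones", "standing desk", "smart thermostat"]:
#     pipeline.add_test_case(product)
# results = asyncio.run(pipeline.run())
```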
Six evaluations total — two variants times three test cases. Each evaluation involves one generation call and one judge call, so twelve API calls in total. For larger pipelines, you would run these concurrently, but sequential execution is easier to debug.
Look at the per-criterion scores, not just the averages. A variant that scores 5 on clarity but 2 on completeness tells a different story than one that scores 3 across the board. The first has a fixable problem; the second might need a complete rewrite.
Write a function storytelling_prompt(product) that takes a product name string and returns a prompt string. The prompt should instruct the LLM to write a 2-sentence product description using a storytelling approach — describing a scenario where someone uses the product. The function must return a string that contains both the word "story" (or "storytelling") and the product name.
Analyzing Results with Statistical Significance
Variant A averaged 3.8 and variant B averaged 4.1. Is B actually better, or did it just get lucky on a few test cases? Eyeballing averages leads to wrong conclusions more often than you would expect — which is exactly why we built the permutation test.
Before using it on real pipeline results, let us verify the test works on synthetic data where we know the answer. The two lists below have a clear gap: A averages 3.3 and B averages 4.4. The permutation test should confirm this difference is real with a very low p-value (well below 0.05).
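The synthetic scores (illustrative values chosen to hit those exact means):

```python
# Synthetic scores with a deliberately large gap, for sanity-checking
# the permutation test before trusting it on real results.
scores_a = [3, 3, 4, 3, 4, 3, 3, 4, 3, 3]  # mean 3.3
scores_b = [4, 5, 4, 5, 4, 4, 5, 4, 5, 4]  # mean 4.4

# With permutation_test from the previous section:
#   print(permutation_test(scores_a, scores_b))  # expect well below 0.05
```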
With those synthetic scores, the permutation test should return a p-value near zero — confirming the gap is extremely unlikely to occur by chance. A p-value of 0.03 means: "If there were truly no difference between these two prompts, we would see a gap this large only 3% of the time." Below 0.05 is the standard threshold for statistical significance.
Now let us apply the test to our actual pipeline results. The output table shows, for each rubric criterion: the mean score for each variant, the score difference, the p-value from the permutation test, and an asterisk (*) marking any criterion where p < 0.05.
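One way to produce that table is sketched below. The `comparison_rows` helper name and column layout are assumptions; it consumes the `summarize()` and `compare()` outputs from the pipeline class:

```python
def comparison_rows(summary, p_values, name_a, name_b):
    """One tuple per criterion: (criterion, mean_a, mean_b, diff, p, significant)."""
    return [
        (
            crit,
            summary[name_a][crit],
            summary[name_b][crit],
            summary[name_b][crit] - summary[name_a][crit],
            p,
            p < 0.05,
        )
        for crit, p in p_values.items()
    ]

def print_comparison(pipeline, name_a, name_b):
    """Print the per-criterion comparison table, marking p < 0.05 with *."""
    rows = comparison_rows(
        pipeline.summarize(), pipeline.compare(name_a, name_b), name_a, name_b
    )
    print(f"{'criterion':<15}{name_a:>10}{name_b:>10}{'diff':>8}{'p-value':>9}")
    for crit, mean_a, mean_b, diff, p, sig in rows:
        mark = " *" if sig else ""
        print(f"{crit:<15}{mean_a:>10.2f}{mean_b:>10.2f}{diff:>+8.2f}{p:>9.3f}{mark}")
```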
With only three test cases, the p-values will likely be high — not significant. This is expected. Three data points give very low statistical power. In a real evaluation with 20+ test cases, these same score differences would likely cross the significance threshold.
Generating a Full Report
I always produce a report after each pipeline run — partly for my own decision-making, partly so I have a record when someone asks "why did we choose this prompt?" three months later. The generate_report function below prints a summary table with mean scores per criterion for each variant, an overall average row, and — when there are exactly two variants — a per-criterion significance test with p-values.
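A sketch of the report function, assuming the `EvalPipeline` interface from the earlier sections (`summarize()` and `compare()`):

```python
from statistics import mean

def generate_report(pipeline):
    """Print mean scores per criterion, an overall row, and -- for
    exactly two variants -- per-criterion significance tests."""
    summary = pipeline.summarize()
    names = list(summary)
    criteria = list(next(iter(summary.values())))
    # Header row: one column per variant.
    print(f"{'criterion':<15}" + "".join(f"{n:>12}" for n in names))
    for crit in criteria:
        print(f"{crit:<15}" + "".join(f"{summary[n][crit]:>12.2f}" for n in names))
    # Overall average across all criteria.
    print(f"{'OVERALL':<15}" + "".join(
        f"{mean(summary[n][c] for c in criteria):>12.2f}" for n in names))
    if len(names) == 2:
        print("\nSignificance (permutation test):")
        for crit, p in pipeline.compare(*names).items():
            mark = " *" if p < 0.05 else ""
            print(f"  {crit}: p = {p:.3f}{mark}")
```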
Common Mistakes and How to Fix Them
```python
# BAD: No anchor points, judge interprets freely
judge_prompt = """Rate this text quality from 1 to 10.
Text: "Our product is great."
Score:"""
```

```python
# GOOD: Each score level has a concrete meaning
judge_prompt = """Score CLARITY from 1 to 5:
1 = Cannot understand what the product does
3 = Understandable but vague
5 = Crystal clear to any reader
Text: "Our product is great."
Score:"""
```

Without anchors, one LLM call might rate "good" as 7 and another as 5. The rubric removes that ambiguity.
```python
# BAD: Random scoring
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,  # high randomness
)
```

```python
# GOOD: Deterministic scoring
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # consistent results
)
```

High temperature during judging introduces random noise into your scores. The whole point of an evaluation pipeline is to remove subjectivity — do not add randomness back in.
Position Bias and Pairwise Comparison
Our pipeline uses single-output scoring — each output is judged independently against the rubric. There is a second approach called pairwise comparison, where the judge sees two outputs side-by-side and picks the better one. Pairwise is easier for the judge (comparing is simpler than absolute scoring) but introduces a new problem: position bias. LLM judges measurably favor outputs based on where they appear in the prompt — often whichever comes first — so every pair must be judged in both orders and the two verdicts checked for agreement.
For most prompt evaluation work, I recommend the single-output scoring approach we built in this tutorial. It avoids position bias entirely, produces numeric scores you can aggregate and compare statistically, and scales cleanly to more than two variants. Pairwise comparison is better suited to ranking models against each other — which is why Chatbot Arena uses it. For tools that automate pairwise evaluation at scale, frameworks like Promptfoo and DeepEval are worth exploring.
Write a function is_significant(scores_a, scores_b) that takes two lists of integer scores and returns a dictionary with three keys: "mean_a" (float, mean of scores_a), "mean_b" (float, mean of scores_b), and "significant" (bool, True if the absolute difference in means is >= 0.5 AND the permutation test p-value < 0.05). Use the permutation_test function already defined above. This combines both practical and statistical significance into one check.
When NOT to Use LLM-as-Judge
LLM-as-Judge is powerful but not universal. A judge can confidently rate a factually wrong answer as "clear and well-structured" — which is exactly the kind of failure that burns you in production. Here are the specific scenarios where it will mislead you.
```python
# Subjective quality dimensions
"clarity"       # How easy is this to understand?
"tone"          # Is this appropriate for the audience?
"completeness"  # Does it cover the key points?
"helpfulness"   # Would a user find this useful?
```

```python
# Objective correctness dimensions
"factual_accuracy"   # Is this medically/legally correct?
"code_correctness"   # Does this code actually run?
"math_accuracy"      # Is 2+2 really 4 here?
"format_compliance"  # Does the JSON parse correctly?
```

Factual accuracy requires ground-truth validation, not LLM scoring. If your prompt generates medical advice, legal information, or numerical calculations, compare against verified answers. Code generation needs execution-based evaluation — run the code against test cases instead of asking a judge if it "looks correct." For detecting factual errors in LLM outputs, see the hallucination detection tutorial.
Frequently Asked Questions
How many test cases do I need for reliable results?
At minimum 20-30 test cases per variant. With fewer than 20, the permutation test has low statistical power — it cannot detect real differences unless they are very large. For production evaluations, 50-100 test cases sampled from actual user inputs give the most trustworthy results.
Can I use the same model as both generator and judge?
You can, but there is a known bias: models tend to prefer their own outputs. If you are evaluating GPT-4o-mini outputs, judging with GPT-4o-mini will slightly inflate scores compared to judging with Claude or Gemini. For critical decisions, use a judge from a different model family or a stronger model in the same family.
How do I evaluate prompts that produce long outputs?
Long outputs strain the judge's context window and reduce scoring accuracy. Two solutions: chunk the output and score each chunk separately then average, or add a "conciseness" criterion to the rubric that penalizes unnecessary length. The second approach also incentivizes the generator to be succinct.
What if my judge keeps returning malformed JSON?
Three fixes in order of reliability. First, add a JSON example to the judge prompt (we did this). Second, set temperature=0.0 — higher temperature causes more formatting errors. Third, use structured output if your provider supports it (response_format={"type": "json_object"} on OpenAI). If the problem persists, wrap the parse in a try/except and retry once.
What to Build Next
You now have a reusable pipeline that scores prompt variants, compares them statistically, and generates reports. Here are the natural next steps to deepen your evaluation practice.