LLM Provider Showdown: Benchmark GPT-4o vs Claude vs Gemini vs Llama on Your Own Tasks
Your team just asked which LLM provider to use for the product. You tested one prompt in ChatGPT and another in Claude, got decent answers from both, and someone wants a "data-driven recommendation." Gut feelings don't survive design reviews.
This tutorial builds a benchmarking framework that sends the same prompts to GPT-4o, Claude, Gemini, and Llama, measures response time and cost, and uses an LLM as an automated judge to score quality. By the end, a recommend_provider() function picks the best model for any task based on actual numbers.
Why Public Benchmarks Don't Tell You Enough
I spent a week reading MMLU scores and Chatbot Arena rankings before choosing a provider for a summarization project. The model that topped general benchmarks produced the worst summaries for our legal documents. Public benchmarks test academic tasks — your product has specific prompts, specific data, and specific quality standards.
A custom benchmark answers three questions no leaderboard can: How fast does each provider respond to your prompts? How good are the outputs for your use case? What does each response actually cost?
The Benchmark Result Structure
Before calling any API, define what data you want to collect. I've been burned by trying to retrofit measurement after the fact — you always forget something important, usually cost or token counts.
Eleven fields capture everything you need to compare providers. The prompt_category field matters more than it looks — it lets you discover that Claude might win at summarization while GPT-4o wins at code generation. A single "best provider" answer is almost always wrong.
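A minimal sketch of that template, with illustrative field names (the exact names aren't fixed by anything above — only the count and intent are):

```python
def make_result(provider: str, model: str, prompt_category: str, prompt: str) -> dict:
    """Blank result row with all eleven fields; measurement code fills in the rest."""
    return {
        "provider": provider,
        "model": model,
        "prompt_category": prompt_category,
        "prompt": prompt,
        "response": "",
        "latency_seconds": 0.0,
        "input_tokens": 0,
        "output_tokens": 0,
        "cost_usd": 0.0,
        "quality_score": 0,        # filled in later by the LLM judge
        "quality_reasoning": "",   # the judge's explanation
    }
```

Defining the row up front means every provider function returns the same shape, so the analysis code never has to special-case anyone.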
Designing Test Prompts That Actually Test Something
The prompts you benchmark with determine whether the results mean anything. I've seen teams benchmark with "Write a haiku about Python" and then wonder why production performance differed. Your test prompts should mirror actual use cases.
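A sketch of what that prompt set might look like — the categories and prompts here are placeholders; swap in real prompts pulled from your product logs:

```python
# Illustrative test prompts grouped by category. The categories should match
# your product's actual workloads, not generic demo tasks.
TEST_PROMPTS = [
    {"category": "summarization",
     "prompt": "Summarize the following contract clause in two sentences: ..."},
    {"category": "code_generation",
     "prompt": "Write a Python function that deduplicates a list while preserving order."},
    {"category": "extraction",
     "prompt": "Extract all dates and monetary amounts from this invoice text: ..."},
]
```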
Cost Tracking — Pricing Table and Calculator
Every provider prices differently, and the numbers change every few months. I keep a pricing dictionary in every LLM project — two minutes to update and it saves hours of "wait, how much did that cost?" later.
GPT-4o is roughly 25x more expensive than GPT-4o-mini for the same prompt. Gemini Flash is the cheapest cloud option. Ollama is free but requires your own hardware. These trade-offs are exactly why benchmarking matters.
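A sketch of the pricing dictionary. The rates below are illustrative and will drift — treat them as placeholders and update from each provider's pricing page before trusting any cost numbers:

```python
# Per-million-token prices in USD. These change every few months;
# update from the providers' pricing pages before each benchmark run.
PRICING = {
    "gpt-4o":        {"input": 2.50, "output": 10.00},
    "gpt-4o-mini":   {"input": 0.15, "output": 0.60},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
    "gemini-flash":  {"input": 0.10, "output": 0.40},
    "llama-local":   {"input": 0.00, "output": 0.00},  # Ollama: you pay in hardware, not tokens
}
```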
The cost function takes a model name and token counts, looks up the rates, and multiplies. It returns 0.0 for unknown models rather than crashing — important when you add a new model mid-benchmark.
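A minimal version of that function. It assumes a `PRICING` dict keyed by model name with per-million-token `input`/`output` rates; the tiny table inside the block is illustrative so the sketch runs standalone:

```python
# Illustrative rates; in the real project this is the full pricing table.
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one call. Unknown models cost 0.0 instead of raising,
    so adding a new model mid-benchmark never crashes the run."""
    rates = PRICING.get(model)
    if rates is None:
        return 0.0
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
```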
Benchmarking Each Provider
This is where the framework earns its keep. One function per provider sends a prompt, captures the response, measures latency, extracts token usage, and calculates cost. Each returns a dictionary matching our result template.
Notice temperature=0.0. This makes responses as deterministic as possible, which is what you want for a fair comparison. If you're benchmarking creative writing, bump it to 0.7 — but keep it consistent across all providers.
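Here is a sketch of the OpenAI variant. The `client` parameter is assumed to be an `openai.AsyncOpenAI` instance (passing it in rather than constructing it inside keeps the function testable); cost calculation is left as a hook for your own pricing function:

```python
import time

BENCHMARK_TEMP = 0.0  # identical across all providers for a fair comparison

async def benchmark_openai(client, model: str, prompt: str, category: str) -> dict:
    """Send one prompt, measure wall-clock latency, and record token usage."""
    start = time.perf_counter()
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=BENCHMARK_TEMP,
    )
    latency = time.perf_counter() - start
    return {
        "provider": "openai",
        "model": model,
        "prompt_category": category,
        "prompt": prompt,
        "response": response.choices[0].message.content,
        "latency_seconds": round(latency, 3),
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "cost_usd": 0.0,  # plug in your cost function here
    }
```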
The Anthropic benchmark follows the same pattern with two API shape differences: Claude uses messages.create() instead of chat.completions.create(), and token usage lives in response.usage.input_tokens rather than response.usage.prompt_tokens.
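A sketch of the Anthropic variant, showing those two shape differences. The `client` is assumed to be an `anthropic.AsyncAnthropic` instance; note that the Messages API also requires `max_tokens`:

```python
import time

async def benchmark_anthropic(client, model: str, prompt: str, category: str) -> dict:
    """Same measurement as the OpenAI version, Anthropic API shape."""
    start = time.perf_counter()
    response = await client.messages.create(
        model=model,
        max_tokens=1024,  # required by the Anthropic API
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    latency = time.perf_counter() - start
    return {
        "provider": "anthropic",
        "model": model,
        "prompt_category": category,
        "prompt": prompt,
        "response": response.content[0].text,
        "latency_seconds": round(latency, 3),
        "input_tokens": response.usage.input_tokens,    # not prompt_tokens
        "output_tokens": response.usage.output_tokens,  # not completion_tokens
        "cost_usd": 0.0,  # plug in your cost function here
    }
```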
Running the Full Benchmark Suite
Individual provider functions are ready. The runner loops through all prompts and all providers, collecting results into a flat list. Every row is one prompt-provider pair, which makes analysis straightforward.
The try/except block is essential. If one API key is wrong or a provider is down, the benchmark still finishes. I learned this after losing 20 minutes of Claude and Gemini results because an OpenAI rate limit killed the whole run.
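A sketch of that runner. It assumes `provider_fns` maps a provider name to an async function taking `(prompt, category)` — in practice you'd bind the client and model with `functools.partial` before passing the per-provider functions in:

```python
async def run_benchmark(prompts: list[dict], provider_fns: dict) -> list[dict]:
    """Run every prompt against every provider; one failure never kills the run."""
    results = []
    for item in prompts:
        for name, fn in provider_fns.items():
            try:
                results.append(await fn(item["prompt"], item["category"]))
            except Exception as exc:
                # Record the failure and keep going — a wrong API key for one
                # provider shouldn't destroy everyone else's results.
                results.append({
                    "provider": name,
                    "model": "error",  # sentinel: analysis code skips these rows
                    "prompt_category": item["category"],
                    "prompt": item["prompt"],
                    "response": str(exc),
                    "latency_seconds": 0.0,
                    "input_tokens": 0,
                    "output_tokens": 0,
                    "cost_usd": 0.0,
                })
    return results
```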
Analyzing Speed and Cost
Raw results need summarization. These functions compute averages per provider and per category — pure Python, no pandas required.
Averages hide important differences though. A provider might be fast on short prompts and slow on complex ones. The per-category view reveals these patterns.
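One small grouping helper covers both views — group by provider for the overall picture, by category for the detailed one. A sketch:

```python
from collections import defaultdict

def average_by(results: list[dict], key: str, metric: str) -> dict:
    """Mean of `metric` grouped by `key`, skipping errored rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in results:
        if r.get("model") == "error":
            continue
        sums[r[key]] += r[metric]
        counts[r[key]] += 1
    return {k: round(sums[k] / counts[k], 4) for k in sums}
```

Usage: `average_by(results, "provider", "latency_seconds")` for the overall speed ranking, `average_by(results, "prompt_category", "cost_usd")` for per-category cost.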
Write a function format_comparison(results) that takes a list of benchmark result dictionaries and returns a formatted string comparing providers.
The function should:
1. Group results by provider
2. Calculate the average latency for each provider
3. Return a multi-line string with one line per provider: "provider_name: avg_latency=X.XXs count=N"
4. Sort lines alphabetically by provider name
Each result dict has at minimum: "provider" (str), "latency_seconds" (float), "model" (str). Skip results where model == "error".
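One possible solution to the exercise, following the four requirements above:

```python
from collections import defaultdict

def format_comparison(results: list[dict]) -> str:
    """Multi-line provider comparison: "provider: avg_latency=X.XXs count=N"."""
    latencies = defaultdict(list)
    for r in results:
        if r.get("model") == "error":  # skip failed calls
            continue
        latencies[r["provider"]].append(r["latency_seconds"])
    lines = []
    for provider in sorted(latencies):  # alphabetical by provider name
        vals = latencies[provider]
        avg = sum(vals) / len(vals)
        lines.append(f"{provider}: avg_latency={avg:.2f}s count={len(vals)}")
    return "\n".join(lines)
```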
LLM-as-Judge — Automated Quality Scoring
Speed and cost are easy to measure. Quality is the hard part. Reading every response manually doesn't scale once you have 50+ prompt-provider combinations. The LLM-as-Judge pattern uses one model to score another model's output.
Send the original prompt, the model's response, and the evaluation criteria to a judge model. It returns a score from 1 to 5 with reasoning. I typically use GPT-4o-mini as the judge — it's cheap, fast, and surprisingly consistent.
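A sketch of that judge call. The judge prompt wording and the JSON contract are assumptions, not a fixed spec — the important parts are the 1-5 scale, the reasoning field, and a fallback when the judge returns unparseable output:

```python
import json

JUDGE_PROMPT = """Rate the response below on a 1-5 scale for how well it
answers the prompt. Reply with JSON only: {{"score": <1-5>, "reasoning": "<why>"}}

Prompt: {prompt}
Response: {response}"""

async def judge_response(client, prompt: str, response_text: str) -> dict:
    """Score one response using GPT-4o-mini as the judge.

    `client` is an openai.AsyncOpenAI instance (assumed). A score of 0 marks
    responses the judge couldn't evaluate, so analysis can skip them.
    """
    judged = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            prompt=prompt, response=response_text)}],
        temperature=0.0,  # keep the judge as consistent as possible
    )
    try:
        verdict = json.loads(judged.choices[0].message.content)
        return {"quality_score": int(verdict["score"]),
                "quality_reasoning": verdict["reasoning"]}
    except (json.JSONDecodeError, KeyError, ValueError):
        return {"quality_score": 0,
                "quality_reasoning": "judge returned unparseable output"}
```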
With the judge function ready, we score all collected results in a single pass. This mutates the result dictionaries in place — adding the quality score and reasoning to each one.
Leaderboard and Recommendations
All data is collected and scored. The leaderboard combines speed, cost, and quality into a ranked view. I find that printing both overall and per-category rankings is crucial — the overall winner is rarely the best at everything.
A leaderboard shows ranks, but doesn't answer "which provider for this task?" The recommendation function takes a category and a priority and returns the best provider for that combination.
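A sketch of that recommender. The priority names ("quality", "speed", "cost") mirror the description above but are otherwise an assumption:

```python
def recommend_provider(results: list[dict], category: str,
                       priority: str = "quality") -> str:
    """Best provider for one category by one metric.

    priority: "quality" (higher is better), "speed" or "cost" (lower is better).
    Returns "none" if the category has no valid results.
    """
    metric = {"quality": "quality_score",
              "speed": "latency_seconds",
              "cost": "cost_usd"}[priority]
    scores: dict[str, list[float]] = {}
    for r in results:
        if r.get("model") == "error" or r["prompt_category"] != category:
            continue
        scores.setdefault(r["provider"], []).append(r[metric])
    if not scores:
        return "none"
    averages = {p: sum(v) / len(v) for p, v in scores.items()}
    pick = max if priority == "quality" else min
    return pick(averages, key=averages.get)
```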
That function is what you bring to the design review. Instead of "I think Claude is better," you say "Claude scored 4.7/5 on summarization at $0.003 per call, while GPT-4o scored 4.5 at $0.008 — Claude is 40% cheaper with slightly higher quality."
Write a function weighted_score(quality, latency, cost, weights) that computes a single composite score from three metrics.
Parameters:
- quality: score 1-5 (higher is better)
- latency: seconds (lower is better)
- cost: USD (lower is better)
- weights: dict like {"quality": 0.5, "speed": 0.3, "cost": 0.2}

Normalize each metric to 0-1:
1. quality_norm = quality / 5
2. speed_norm = max(0, 1 - latency / 10)
3. cost_norm = max(0, 1 - cost / 0.01)
Return the weighted sum rounded to 4 decimal places.
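One possible solution, following the normalization steps exactly as specified:

```python
def weighted_score(quality: float, latency: float, cost: float,
                   weights: dict) -> float:
    """Composite score from quality, latency, and cost, normalized to 0-1."""
    quality_norm = quality / 5                 # 1-5 scale -> 0-1
    speed_norm = max(0, 1 - latency / 10)      # 10s or slower scores 0
    cost_norm = max(0, 1 - cost / 0.01)        # $0.01 or pricier scores 0
    total = (weights["quality"] * quality_norm
             + weights["speed"] * speed_norm
             + weights["cost"] * cost_norm)
    return round(total, 4)
```

The 10-second and $0.01 ceilings are the exercise's chosen normalization constants; tune them to your own latency and cost budgets.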
Common Benchmarking Mistakes
I've reviewed benchmarks from three different teams, and the same mistakes appear every time. These aren't style preferences — each one produces misleading results that could cost thousands in the wrong provider choice.
```python
# Mistake: a different temperature per provider
# OpenAI
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,  # Creative
)
# Claude — different temperature!
response = await client.messages.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # Deterministic
)
```

The fix is a single shared constant:

```python
BENCHMARK_TEMP = 0.0  # One setting for all

# OpenAI
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=BENCHMARK_TEMP,
)
# Claude — same temperature
response = await client.messages.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt}],
    temperature=BENCHMARK_TEMP,
)
```

Another mistake: comparing a frontier model (GPT-4o) against a lightweight model (Gemini Flash) and concluding "Gemini is 20x cheaper." That's like comparing a sedan to a bicycle on price. Compare models in the same tier — GPT-4o vs Claude Sonnet vs Gemini Pro — and lightweight models against each other.
Extending the Framework
The core workflow handles send-measure-score-rank. Here are three extensions that take it closer to production use.
Write a function recommendation_report(results) that produces a per-category recommendation.
The function should:
1. Find all unique prompt_category values in the results
2. For each category, find the provider with the highest average quality_score
3. Return a dictionary mapping each category to the winning provider name
4. Skip results where model == "error" or quality_score == 0
5. If a category has no valid results, map it to "none"
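One possible solution to the exercise, covering all five requirements:

```python
def recommendation_report(results: list[dict]) -> dict:
    """Map each prompt_category to the provider with the best average quality."""
    # Register every category first, so categories with only invalid
    # results still appear in the report (mapped to "none").
    by_category: dict[str, dict[str, list[float]]] = {}
    for r in results:
        by_category.setdefault(r["prompt_category"], {})
    for r in results:
        if r.get("model") == "error" or r.get("quality_score", 0) == 0:
            continue
        providers = by_category[r["prompt_category"]]
        providers.setdefault(r["provider"], []).append(r["quality_score"])
    report = {}
    for category, providers in by_category.items():
        if not providers:
            report[category] = "none"
            continue
        averages = {p: sum(v) / len(v) for p, v in providers.items()}
        report[category] = max(averages, key=averages.get)
    return report
```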
When Benchmarking Isn't Worth It
Not every project needs a formal benchmark. For a quick prototype or internal tool with low traffic, just pick GPT-4o-mini or Gemini Flash and start building. The cost difference at low volume is negligible — your engineering time is worth more than the $2 you'd save per month.
Benchmarking pays off for production systems handling thousands of requests daily. At that scale, a 30% cost difference means real money, and a 0.5-second latency gap affects user experience. That's when you pull out this framework, define your actual prompts, and let the numbers decide.
Summary
The key insight: choosing an LLM provider is not a one-time decision. Models get updated, pricing changes, and your use cases evolve. The framework you built is designed to be re-run whenever the landscape shifts.
You now have a framework that turns "I think Provider X is best" into "Provider X scores 4.3/5 on our tasks at $0.004 per call with 1.8s latency." That's the difference between a gut feeling and a data-driven recommendation.
Frequently Asked Questions
How many test prompts do I need for reliable results?
For a quick directional signal, 5 per category works. For a production architecture decision, use 15-20 prompts per category with 3 runs each. That gives enough data to spot variance and outliers.
Should I benchmark the latest model from each provider?
Benchmark the model you'd actually deploy. If GPT-4o-mini handles your task, benchmarking GPT-4o wastes API spend. Start with the cheapest models and only move up if quality doesn't meet your bar.
Can I use Claude or Gemini as the judge?
Any model with strong instruction-following works. The requirement is consistency — the judge should score similarly across reruns. Claude Haiku and Gemini Flash are both viable. Just avoid using a model as both judge and contestant.
How often should I re-run benchmarks?
Quarterly, or when a provider ships a major model update. OpenAI, Anthropic, and Google each release new models every few months. Save your test prompts and scoring criteria so re-runs take one command.