
LLM Provider Showdown: Benchmark GPT-4o vs Claude vs Gemini vs Llama on Your Own Tasks

Beginner · 90 min · 3 exercises · 50 XP

Your team just asked which LLM provider to use for the product. You tested one prompt in ChatGPT and another in Claude, got decent answers from both, and someone wants a "data-driven recommendation." Gut feelings don't survive design reviews.

This tutorial builds a benchmarking framework that sends the same prompts to GPT-4o, Claude, Gemini, and Llama, measures response time and cost, and uses an LLM as an automated judge to score quality. By the end, a recommend_provider() function picks the best model for any task based on actual numbers.

Why Public Benchmarks Don't Tell You Enough

I spent a week reading MMLU scores and Chatbot Arena rankings before choosing a provider for a summarization project. The model that topped general benchmarks produced the worst summaries for our legal documents. Public benchmarks test academic tasks — your product has specific prompts, specific data, and specific quality standards.

A custom benchmark answers three questions no leaderboard can: How fast does each provider respond to your prompts? How good are the outputs for your use case? What does each response actually cost?

The Benchmark Result Structure

Before calling any API, define what data you want to collect. I've been burned by trying to retrofit measurement after the fact — you always forget something important, usually cost or token counts.

Benchmark result template
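A minimal sketch of such a template (the helper name `make_result` and the exact field names are illustrative, not the course's own):

```python
def make_result(provider, model, prompt_category, prompt):
    """Blank benchmark result; filled in as each call completes."""
    return {
        "provider": provider,            # e.g. "openai"
        "model": model,                  # e.g. "gpt-4o"
        "prompt_category": prompt_category,
        "prompt": prompt,
        "response": "",
        "latency_seconds": 0.0,
        "input_tokens": 0,
        "output_tokens": 0,
        "cost_usd": 0.0,
        "quality_score": 0,              # filled in later by the LLM judge
        "quality_reasoning": "",
    }
```

Every benchmark function returns a dict in this shape, so the analysis code can treat all results uniformly.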

Eleven fields capture everything you need to compare providers. The prompt_category field matters more than it looks — it lets you discover that Claude might win at summarization while GPT-4o wins at code generation. A single "best provider" answer is almost always wrong.

Designing Test Prompts That Actually Test Something

The prompts you benchmark with determine whether the results mean anything. I've seen teams benchmark with "Write a haiku about Python" and then wonder why production performance differed. Your test prompts should mirror actual use cases.

Test prompts by category
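An illustrative prompt set, assuming three categories; in practice, swap in prompts copied from your product's real traffic:

```python
# Illustrative prompts only — replace with prompts from your actual use cases.
TEST_PROMPTS = {
    "summarization": [
        "Summarize this clause in two sentences: 'The lessee shall be liable...'",
        "Summarize the key risks in the following incident report: ...",
    ],
    "code_generation": [
        "Write a Python function that deduplicates a list while preserving order.",
        "Write a SQL query returning the top 5 customers by total order value.",
    ],
    "extraction": [
        "Extract the invoice number, date, and total from this text: ...",
        "List every person named in the following paragraph: ...",
    ],
}
```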

Cost Tracking — Pricing Table and Calculator

Every provider prices differently, and the numbers change every few months. I keep a pricing dictionary in every LLM project — two minutes to update and it saves hours of "wait, how much did that cost?" later.

Per-token pricing for each model
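A sketch of such a table. The rates below are a snapshot in USD per million tokens and will drift; always verify against each provider's current pricing page before trusting any cost numbers:

```python
# USD per 1M tokens — snapshot only, check current pricing pages before use.
PRICING = {
    "gpt-4o":                   {"input": 2.50,  "output": 10.00},
    "gpt-4o-mini":              {"input": 0.15,  "output": 0.60},
    "claude-sonnet-4-20250514": {"input": 3.00,  "output": 15.00},
    "gemini-1.5-flash":         {"input": 0.075, "output": 0.30},
    "llama3.1:8b":              {"input": 0.0,   "output": 0.0},  # local via Ollama
}
```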

GPT-4o is roughly 16x more expensive than GPT-4o-mini for the same prompt at current list prices. Gemini Flash is the cheapest cloud option. Ollama is free but requires your own hardware. These trade-offs are exactly why benchmarking matters.

The cost function takes a model name and token counts, looks up the rates, and multiplies. It returns 0.0 for unknown models rather than crashing — important when you add a new model mid-benchmark.

Cost calculator function
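A sketch of the calculator. A two-entry pricing table is repeated inline so the snippet runs standalone; the rates are illustrative snapshots:

```python
# Snapshot rates in USD per 1M tokens — illustrative; check current pricing pages.
PRICING = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one call; returns 0.0 for models not in the table."""
    rates = PRICING.get(model)
    if rates is None:
        return 0.0  # unknown model — don't crash mid-benchmark
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000
```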

Benchmarking Each Provider

This is where the framework earns its keep. One function per provider sends a prompt, captures the response, measures latency, extracts token usage, and calculates cost. Each returns a dictionary matching our result template.

Benchmark function for OpenAI
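A sketch of what such a function might look like, assuming the official `openai` async client (`AsyncOpenAI`). The client is passed in as a parameter so the function can be exercised against a stub without network access; cost is left at 0.0 here and filled in by your cost calculator:

```python
import time

async def benchmark_openai(client, prompt: str, category: str, model: str = "gpt-4o"):
    """Send one prompt to OpenAI and capture latency, tokens, and output."""
    start = time.perf_counter()
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic for a fair comparison
    )
    latency = time.perf_counter() - start
    return {
        "provider": "openai",
        "model": model,
        "prompt_category": category,
        "prompt": prompt,
        "response": response.choices[0].message.content,
        "latency_seconds": round(latency, 3),
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "cost_usd": 0.0,  # fill in with your cost calculator
    }
```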

Notice temperature=0.0. This makes responses as deterministic as possible, which is what you want for a fair comparison. If you're benchmarking creative writing, bump it to 0.7 — but keep it consistent across all providers.

The Anthropic benchmark follows the same pattern with two API shape differences: Claude uses messages.create() instead of chat.completions.create(), and token usage lives in response.usage.input_tokens rather than response.usage.prompt_tokens.

Benchmark function for Anthropic Claude
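A matching sketch, assuming the official `anthropic` async client. `max_tokens` is required by the Messages API, and the response text lives in content blocks rather than choices:

```python
import time

async def benchmark_anthropic(client, prompt: str, category: str,
                              model: str = "claude-sonnet-4-20250514"):
    """Send one prompt to Claude and capture latency, tokens, and output."""
    start = time.perf_counter()
    response = await client.messages.create(
        model=model,
        max_tokens=1024,  # required parameter on the Messages API
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    latency = time.perf_counter() - start
    return {
        "provider": "anthropic",
        "model": model,
        "prompt_category": category,
        "prompt": prompt,
        "response": response.content[0].text,       # content blocks, not choices
        "latency_seconds": round(latency, 3),
        "input_tokens": response.usage.input_tokens,  # not prompt_tokens
        "output_tokens": response.usage.output_tokens,
        "cost_usd": 0.0,
    }
```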
Benchmark function for Google Gemini
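A comparable sketch for Gemini, assuming the `google-generativeai` SDK, where a `GenerativeModel` exposes `generate_content_async` and token counts live on `usage_metadata`. The model object is injected so the sketch can be tested offline:

```python
import time

async def benchmark_gemini(model_client, prompt: str, category: str,
                           model_name: str = "gemini-1.5-flash"):
    """Send one prompt to Gemini and capture latency, tokens, and output."""
    start = time.perf_counter()
    response = await model_client.generate_content_async(prompt)
    latency = time.perf_counter() - start
    usage = response.usage_metadata
    return {
        "provider": "google",
        "model": model_name,
        "prompt_category": category,
        "prompt": prompt,
        "response": response.text,
        "latency_seconds": round(latency, 3),
        "input_tokens": usage.prompt_token_count,
        "output_tokens": usage.candidates_token_count,
        "cost_usd": 0.0,
    }
```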

Running the Full Benchmark Suite

Individual provider functions are ready. The runner loops through all prompts and all providers, collecting results into a flat list. Every row is one prompt-provider pair, which makes analysis straightforward.

Full benchmark runner with error handling
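One way to structure the runner: the provider functions are passed in as async callables taking `(prompt, category)`, and a failure in any one call is recorded as an error row instead of aborting the run. Names here are illustrative:

```python
async def run_benchmark(prompts_by_category, provider_funcs):
    """Run every provider on every prompt; one result dict per pair."""
    results = []
    for category, prompts in prompts_by_category.items():
        for prompt in prompts:
            for func in provider_funcs:
                try:
                    results.append(await func(prompt, category))
                except Exception as exc:
                    # A bad key or rate limit must not kill the whole run.
                    results.append({
                        "provider": getattr(func, "__name__", "unknown"),
                        "model": "error",
                        "prompt_category": category,
                        "prompt": prompt,
                        "response": str(exc),
                        "latency_seconds": 0.0,
                        "input_tokens": 0,
                        "output_tokens": 0,
                        "cost_usd": 0.0,
                    })
    return results
```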

The try/except block is essential. If one API key is wrong or a provider is down, the benchmark still finishes. I learned this after losing 20 minutes of Claude and Gemini results because an OpenAI rate limit killed the whole run.

Analyzing Speed and Cost

Raw results need summarization. These functions compute averages per provider and per category — pure Python, no pandas required.

Provider summary statistics
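A sketch of the per-provider summary in pure Python (function and key names are illustrative); failed calls are skipped so they don't drag the averages down:

```python
def provider_summary(results):
    """Average latency and cost per provider, skipping failed calls."""
    totals = {}
    for r in results:
        if r["model"] == "error":
            continue
        t = totals.setdefault(r["provider"], {"latency": 0.0, "cost": 0.0, "count": 0})
        t["latency"] += r["latency_seconds"]
        t["cost"] += r["cost_usd"]
        t["count"] += 1
    return {
        provider: {"avg_latency": t["latency"] / t["count"],
                   "avg_cost": t["cost"] / t["count"],
                   "count": t["count"]}
        for provider, t in totals.items()
    }
```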

Averages hide important differences though. A provider might be fast on short prompts and slow on complex ones. The per-category view reveals these patterns.

Per-category breakdown
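One way to compute the per-category view is to key averages by `(category, provider)` pairs (a sketch; names are illustrative):

```python
from collections import defaultdict

def category_breakdown(results):
    """Average latency for each (category, provider) pair."""
    buckets = defaultdict(list)
    for r in results:
        if r["model"] != "error":
            buckets[(r["prompt_category"], r["provider"])].append(r["latency_seconds"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}
```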
Build a Comparison Formatter
Write Code

Write a function format_comparison(results) that takes a list of benchmark result dictionaries and returns a formatted string comparing providers.

The function should:

1. Group results by provider

2. Calculate the average latency for each provider

3. Return a multi-line string with one line per provider: "provider_name: avg_latency=X.XXs count=N"

4. Sort lines alphabetically by provider name

Each result dict has at minimum: "provider" (str), "latency_seconds" (float), "model" (str). Skip results where model == "error".

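One possible solution to the exercise spec (yours may differ):

```python
def format_comparison(results):
    """One line per provider: 'name: avg_latency=X.XXs count=N', sorted by name."""
    totals = {}
    for r in results:
        if r["model"] == "error":
            continue  # per the spec, skip failed calls
        lat, n = totals.get(r["provider"], (0.0, 0))
        totals[r["provider"]] = (lat + r["latency_seconds"], n + 1)
    lines = [
        f"{provider}: avg_latency={lat / n:.2f}s count={n}"
        for provider, (lat, n) in sorted(totals.items())
    ]
    return "\n".join(lines)
```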

LLM-as-Judge — Automated Quality Scoring

Speed and cost are easy to measure. Quality is the hard part. Reading every response manually doesn't scale once you have 50+ prompt-provider combinations. The LLM-as-Judge pattern uses one model to score another model's output.

Send the original prompt, the model's response, and the evaluation criteria to a judge model. It returns a score from 1 to 5 with reasoning. I typically use GPT-4o-mini as the judge — it's cheap, fast, and surprisingly consistent.

LLM-as-Judge scoring function
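A sketch of the judge, assuming an OpenAI-style async client and a JSON verdict; the prompt wording is illustrative, and unparseable judge output falls back to a zero score rather than crashing:

```python
import json

JUDGE_PROMPT = """\
You are grading an AI response. Score it 1-5 for accuracy and usefulness.
Original prompt: {prompt}
Response to grade: {response}
Reply with JSON only: {{"score": <1-5>, "reasoning": "<one sentence>"}}"""

async def judge_response(client, prompt: str, response_text: str,
                         judge_model: str = "gpt-4o-mini"):
    """Return (score, reasoning) from the judge model; (0, ...) on parse failure."""
    completion = await client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(prompt=prompt,
                                                  response=response_text)}],
        temperature=0.0,
    )
    try:
        verdict = json.loads(completion.choices[0].message.content)
        return int(verdict["score"]), verdict.get("reasoning", "")
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0, "judge returned unparseable output"
```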

With the judge function ready, we score all collected results in a single pass. This mutates the result dictionaries in place — adding the quality score and reasoning to each one.

Score all results
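A minimal scoring pass, with the judge injected as an async callable; error rows get a score of 0 so they sink to the bottom of any ranking:

```python
async def score_all(results, judge):
    """Mutate each result dict in place, adding quality_score and quality_reasoning."""
    for r in results:
        if r["model"] == "error":
            r["quality_score"], r["quality_reasoning"] = 0, "call failed"
            continue
        score, reasoning = await judge(r["prompt"], r["response"])
        r["quality_score"], r["quality_reasoning"] = score, reasoning
    return results
```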

Leaderboard and Recommendations

All data is collected and scored. The leaderboard combines speed, cost, and quality into a ranked view. I find that printing both overall and per-category rankings is crucial — the overall winner is rarely the best at everything.

Overall leaderboard
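One simple leaderboard ranks providers by average quality and shows average latency and cost alongside; this is a sketch, and how you weight the three metrics is up to you:

```python
def leaderboard(results):
    """Rows of (provider, avg_quality, avg_latency, avg_cost), best quality first."""
    agg = {}
    for r in results:
        if r["model"] == "error":
            continue
        s = agg.setdefault(r["provider"], {"q": 0.0, "lat": 0.0, "cost": 0.0, "n": 0})
        s["q"] += r.get("quality_score", 0)
        s["lat"] += r["latency_seconds"]
        s["cost"] += r["cost_usd"]
        s["n"] += 1
    rows = [(p, s["q"] / s["n"], s["lat"] / s["n"], s["cost"] / s["n"])
            for p, s in agg.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```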

A leaderboard shows ranks, but doesn't answer "which provider for this task?" The recommendation function takes a category and a priority and returns the best provider for that combination.

Provider recommendation function
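A sketch of what `recommend_provider` might look like. The `(results, category, priority)` signature is an assumption; it returns `None` when the category has no valid results:

```python
def recommend_provider(results, category: str, priority: str = "quality"):
    """priority: 'quality' (max score), 'speed' (min latency), or 'cost' (min cost)."""
    agg = {}
    for r in results:
        if r["model"] == "error" or r["prompt_category"] != category:
            continue
        s = agg.setdefault(r["provider"], {"q": 0.0, "lat": 0.0, "cost": 0.0, "n": 0})
        s["q"] += r.get("quality_score", 0)
        s["lat"] += r["latency_seconds"]
        s["cost"] += r["cost_usd"]
        s["n"] += 1
    if not agg:
        return None  # no valid data for this category
    if priority == "quality":
        return max(agg, key=lambda p: agg[p]["q"] / agg[p]["n"])
    if priority == "speed":
        return min(agg, key=lambda p: agg[p]["lat"] / agg[p]["n"])
    return min(agg, key=lambda p: agg[p]["cost"] / agg[p]["n"])
```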

That function is what you bring to the design review. Instead of "I think Claude is better," you say "Claude scored 4.7/5 on summarization at $0.003 per call, while GPT-4o scored 4.5 at $0.008 — Claude is roughly 60% cheaper with slightly higher quality."

Implement Weighted Scoring
Write Code

Write a function weighted_score(quality, latency, cost, weights) that computes a single composite score from three metrics.

  • quality: score 1-5 (higher is better)
  • latency: seconds (lower is better)
  • cost: USD (lower is better)
  • weights: dict like {"quality": 0.5, "speed": 0.3, "cost": 0.2}
Normalize each metric to 0-1:

1. quality_norm = quality / 5

2. speed_norm = max(0, 1 - latency / 10)

3. cost_norm = max(0, 1 - cost / 0.01)

Return the weighted sum rounded to 4 decimal places.

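One possible solution to the exercise spec:

```python
def weighted_score(quality, latency, cost, weights):
    """Composite 0-1 score from quality (1-5), latency (s), and cost (USD)."""
    quality_norm = quality / 5                # higher is better
    speed_norm = max(0, 1 - latency / 10)     # lower latency is better
    cost_norm = max(0, 1 - cost / 0.01)       # lower cost is better
    total = (weights["quality"] * quality_norm
             + weights["speed"] * speed_norm
             + weights["cost"] * cost_norm)
    return round(total, 4)
```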

Common Benchmarking Mistakes

I've reviewed benchmarks from three different teams, and the same mistakes appear every time. These aren't style preferences — each one produces misleading results that could cost thousands in the wrong provider choice.

Inconsistent temperature across providers
# OpenAI
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,  # Creative
)

# Claude — different temperature!
response = await client.messages.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # Deterministic
)
Same temperature for all providers
BENCHMARK_TEMP = 0.0  # One setting for all

# OpenAI
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=BENCHMARK_TEMP,
)

# Claude — same temperature
response = await client.messages.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt}],
    temperature=BENCHMARK_TEMP,
)

Another mistake: comparing a frontier model (GPT-4o) against a lightweight model (Gemini Flash) and concluding "Gemini is 20x cheaper." That's like comparing a sedan to a bicycle on price. Compare models in the same tier — GPT-4o vs Claude Sonnet vs Gemini Pro — and lightweight models against each other.

Extending the Framework

The core workflow handles send-measure-score-rank. Here are three extensions that take it closer to production use.

Framework extensions
Build a Category-Aware Recommendation Report
Write Code

Write a function recommendation_report(results) that produces a per-category recommendation.

The function should:

1. Find all unique prompt_category values in the results

2. For each category, find the provider with the highest average quality_score

3. Return a dictionary mapping each category to the winning provider name

4. Skip results where model == "error" or quality_score == 0

5. If a category has no valid results, map it to "none"

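One possible solution to the exercise spec:

```python
def recommendation_report(results):
    """Map each prompt_category to the provider with the best average quality."""
    report = {}
    categories = {r["prompt_category"] for r in results}
    for category in sorted(categories):
        scores = {}
        for r in results:
            if (r["prompt_category"] != category
                    or r["model"] == "error"
                    or r.get("quality_score", 0) == 0):
                continue  # per the spec, skip failed and unscored results
            s = scores.setdefault(r["provider"], [0.0, 0])
            s[0] += r["quality_score"]
            s[1] += 1
        if not scores:
            report[category] = "none"
        else:
            report[category] = max(scores, key=lambda p: scores[p][0] / scores[p][1])
    return report
```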

When Benchmarking Isn't Worth It

Not every project needs a formal benchmark. For a quick prototype or internal tool with low traffic, just pick GPT-4o-mini or Gemini Flash and start building. The cost difference at low volume is negligible — your engineering time is worth more than the $2 you'd save per month.

Benchmarking pays off for production systems handling thousands of requests daily. At that scale, a 30% cost difference means real money, and a 0.5-second latency gap affects user experience. That's when you pull out this framework, define your actual prompts, and let the numbers decide.

Summary

The key insight: choosing an LLM provider is not a one-time decision. Models get updated, pricing changes, and your use cases evolve. The framework you built is designed to be re-run whenever the landscape shifts.

Workflow recap

You now have a framework that turns "I think Provider X is best" into "Provider X scores 4.3/5 on our tasks at $0.004 per call with 1.8s latency." That's the difference between a gut feeling and a data-driven recommendation.

    References

  • OpenAI API Reference — Chat completions, token usage, pricing
  • Anthropic API Documentation — Messages API, content blocks, usage tracking
  • Google AI for Developers — Gemini API, GenerateContent, usage metadata
  • Ollama Documentation — Local model serving, OpenAI-compatible endpoint
  • LMSYS Chatbot Arena — Crowdsourced LLM evaluation and Elo rankings
  • Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (2023) — Research on LLM evaluation reliability
  • OpenAI Pricing — Current GPT-4o and GPT-4o-mini token pricing
  • Anthropic Pricing — Claude model pricing tiers
Frequently Asked Questions

    How many test prompts do I need for reliable results?

    For a quick directional signal, 5 per category works. For a production architecture decision, use 15-20 prompts per category with 3 runs each. That gives enough data to spot variance and outliers.

    Should I benchmark the latest model from each provider?

    Benchmark the model you'd actually deploy. If GPT-4o-mini handles your task, benchmarking GPT-4o wastes API spend. Start with the cheapest models and only move up if quality doesn't meet your bar.

    Can I use Claude or Gemini as the judge?

    Any model with strong instruction-following works. The requirement is consistency — the judge should score similarly across reruns. Claude Haiku and Gemini Flash are both viable. Just avoid using a model as both judge and contestant.

    How often should I re-run benchmarks?

    Quarterly, or when a provider ships a major model update. OpenAI, Anthropic, and Google each release new models every few months. Save your test prompts and scoring criteria so re-runs take one command.
