LLM Provider Showdown: Benchmark GPT-4o vs Claude vs Gemini vs Llama on Your Own Tasks
Your team just asked which LLM provider to use for the product. You tested one prompt in ChatGPT and another in Claude, got decent answers from both, and someone wants a "data-driven recommendation." Gut feelings don't survive design reviews.
This tutorial builds a benchmarking framework that sends the same prompts to GPT-4o, Claude, Gemini, and Llama, measures response time and cost, and uses an LLM as an automated judge to score quality. By the end, a recommend_provider() function picks the best model for any task based on actual numbers.
Why Public Benchmarks Don't Tell You Enough
I spent a week reading MMLU scores and Chatbot Arena rankings before choosing a provider for a summarization project. The model that topped general benchmarks produced the worst summaries for our legal documents. Public benchmarks test academic tasks — your product has specific prompts, specific data, and specific quality standards.
A custom benchmark answers three questions no leaderboard can: How fast does each provider respond to your prompts? How good are the outputs for your use case? What does each response actually cost?
The Benchmark Result Structure
Before calling any API, define what data you want to collect. I've been burned by trying to retrofit measurement after the fact — you always forget something important, usually cost or token counts.
Eleven fields capture everything you need to compare providers. The prompt_category field matters more than it looks — it lets you discover that Claude might win at summarization while GPT-4o wins at code generation. A single "best provider" answer is almost always wrong.
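A minimal sketch of that template, with illustrative field names (the exact names aren't fixed by anything above — only the count and intent are):

```python
def make_result(provider: str, model: str, prompt_category: str, prompt: str) -> dict:
    """Blank result row with all eleven fields; measurement code fills in the rest."""
    return {
        "provider": provider,
        "model": model,
        "prompt_category": prompt_category,
        "prompt": prompt,
        "response": "",
        "latency_seconds": 0.0,
        "input_tokens": 0,
        "output_tokens": 0,
        "cost_usd": 0.0,
        "quality_score": 0,        # filled in later by the LLM judge
        "quality_reasoning": "",   # the judge's explanation
    }
```

Defining the row up front means every provider function returns the same shape, so the analysis code never has to special-case anyone.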
Designing Test Prompts That Actually Test Something
The prompts you benchmark with determine whether the results mean anything. I've seen teams benchmark with "Write a haiku about Python" and then wonder why production performance differed. Your test prompts should mirror actual use cases.
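A sketch of what that prompt set might look like — the categories and prompts here are placeholders; swap in real prompts pulled from your product logs:

```python
# Illustrative test prompts grouped by category. The categories should match
# your product's actual workloads, not generic demo tasks.
TEST_PROMPTS = [
    {"category": "summarization",
     "prompt": "Summarize the following contract clause in two sentences: ..."},
    {"category": "code_generation",
     "prompt": "Write a Python function that deduplicates a list while preserving order."},
    {"category": "extraction",
     "prompt": "Extract all dates and monetary amounts from this invoice text: ..."},
]
```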
Cost Tracking — Pricing Table and Calculator
Every provider prices differently, and the numbers change every few months. I keep a pricing dictionary in every LLM project — two minutes to update and it saves hours of "wait, how much did that cost?" later.
GPT-4o is roughly 25x more expensive than GPT-4o-mini for the same prompt. Gemini Flash is the cheapest cloud option. Ollama is free but requires your own hardware. These trade-offs are exactly why benchmarking matters.
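A sketch of the pricing dictionary. The rates below are illustrative and will drift — treat them as placeholders and update from each provider's pricing page before trusting any cost numbers:

```python
# Per-million-token prices in USD. These change every few months;
# update from the providers' pricing pages before each benchmark run.
PRICING = {
    "gpt-4o":        {"input": 2.50, "output": 10.00},
    "gpt-4o-mini":   {"input": 0.15, "output": 0.60},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
    "gemini-flash":  {"input": 0.10, "output": 0.40},
    "llama-local":   {"input": 0.00, "output": 0.00},  # Ollama: you pay in hardware, not tokens
}
```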
The cost function takes a model name and token counts, looks up the rates, and multiplies. It returns 0.0 for unknown models rather than crashing — important when you add a new model mid-benchmark.
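A minimal version of that function. It assumes a `PRICING` dict keyed by model name with per-million-token `input`/`output` rates; the tiny table inside the block is illustrative so the sketch runs standalone:

```python
# Illustrative rates; in the real project this is the full pricing table.
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one call. Unknown models cost 0.0 instead of raising,
    so adding a new model mid-benchmark never crashes the run."""
    rates = PRICING.get(model)
    if rates is None:
        return 0.0
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
```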
Benchmarking Each Provider
This is where the framework earns its keep. One function per provider sends a prompt, captures the response, measures latency, extracts token usage, and calculates cost. Each returns a dictionary matching our result template.
Notice temperature=0.0. This makes responses as deterministic as possible, which is what you want for a fair comparison. If you're benchmarking creative writing, bump it to 0.7 — but keep it consistent across all providers.
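Here is a sketch of the OpenAI variant. The `client` parameter is assumed to be an `openai.AsyncOpenAI` instance (passing it in rather than constructing it inside keeps the function testable); cost calculation is left as a hook for your own pricing function:

```python
import time

BENCHMARK_TEMP = 0.0  # identical across all providers for a fair comparison

async def benchmark_openai(client, model: str, prompt: str, category: str) -> dict:
    """Send one prompt, measure wall-clock latency, and record token usage."""
    start = time.perf_counter()
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=BENCHMARK_TEMP,
    )
    latency = time.perf_counter() - start
    return {
        "provider": "openai",
        "model": model,
        "prompt_category": category,
        "prompt": prompt,
        "response": response.choices[0].message.content,
        "latency_seconds": round(latency, 3),
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "cost_usd": 0.0,  # plug in your cost function here
    }
```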
The Anthropic benchmark follows the same pattern with two API shape differences: Claude uses messages.create() instead of chat.completions.create(), and token usage lives in response.usage.input_tokens rather than response.usage.prompt_tokens.
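A sketch of the Anthropic variant, showing those two shape differences. The `client` is assumed to be an `anthropic.AsyncAnthropic` instance; note that the Messages API also requires `max_tokens`:

```python
import time

async def benchmark_anthropic(client, model: str, prompt: str, category: str) -> dict:
    """Same measurement as the OpenAI version, Anthropic API shape."""
    start = time.perf_counter()
    response = await client.messages.create(
        model=model,
        max_tokens=1024,  # required by the Anthropic API
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    latency = time.perf_counter() - start
    return {
        "provider": "anthropic",
        "model": model,
        "prompt_category": category,
        "prompt": prompt,
        "response": response.content[0].text,
        "latency_seconds": round(latency, 3),
        "input_tokens": response.usage.input_tokens,    # not prompt_tokens
        "output_tokens": response.usage.output_tokens,  # not completion_tokens
        "cost_usd": 0.0,  # plug in your cost function here
    }
```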
Running the Full Benchmark Suite
Individual provider functions are ready. The runner loops through all prompts and all providers, collecting results into a flat list. Every row is one prompt-provider pair, which makes analysis straightforward.
The try/except block is essential. If one API key is wrong or a provider is down, the benchmark still finishes. I learned this after losing 20 minutes of Claude and Gemini results because an OpenAI rate limit killed the whole run.
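A sketch of that runner. It assumes `provider_fns` maps a provider name to an async function taking `(prompt, category)` — in practice you'd bind the client and model with `functools.partial` before passing the per-provider functions in:

```python
async def run_benchmark(prompts: list[dict], provider_fns: dict) -> list[dict]:
    """Run every prompt against every provider; one failure never kills the run."""
    results = []
    for item in prompts:
        for name, fn in provider_fns.items():
            try:
                results.append(await fn(item["prompt"], item["category"]))
            except Exception as exc:
                # Record the failure and keep going — a wrong API key for one
                # provider shouldn't destroy everyone else's results.
                results.append({
                    "provider": name,
                    "model": "error",  # sentinel: analysis code skips these rows
                    "prompt_category": item["category"],
                    "prompt": item["prompt"],
                    "response": str(exc),
                    "latency_seconds": 0.0,
                    "input_tokens": 0,
                    "output_tokens": 0,
                    "cost_usd": 0.0,
                })
    return results
```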
Analyzing Speed and Cost
Raw results need summarization. These functions compute averages per provider and per category — pure Python, no pandas required.
Averages hide important differences though. A provider might be fast on short prompts and slow on complex ones. The per-category view reveals these patterns.
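One small grouping helper covers both views — group by provider for the overall picture, by category for the detailed one. A sketch:

```python
from collections import defaultdict

def average_by(results: list[dict], key: str, metric: str) -> dict:
    """Mean of `metric` grouped by `key`, skipping errored rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in results:
        if r.get("model") == "error":
            continue
        sums[r[key]] += r[metric]
        counts[r[key]] += 1
    return {k: round(sums[k] / counts[k], 4) for k in sums}
```

Usage: `average_by(results, "provider", "latency_seconds")` for the overall speed ranking, `average_by(results, "prompt_category", "cost_usd")` for per-category cost.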
Write a function format_comparison(results) that takes a list of benchmark result dictionaries and returns a formatted string comparing providers.
The function should:
1. Group results by provider
2. Calculate the average latency for each provider
3. Return a multi-line string with one line per provider: "provider_name: avg_latency=X.XXs count=N"
4. Sort lines alphabetically by provider name
Each result dict has at minimum: "provider" (str), "latency_seconds" (float), "model" (str). Skip results where model == "error".
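One possible solution to the exercise, following the four requirements above:

```python
from collections import defaultdict

def format_comparison(results: list[dict]) -> str:
    """Multi-line provider comparison: "provider: avg_latency=X.XXs count=N"."""
    latencies = defaultdict(list)
    for r in results:
        if r.get("model") == "error":  # skip failed calls
            continue
        latencies[r["provider"]].append(r["latency_seconds"])
    lines = []
    for provider in sorted(latencies):  # alphabetical by provider name
        vals = latencies[provider]
        avg = sum(vals) / len(vals)
        lines.append(f"{provider}: avg_latency={avg:.2f}s count={len(vals)}")
    return "\n".join(lines)
```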
LLM-as-Judge — Automated Quality Scoring
Speed and cost are easy to measure. Quality is the hard part. Reading every response manually doesn't scale once you have 50+ prompt-provider combinations. The LLM-as-Judge pattern uses one model to score another model's output.
Send the original prompt, the model's response, and the evaluation criteria to a judge model. It returns a score from 1 to 5 with reasoning. I typically use GPT-4o-mini as the judge — it's cheap, fast, and surprisingly consistent.
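A sketch of that judge call. The judge prompt wording and the JSON contract are assumptions, not a fixed spec — the important parts are the 1-5 scale, the reasoning field, and a fallback when the judge returns unparseable output:

```python
import json

JUDGE_PROMPT = """Rate the response below on a 1-5 scale for how well it
answers the prompt. Reply with JSON only: {{"score": <1-5>, "reasoning": "<why>"}}

Prompt: {prompt}
Response: {response}"""

async def judge_response(client, prompt: str, response_text: str) -> dict:
    """Score one response using GPT-4o-mini as the judge.

    `client` is an openai.AsyncOpenAI instance (assumed). A score of 0 marks
    responses the judge couldn't evaluate, so analysis can skip them.
    """
    judged = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            prompt=prompt, response=response_text)}],
        temperature=0.0,  # keep the judge as consistent as possible
    )
    try:
        verdict = json.loads(judged.choices[0].message.content)
        return {"quality_score": int(verdict["score"]),
                "quality_reasoning": verdict["reasoning"]}
    except (json.JSONDecodeError, KeyError, ValueError):
        return {"quality_score": 0,
                "quality_reasoning": "judge returned unparseable output"}
```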
With the judge function ready, we score all collected results in a single pass. This mutates the result dictionaries in place — adding the quality score and reasoning to each one.
Leaderboard and Recommendations
All data is collected and scored. The leaderboard combines speed, cost, and quality into a ranked view. I find that printing both overall and per-category rankings is crucial — the overall winner is rarely the best at everything.
A leaderboard shows ranks, but doesn't answer "which provider for this task?" The recommendation function takes a category and a priority and returns the best provider for that combination.
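A sketch of that recommender. The priority names ("quality", "speed", "cost") mirror the description above but are otherwise an assumption:

```python
def recommend_provider(results: list[dict], category: str,
                       priority: str = "quality") -> str:
    """Best provider for one category by one metric.

    priority: "quality" (higher is better), "speed" or "cost" (lower is better).
    Returns "none" if the category has no valid results.
    """
    metric = {"quality": "quality_score",
              "speed": "latency_seconds",
              "cost": "cost_usd"}[priority]
    scores: dict[str, list[float]] = {}
    for r in results:
        if r.get("model") == "error" or r["prompt_category"] != category:
            continue
        scores.setdefault(r["provider"], []).append(r[metric])
    if not scores:
        return "none"
    averages = {p: sum(v) / len(v) for p, v in scores.items()}
    pick = max if priority == "quality" else min
    return pick(averages, key=averages.get)
```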
That function is what you bring to the design review. Instead of "I think Claude is better," you say "Claude scored 4.7/5 on summarization at $0.003 per call, while GPT-4o scored 4.5 at $0.008 — Claude is 40% cheaper with slightly higher quality."
Write a function weighted_score(quality, latency, cost, weights) that computes a single composite score from three metrics.
Parameters:
- quality: score 1-5 (higher is better)
- latency: seconds (lower is better)
- cost: USD (lower is better)
- weights: dict like {"quality": 0.5, "speed": 0.3, "cost": 0.2}

Normalize each metric to 0-1:
1. quality_norm = quality / 5
2. speed_norm = max(0, 1 - latency / 10)
3. cost_norm = max(0, 1 - cost / 0.01)
Return the weighted sum rounded to 4 decimal places.
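One possible solution, following the normalization steps exactly as specified:

```python
def weighted_score(quality: float, latency: float, cost: float,
                   weights: dict) -> float:
    """Composite score from quality, latency, and cost, normalized to 0-1."""
    quality_norm = quality / 5                 # 1-5 scale -> 0-1
    speed_norm = max(0, 1 - latency / 10)      # 10s or slower scores 0
    cost_norm = max(0, 1 - cost / 0.01)        # $0.01 or pricier scores 0
    total = (weights["quality"] * quality_norm
             + weights["speed"] * speed_norm
             + weights["cost"] * cost_norm)
    return round(total, 4)
```

The 10-second and $0.01 ceilings are the exercise's chosen normalization constants; tune them to your own latency and cost budgets.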
Common Benchmarking Mistakes
I've reviewed benchmarks from three different teams, and the same mistakes appear every time. These aren't style preferences — each one produces misleading results that could cost thousands in the wrong provider choice.
```python
# Mistake: a different temperature per provider
# OpenAI
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,  # Creative
)
# Claude — different temperature!
response = await client.messages.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # Deterministic
)
```

The fix is a single shared constant:

```python
BENCHMARK_TEMP = 0.0  # One setting for all

# OpenAI
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=BENCHMARK_TEMP,
)
# Claude — same temperature
response = await client.messages.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt}],
    temperature=BENCHMARK_TEMP,
)
```

Another mistake: comparing a frontier model (GPT-4o) against a lightweight model (Gemini Flash) and concluding "Gemini is 20x cheaper." That's like comparing a sedan to a bicycle on price. Compare models in the same tier — GPT-4o vs Claude Sonnet vs Gemini Pro — and lightweight models against each other.
Extending the Framework
The core workflow handles send-measure-score-rank. Here are three extensions that take it closer to production use.
Write a function recommendation_report(results) that produces a per-category recommendation.
The function should:
1. Find all unique prompt_category values in the results
2. For each category, find the provider with the highest average quality_score
3. Return a dictionary mapping each category to the winning provider name
4. Skip results where model == "error" or quality_score == 0
5. If a category has no valid results, map it to "none"
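One possible solution to the exercise, covering all five requirements:

```python
def recommendation_report(results: list[dict]) -> dict:
    """Map each prompt_category to the provider with the best average quality."""
    # Register every category first, so categories with only invalid
    # results still appear in the report (mapped to "none").
    by_category: dict[str, dict[str, list[float]]] = {}
    for r in results:
        by_category.setdefault(r["prompt_category"], {})
    for r in results:
        if r.get("model") == "error" or r.get("quality_score", 0) == 0:
            continue
        providers = by_category[r["prompt_category"]]
        providers.setdefault(r["provider"], []).append(r["quality_score"])
    report = {}
    for category, providers in by_category.items():
        if not providers:
            report[category] = "none"
            continue
        averages = {p: sum(v) / len(v) for p, v in providers.items()}
        report[category] = max(averages, key=averages.get)
    return report
```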
When Benchmarking Isn't Worth It
Not every project needs a formal benchmark. For a quick prototype or internal tool with low traffic, just pick GPT-4o-mini or Gemini Flash and start building. The cost difference at low volume is negligible — your engineering time is worth more than the $2 you'd save per month.
Benchmarking pays off for production systems handling thousands of requests daily. At that scale, a 30% cost difference means real money, and a 0.5-second latency gap affects user experience. That's when you pull out this framework, define your actual prompts, and let the numbers decide.
Summary
The key insight: choosing an LLM provider is not a one-time decision. Models get updated, pricing changes, and your use cases evolve. The framework you built is designed to be re-run whenever the landscape shifts.
You now have a framework that turns "I think Provider X is best" into "Provider X scores 4.3/5 on our tasks at $0.004 per call with 1.8s latency." That's the difference between a gut feeling and a data-driven recommendation.
Frequently Asked Questions
How many test prompts do I need for reliable results?
For a quick directional signal, 5 per category works. For a production architecture decision, use 15-20 prompts per category with 3 runs each. That gives enough data to spot variance and outliers.
Should I benchmark the latest model from each provider?
Benchmark the model you'd actually deploy. If GPT-4o-mini handles your task, benchmarking GPT-4o wastes API spend. Start with the cheapest models and only move up if quality doesn't meet your bar.
Can I use Claude or Gemini as the judge?
Any model with strong instruction-following works. The requirement is consistency — the judge should score similarly across reruns. Claude Haiku and Gemini Flash are both viable. Just avoid using a model as both judge and contestant.
How often should I re-run benchmarks?
Quarterly, or when a provider ships a major model update. OpenAI, Anthropic, and Google each release new models every few months. Save your test prompts and scoring criteria so re-runs take one command.