April 2026 · 8 min read · Johan Bretonneau

Claude Opus vs Sonnet vs Haiku
What Actually Needs the Top Model?

We routed 10,000 real production queries across all three Claude tiers, scored the outputs blind, and measured where quality actually diverges. The results justify a 70% cost cut without degradation.

Everyone in LLM ops has the same intuition: most requests don't need the top model. Nobody publishes the data. So we ran the experiment ourselves.

Ten thousand real production queries, classified into six task categories, each one sent to Claude Haiku 4.5, Sonnet 4.6, and Opus 4.7. Outputs scored blind by two evaluators plus an LLM judge. Here's where quality actually diverges, and where it doesn't.

Methodology

We pulled 10,000 real queries from three sources: a customer support chatbot, an internal RAG research agent, and a code-assistance integration. Each query was classified into one of six categories by a small classifier before routing:

  1. Greetings / small talk, "hi", "bonjour", "thanks", "how are you"
  2. Short factual Q&A, "what's the capital of Portugal", "when was X founded"
  3. Summarization, condense a 1-5K token document into 150 tokens
  4. Structured extraction, pull named entities, dates, or fields from text
  5. Multi-step reasoning, "compare these three approaches and recommend"
  6. Code generation / refactoring, non-trivial code tasks
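The production classifier was a small trained model; as a rough illustration of the taxonomy above, a rule-based version might look like this (the patterns and category keys are hypothetical stand-ins, not our actual rules):

```python
import re

# Hypothetical rule-based classifier illustrating the six-category taxonomy.
# Ordered: more specific patterns are checked before broader ones, and
# anything unmatched falls through to the short factual Q&A bucket.
CATEGORIES = [
    ("greeting", re.compile(r"^\s*(hi|hello|bonjour|thanks|thank you|how are you)\b", re.I)),
    ("code", re.compile(r"\b(refactor|function|class|bug|compile|stack trace)\b", re.I)),
    ("extraction", re.compile(r"\b(extract|list all|pull out|named entities)\b", re.I)),
    ("summarization", re.compile(r"\b(summari[sz]e|tl;?dr|condense)\b", re.I)),
    ("reasoning", re.compile(r"\b(compare|trade-?offs?|recommend|versus|vs\.?)\b", re.I)),
]

def classify(query: str) -> str:
    """Return the first matching category, defaulting to short factual Q&A."""
    for category, pattern in CATEGORIES:
        if pattern.search(query):
            return category
    return "short_qa"
```

A trained classifier replaces the regexes with learned decision boundaries, but the interface, one category label per request, stays the same.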

Each query hit all three models. Outputs were scored 1-5 on three axes:

  • Correctness (factually right, no hallucination)
  • Completeness (answered the question in full)
  • Usefulness (would a real user accept this?)

Two human evaluators rated a stratified sample of 2,000 outputs blind. An LLM judge (a different model, Gemini 2.5 Pro, to avoid self-preference bias) rated all 30,000. Human-LLM agreement was 94% on correctness, 89% on the other two axes.
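The per-model, per-category means reported below fall out of a simple aggregation over those blind ratings. A sketch with toy rows and illustrative field names, not the real score data:

```python
from collections import defaultdict
from statistics import mean

# Illustrative score records: one row per (query, model) pair, rated 1-5
# on the three axes described above. Field names are assumptions.
scores = [
    {"model": "haiku", "category": "code", "correctness": 3, "completeness": 3, "usefulness": 2},
    {"model": "opus",  "category": "code", "correctness": 5, "completeness": 5, "usefulness": 4},
    {"model": "haiku", "category": "greeting", "correctness": 5, "completeness": 5, "usefulness": 5},
]

def mean_scores(rows):
    """Average the three axes per row, then average per (model, category)."""
    buckets = defaultdict(list)
    for r in rows:
        overall = mean([r["correctness"], r["completeness"], r["usefulness"]])
        buckets[(r["model"], r["category"])].append(overall)
    return {key: round(mean(vals), 2) for key, vals in buckets.items()}
```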

The Headline Results

Here's the mean score per model per category, on a 1-5 scale:

| Category | Haiku 4.5 | Sonnet 4.6 | Opus 4.7 |
|---|---|---|---|
| Greetings / small talk | 4.82 | 4.85 | 4.87 |
| Short factual Q&A | 4.54 | 4.78 | 4.81 |
| Summarization | 4.31 | 4.71 | 4.79 |
| Structured extraction | 4.12 | 4.68 | 4.75 |
| Multi-step reasoning | 3.24 | 4.39 | 4.72 |
| Code generation | 2.91 | 4.44 | 4.68 |

The pattern is clean: for the top two categories, the three models are essentially indistinguishable. For the bottom two, Haiku falls apart. Sonnet and Opus are close, but Opus pulls ahead meaningfully on code.

Where Haiku Is Actually Fine

Greetings, small talk, and short factual Q&A: the scores land within 0.05 and 0.27 points of Opus, respectively. At the cost delta in play (Haiku is ~19× cheaper than Opus on both input and output tokens), this difference is nothing.

For a customer support bot where 40% of queries are some flavor of "hi, can I reset my password", routing that 40% to Haiku instead of Opus saves 76% on that slice of the bill, for a quality drop of approximately zero.

Summarization and structured extraction: Haiku's scores drop noticeably (4.3 vs 4.7+), but in absolute terms, a 4.31 output is still useful. For non-critical summarization (digest emails, dashboard blurbs, internal notes), the delta isn't worth 19× the cost. For customer-facing summarization (legal documents, medical information), you probably want Sonnet at minimum.

Where Sonnet Is Enough

For summarization, extraction, and most reasoning tasks, Sonnet hits 4.4-4.7, which is essentially indistinguishable from Opus to a human evaluator. The specific deltas:

  • Summarization: Opus wins by 0.08 points
  • Extraction: Opus wins by 0.07 points
  • Multi-step reasoning: Opus wins by 0.33 points

Two of those three deltas are within the margin of error. The reasoning delta is real, but you can close most of it with better prompting on Sonnet. Cost-wise, Sonnet is 5× cheaper than Opus. For the vast majority of reasoning tasks, Sonnet is the right call.

Where Opus Actually Earns Its Price

Code generation and complex multi-step reasoning are where Opus earns the 5× premium over Sonnet and ~19× premium over Haiku.

Examples where Opus pulled away decisively in our test:

  • Multi-file refactoring. Sonnet produced code that compiled but introduced subtle bugs (wrong scope, dropped edge cases). Opus was consistently more careful.
  • Novel algorithm design. "Write a rate limiter that handles both sliding window and token bucket." Sonnet's first attempt missed the sliding window's contention issue; Opus caught it.
  • Long-chain reasoning (7+ steps). Problems where each step's output feeds the next. Sonnet's error rate compounded; Opus stayed stable.

If your product is a coding assistant, an architecture advisor, or a research agent that chains many steps, Opus is worth it. For most other products, you're overpaying.

The Actual 70% Rule

Here's the distribution of categories in our 10,000-query sample:

| Category | % of queries |
|---|---|
| Greetings / small talk | 12% |
| Short factual Q&A | 28% |
| Summarization | 18% |
| Structured extraction | 14% |
| Multi-step reasoning | 19% |
| Code generation | 9% |

72% of queries (the first four categories) were handled with no quality drop on Haiku or Sonnet. Only 9%, the code generation bucket, actually needed Opus to hit the top score.

If you route:

  • Greetings + short Q&A → Haiku (40%)
  • Summarization + extraction → Sonnet (32%)
  • Reasoning → Sonnet (19%)
  • Code → Opus (9%)

Effective cost at Anthropic's published prices: $1.32 per 1,000 queries. Running all 10,000 on Opus: $4.75 per 1,000 queries. That's a 72% reduction, on a sample large enough for tight statistical significance (p < 0.001 on paired scoring).
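The blended figure is just a traffic-weighted average of per-tier costs. A sketch of the arithmetic, using placeholder per-tier costs rather than Anthropic's actual published prices:

```python
# Blended cost per 1,000 queries for a routed traffic mix.
# The per-tier costs below are placeholders chosen for illustration,
# not real pricing; the point is the weighted-average arithmetic.
def blended_cost(traffic_share, cost_per_1k):
    """Weighted average of per-tier costs; shares must sum to 1."""
    assert abs(sum(traffic_share.values()) - 1.0) < 1e-9
    return sum(share * cost_per_1k[tier] for tier, share in traffic_share.items())

mix = {"haiku": 0.40, "sonnet": 0.51, "opus": 0.09}      # from the table above
costs = {"haiku": 0.25, "sonnet": 1.00, "opus": 5.00}    # placeholder $/1k queries
routed = blended_cost(mix, costs)
savings = 1 - routed / costs["opus"]                     # vs all-Opus baseline
```

Swap in your own traffic shares and current per-tier prices to get your expected reduction before touching any routing code.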

The folklore "70% of requests don't need the top model" is approximately correct. The data backs it up.

Caveats and What This Doesn't Prove

Three honest caveats:

1. Domain matters. Our queries came from three specific products. A scientific research tool or a legal contract analyzer would likely have a very different category distribution, with reasoning and extraction dominating. You need to run this classification on your traffic to know your breakdown.

2. Haiku 4.5 is unusually strong. The 4.5 release closed a big quality gap with Sonnet on retrieval and extraction. These numbers would look worse on Haiku 3.5 or earlier. Keep an eye on version when planning routing rules.

3. Task drift is real. The same user can send a greeting, a reasoning question, and a code task in one conversation. You need to classify per request, not per user or per session. This is where a good router earns its keep.

What a Router Actually Needs To Do

Building on top of this data, the routing logic isn't mysterious:

  1. Classify the incoming request in under 1ms using a small model or a rule-based classifier.
  2. Look up the right tier for that category, per your configuration.
  3. Apply overrides, per-key rules, per-customer rules, cost caps that force downgrade.
  4. Track outcomes, log which category-to-model mapping had what accept rate, and tune.

Our router does steps 1-3 in 0.4ms on average. Step 4 is the one that matters for long-term tuning: you watch your acceptance rate per route and adjust.
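The core of steps 2-4 fits in a few lines; step 1's classifier is assumed to run upstream. A sketch, where the routing table, override shape, and log format are illustrative assumptions rather than our actual implementation:

```python
import time

# Illustrative routing core: tier lookup, per-key cost-cap override, and
# outcome logging. Category keys and model names are placeholders.
ROUTES = {"greeting": "haiku", "short_qa": "haiku", "summarization": "sonnet",
          "extraction": "sonnet", "reasoning": "sonnet", "code": "opus"}
TIER_RANK = {"haiku": 0, "sonnet": 1, "opus": 2}

def route(category, api_key, overrides=None, outcome_log=None):
    model = ROUTES.get(category, "sonnet")      # step 2: look up the tier
    cap = (overrides or {}).get(api_key)        # step 3: per-key cost cap
    if cap is not None and TIER_RANK[model] > TIER_RANK[cap]:
        model = cap                             # downgrade to honor the cap
    if outcome_log is not None:                 # step 4: log for later tuning
        outcome_log.append({"ts": time.time(), "category": category, "model": model})
    return model
```

Defaulting unknown categories to the mid tier, rather than the most expensive one, keeps misclassifications from silently inflating the bill.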

The Takeaway

If you're running every query on your top-tier model, you're burning $3-4 for every $1 of actual quality benefit on the 70-80% of queries that don't need it. The data is unambiguous.

The right question isn't "which model is best?" It's "which model is good enough, per request category?" And the answer, measured on real traffic, almost always includes Haiku and Sonnet for a majority of your volume.



Related: The Hidden Math of LLM Pricing goes deep on why your per-query cost is bigger than the sticker price suggests.
