May 2026 · 5 min read · Johan Bretonneau

LLM cost at scale: what happens at 10B, 50B, and 100B tokens/month

A concrete cost model for teams running 10-100 billion tokens per month through LLM APIs. Where the money goes, how routing changes the math, and what the invoice looks like.

Most LLM cost guides stop at "here is the price per million tokens for each model". That is useful at prototype scale. At production scale - 10B, 50B, 100B tokens per month - the pricing arithmetic changes in ways that are not obvious from the rate cards.

This post is the guide I wish existed when we first started hitting those volumes.

The baseline: what $1M in token costs actually buys

Let's anchor on real numbers from the major providers as of May 2026.

GPT-4o: $2.50/1M input · $10/1M output
Claude Sonnet 4.6: $3/1M input · $15/1M output
Gemini 2.0 Pro: $1.25/1M input · $5/1M output
GPT-4o mini: $0.15/1M input · $0.60/1M output
Claude Haiku 4.5: $0.80/1M input · $4/1M output
Gemini Flash 2.0: $0.075/1M input · $0.30/1M output

A typical production workload has a 70/30 input/output split. Let's call the blended rate for "send 1M tokens through a flagship model" roughly $4-5 per 1M tokens all-in (for GPT-4o: 0.7 × $2.50 + 0.3 × $10 = $4.75).

At 10B tokens/month through a flagship: $40,000-50,000/month. At 100B tokens/month: $400,000-500,000/month.
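The blended-rate arithmetic above is easy to reproduce. The following is a minimal sketch, assuming the 70/30 split and the GPT-4o list rates quoted earlier; the function names are illustrative, not part of any provider SDK.

```python
def blended_rate(input_rate, output_rate, input_share=0.70):
    """USD per 1M tokens when `input_share` of all tokens are input tokens."""
    return input_share * input_rate + (1 - input_share) * output_rate


def monthly_cost(tokens_per_month, rate_per_million):
    """USD per month for a token volume billed at a blended per-1M rate."""
    return tokens_per_month / 1_000_000 * rate_per_million


# GPT-4o list rates from the table above: $2.50 input, $10 output per 1M tokens.
rate = blended_rate(2.50, 10.00)
for volume in (10e9, 100e9):
    print(f"{volume / 1e9:.0f}B tokens/month: ${monthly_cost(volume, rate):,.0f}")
```

With these inputs the blended rate is $4.75 per 1M tokens, which sits inside the $4-5 range used in the text. Other rate pairs or a different input share can be plugged in the same way.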

The routing insight: not all tokens need flagship intelligence

Here is the thing most teams discover late: in a mixed production workload, roughly 60-80% of requests are not flagship-grade tasks.

  • User inputs to a RAG pipeline: mostly retrieval prompts
  • Document classification at scale: rule-following, not reasoning
  • Structured extraction from forms: repetitive, low-complexity
  • Short customer support answers: template-filling
  • Internal tooling calls: function dispatch with known schemas

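These tasks run well on mini/flash models at a fraction of the cost: at the list rates above, GPT-4o mini is roughly 17× cheaper per token than GPT-4o, and Gemini Flash 2.0 roughly 33× cheaper. The other 20-40% - complex reasoning, multi-step agent tasks, nuanced generation - genuinely needs flagship capacity.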

A smart router that scores request complexity and routes accordingly captures most of that gap without sacrificing quality.
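To make "scores request complexity" concrete, here is a deliberately simple sketch. It is not how any particular router works: the hint phrases, weights, and threshold are all hypothetical, and a production router would more likely use a small trained classifier calibrated on labeled traffic.

```python
from dataclasses import dataclass

# Hypothetical phrases that suggest multi-step reasoning; illustrative only.
REASONING_HINTS = ("explain why", "step by step", "trade-off", "root cause", "prove that")


@dataclass
class Request:
    prompt: str
    tool_calls_expected: int = 0  # multi-step agent tasks tend to chain tool calls


def complexity_score(req: Request) -> float:
    """Return a score in [0, 1]; higher means more likely to need a flagship model."""
    text = req.prompt.lower()
    hint_hits = sum(hint in text for hint in REASONING_HINTS)
    hint_signal = min(hint_hits / 2, 1.0)
    length_signal = min(len(text) / 4000, 1.0)  # long prompts are weak evidence only
    tool_signal = min(req.tool_calls_expected / 3, 1.0)
    return 0.5 * hint_signal + 0.2 * length_signal + 0.3 * tool_signal


def choose_tier(req: Request, threshold: float = 0.35) -> str:
    """Route to the flagship tier only when the score clears the threshold."""
    return "flagship" if complexity_score(req) >= threshold else "mini"


print(choose_tier(Request("Classify this invoice as PAID or UNPAID.")))
print(choose_tier(Request(
    "Find the root cause of this deadlock and explain why the lock ordering fails, step by step.",
    tool_calls_expected=2,
)))
```

The point of the sketch is the shape of the decision: a cheap, deterministic score computed before any model is called, with a threshold you can tune against quality measurements on your own traffic.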

The math at three scales

10B tokens/month

Full flagship routing (no smart routing):

  • ~7B input @ $2.50 = $17,500
  • ~3B output @ $10 = $30,000
  • Total: ~$47,500/month

With smart routing (65% downgraded to mini/flash):

  • 2.45B input flagship @ $2.50 = $6,125
  • 1.05B output flagship @ $10 = $10,500
  • 4.55B input mini @ $0.15 = $682
  • 1.95B output mini @ $0.60 = $1,170
  • Total: ~$18,480/month - 61% saving

With aggressive smart routing (80% downgraded):

  • 1.4B input flagship @ $2.50 = $3,500
  • 0.6B output flagship @ $10 = $6,000
  • 5.6B input flash @ $0.075 = $420
  • 2.4B output flash @ $0.30 = $720
  • Total: ~$10,640/month - 78% saving

The variance is wide because it depends heavily on your workload mix. Heavily agentic workloads (lots of multi-step reasoning) have less downgrade potential; heavily pipeline-based workloads (RAG, classification, extraction) have more.

50B tokens/month

At 5× the volume, the absolute savings become significant:

Scenario | Monthly cost | Annual cost
All flagship | ~$237,500 | ~$2.85M
65% smart routing | ~$92,400 | ~$1.11M
80% smart routing | ~$53,200 | ~$638K

Routing saves roughly $1.7M per year at a 65% downgrade rate and $2.2M per year at 80%. That is a meaningful line item.

100B tokens/month

At this scale, you are also likely in negotiated enterprise pricing territory with at least one provider. Let's use list rates as the ceiling:

Scenario | Monthly cost | Annual cost
All flagship | ~$475,000 | ~$5.7M
65% smart routing | ~$184,800 | ~$2.22M
80% smart routing | ~$106,400 | ~$1.28M

At 100B tokens/month, a well-tuned router saves up to about $4.4M/year (at an 80% downgrade rate) versus routing everything to flagships. The entire engineering team and infra budget for a mid-size AI product is often less than that.
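The scenario arithmetic in this section can be reproduced with a short script. This is a sketch under two simplifying assumptions that the worked numbers also make: the downgrade share applies uniformly to tokens, and every tier sees the same 70/30 input/output split. Function and constant names are illustrative.

```python
# List rates quoted earlier, USD per 1M tokens: (input, output).
FLAGSHIP = (2.50, 10.00)  # GPT-4o
MINI = (0.15, 0.60)       # GPT-4o mini
FLASH = (0.075, 0.30)     # Gemini Flash 2.0


def tier_cost(tokens, rates, input_share=0.70):
    """USD cost of `tokens` on one tier at the given input/output split."""
    input_rate, output_rate = rates
    per_million = input_share * input_rate + (1 - input_share) * output_rate
    return tokens / 1_000_000 * per_million


def monthly_cost(total_tokens, downgrade_share, cheap_rates, flagship_rates=FLAGSHIP):
    """Cost when `downgrade_share` of tokens go to the cheap tier, the rest to the flagship."""
    cheap = tier_cost(total_tokens * downgrade_share, cheap_rates)
    flagship = tier_cost(total_tokens * (1 - downgrade_share), flagship_rates)
    return cheap + flagship


for volume in (10e9, 50e9, 100e9):
    baseline = monthly_cost(volume, 0.0, MINI)   # 0% downgraded: all flagship
    routed = monthly_cost(volume, 0.80, FLASH)   # aggressive routing scenario
    saving = 1 - routed / baseline
    print(f"{volume / 1e9:.0f}B: ${baseline:,.0f} -> ${routed:,.0f} ({saving:.0%} saving)")
```

At 10B tokens this gives the ~$47,500 all-flagship and ~$10,640 aggressive-routing totals. Changing `downgrade_share` and `cheap_rates` lets you test the mix you actually measure on your workload rather than the illustrative 65% and 80% points.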

What routing actually costs

A router at this scale needs to handle:

  • Scoring latency: The router has to decide which model to use before the first token goes out. At high QPS, this decision has to be invisible to the end user. A good router does this with a lightweight classifier, not a second LLM call.
  • Routing overhead: A gateway at 10B tokens/month is handling millions of requests per month. That is a real infrastructure load.
  • Fallback orchestration: At scale, model endpoints go down regularly. The router needs circuit breakers and fallback paths.
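The fallback requirement can be sketched as a per-endpoint circuit breaker plus an ordered list of endpoints. This is a minimal single-threaded illustration, not a production design: the endpoint callables, thresholds, and cooldown are placeholders, and a real gateway would also need timeouts, concurrency control, and provider-specific error handling.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; lets traffic probe again after `cooldown` seconds."""

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed (endpoint usable)

    def available(self):
        if self.opened_at is None:
            return True
        # After the cooldown, allow calls through again to probe the endpoint.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # open (or re-open) the breaker


def call_with_fallback(prompt, endpoints, breakers):
    """Try endpoints in insertion order, skipping any whose breaker is open.

    `endpoints` maps a name to a callable(prompt) -> str (placeholders here).
    """
    last_error = None
    for name, call in endpoints.items():
        breaker = breakers[name]
        if not breaker.available():
            continue
        try:
            result = call(prompt)
        except Exception as exc:  # a real router would catch narrower error types
            breaker.record_failure()
            last_error = exc
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("all endpoints unavailable") from last_error
```

The design choice worth noting is that a tripped breaker skips the failing endpoint entirely, so users stop paying the latency of a doomed first attempt while the provider recovers.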

For a managed router like HiWay2LLM, you pay a degressive markup on your API spend (not a per-request flat fee) that covers all of this. The markup decreases at each volume tier (12.5% below $500/mo, 11% at $500-$5K/mo, 10% at $5K-$20K/mo, Enterprise negotiated above). Crucially, the markup amount drops proportionally as smart routing cuts your provider bill - so the gateway's cost to you is always a small, shrinking fraction of the savings it generates.
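As a rough illustration of the tiered markup, the sketch below applies the published percentages to a monthly provider bill. It rests on an assumption the text does not confirm: that the rate of the tier containing your total spend applies to the whole spend (flat tiers, not marginal brackets). Boundary handling at exactly $500, $5K, and $20K is likewise a guess.

```python
# Tiers from the text: (upper bound of monthly API spend in USD, markup rate).
# Spend of $20K/month and above is "Enterprise negotiated", so no rate is modeled.
TIERS = [(500, 0.125), (5_000, 0.11), (20_000, 0.10)]


def markup_fee(monthly_spend):
    """Gateway fee in USD, or None above the published tiers.

    ASSUMPTION: the rate of the tier containing the total spend applies to the
    whole spend. The source does not say whether tiers are flat or marginal.
    """
    for upper_bound, rate in TIERS:
        if monthly_spend < upper_bound:
            return monthly_spend * rate
    return None  # negotiated pricing, unknown here


for spend in (400, 3_000, 18_000, 9_000):
    print(f"${spend:,} provider spend -> fee ${markup_fee(spend):,.2f}")
```

Under this reading, halving the provider bill from $18,000 to $9,000 through routing also halves the fee, which is the proportionality the paragraph above describes.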

The prompt caching layer

At 50-100B tokens/month, a second optimization becomes meaningful: prompt caching.

If you have system prompts or large context windows that repeat across requests, prompt caching (Anthropic, OpenAI) caches those tokens server-side and charges 75-90% less for cache hits. On a RAG pipeline with a 10K-token knowledge base that's repeated in every request, the effective token cost for the repeated portion drops from $3/1M to $0.30/1M at the 90% end of that range (using the Claude Sonnet 4.6 input rate).

Combined with smart routing, this can push total savings past 85% on suitable workloads.
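To estimate what caching does to your input bill, a small model helps. The sketch below is a simplification: it ignores any cache-write surcharge, cache expiry, and minimum cacheable prefix sizes, all of which vary by provider. The parameter names are illustrative.

```python
def effective_input_rate(base_rate, cached_share, hit_rate, discount):
    """Blended USD per 1M input tokens with prompt caching.

    base_rate:    list price per 1M input tokens
    cached_share: fraction of input tokens that belong to the repeated prefix
    hit_rate:     fraction of requests where that prefix is served from cache
    discount:     price reduction on cache hits (0.90 means 90% cheaper)
    """
    discounted = cached_share * hit_rate  # share of tokens billed at the cache-hit price
    return base_rate * (discounted * (1 - discount) + (1 - discounted))


# The repeated portion alone, always hitting the cache, at a 90% discount.
print(f"${effective_input_rate(3.00, 1.0, 1.0, 0.90):.2f} per 1M")

# A more typical mix: 80% of input is the shared prefix, 95% cache hit rate.
print(f"${effective_input_rate(3.00, 0.8, 0.95, 0.90):.3f} per 1M")
```

The first call reproduces the $3 to $0.30 drop described above. The second shows why the cache hit rate and the share of repeated context matter as much as the headline discount.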

The BYOK difference at scale

Here is the part that matters most at 50-100B tokens/month: are you paying wholesale or retail?

A reseller gateway (OpenRouter, etc.) buys inference at wholesale and sells it to you at a markup. At 100B tokens/month, even a 5% markup on the ~$475,000 all-flagship bill is $23,750/month in fees - $285,000/year - on top of the actual inference cost.

A BYOK gateway (HiWay) routes your calls through your own provider accounts at wholesale rates. You pay providers directly. The gateway charges a degressive markup on your API spend - not a per-token reseller cut. At scale, this is the structural difference that makes the billing math work: when smart routing cuts your provider bill, the markup amount shrinks proportionally too.
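The reseller figure quoted above is a one-line calculation, shown here so you can substitute your own bill and markup. The 5% rate is the illustrative number from the text, not a quoted price from any specific gateway.

```python
def reseller_fee(monthly_inference_cost, markup_rate=0.05):
    """Monthly fee in USD when a reseller adds a percentage on top of inference cost."""
    return monthly_inference_cost * markup_rate


monthly_bill = 475_000  # all-flagship list cost at 100B tokens/month
fee = reseller_fee(monthly_bill)
print(f"${fee:,.0f}/month, ${fee * 12:,.0f}/year")
```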

What to measure when evaluating a router

  1. Actual downgrade rate on your workload: Run a sample of your real requests through the router's classifier. What fraction would it downgrade? At what quality hit?
  2. Latency overhead: End-to-end latency with routing enabled vs direct API calls. A well-designed router adds negligible overhead.
  3. Fallback hit rate: How often does the primary model fail and a fallback fires?
  4. Cache hit rate: If you have repeated context, what's the cache hit rate after warm-up?
  5. Effective blended rate: Total token cost / total tokens, after routing and caching.
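
Several of these metrics fall out of a simple usage log. The sketch below assumes a hypothetical record shape (the field names are illustrative, not any gateway's export format) and computes the downgrade, fallback, and cache hit rates plus the effective blended rate. Latency overhead is left out because it needs paired timings against direct API calls.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:  # hypothetical log row; field names are illustrative
    tier: str                    # "flagship" or a cheaper tier such as "mini"
    tokens: int                  # input + output tokens billed
    cost_usd: float              # what the provider charged for the request
    fallback_used: bool = False  # True if the primary model failed
    cache_hit: bool = False      # True if repeated context came from cache


def router_metrics(records):
    """Summarize per-request rates and the effective blended rate (USD per 1M tokens)."""
    count = len(records)
    total_tokens = sum(r.tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "downgrade_rate": sum(r.tier != "flagship" for r in records) / count,
        "fallback_hit_rate": sum(r.fallback_used for r in records) / count,
        "cache_hit_rate": sum(r.cache_hit for r in records) / count,
        "blended_rate_per_1m": total_cost / total_tokens * 1_000_000,
    }


sample = [
    UsageRecord("flagship", 1_000_000, 4.75),
    UsageRecord("mini", 3_000_000, 0.855, cache_hit=True),
]
print(router_metrics(sample))
```

Note that the rates here are per request while the blended rate is per token; on real traffic the two can diverge sharply if flagship requests are much longer than downgraded ones.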