The Hidden Math of LLM Pricing
Why Your Bill is 3x What You Think
Providers advertise $3/M tokens. You pay $8/M effective. Here's where the delta hides: system prompts, reasoning tokens, retry loops, failed caching. And here's how to measure it.
Here's a number that should bother you: the sticker price on LLM API pricing pages is almost never what you actually pay per useful output.
Claude Sonnet 4.6 is listed at $3 per million input tokens. Last month, we instrumented a mid-sized RAG app and measured the effective rate: $8.40 per million input tokens, once total spend is divided across the output users actually saw. That's 2.8× the sticker price. And this was a well-written app by a team that thought they knew what they were doing.
If you're building with LLMs and haven't run this number on your own stack, your bill is almost certainly bigger than you think. Here's exactly where the money leaks.
The Sticker Price Lie
When a provider tells you "$3 per million input tokens," that number is accurate, but only for the raw API call. The problem is that "tokens consumed" and "tokens that produced user value" are two wildly different numbers.
Six multipliers compound quietly between them:
- System prompt repetition
- Context accumulation in multi-turn conversations
- Reasoning tokens you can't see
- Failed prompt caching
- Retry loops on transient errors
- Tool use feedback expansion
Each one looks small in isolation. Stack them together and you get 2-5× your "expected" bill. Let's break them down.
1. The System Prompt Tax
Every single call you make ships your system prompt again. If your system prompt is 2,000 tokens (realistic, one carefully-engineered persona plus instructions plus few-shot examples), and your user message is 50 tokens, 97.5% of your input is the system prompt.
The user sent 50 tokens. You're billed for 2,050.
A chat app with a 2K system prompt and an average 100-token user turn spends 95% of its input budget on the system prompt. Every turn. Forever.
Prompt caching solves this, but only if you use it correctly (more on that in section 4).
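The overhead is easy to compute for your own prompts. A minimal sketch (the token counts are the illustrative figures from above, not measurements):

```python
# Back-of-envelope: what share of billed input is just the system prompt?
def system_prompt_fraction(system_tokens: int, user_tokens: int) -> float:
    """Fraction of billed input tokens consumed by the repeated system prompt."""
    return system_tokens / (system_tokens + user_tokens)

# 2,000-token system prompt, 50-token user message
print(round(system_prompt_fraction(2000, 50), 3))   # 0.976
# same prompt, 100-token average chat turn
print(round(system_prompt_fraction(2000, 100), 3))  # 0.952
```

Run it against your real prompt sizes; the fraction is usually worse than people guess.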
2. The Conversation Accumulator
Multi-turn chat is a compound interest problem. Turn 1 costs X. Turn 2 costs X + (turn 1). Turn 3 costs X + (turn 1) + (turn 2). By turn 10 of a modest conversation, you're paying for the same early tokens ten times.
Real numbers from one of our chatbots:
| Turn | Cumulative input tokens | Input tokens billed |
|---|---|---|
| 1 | 2,150 | 2,150 |
| 5 | 8,400 | 8,400 |
| 10 | 19,200 | 19,200 |
| 20 | 52,000 | 52,000 |
| 30 | 98,000 | 98,000 |
Turn 30 costs 45× what turn 1 did. Same user. Same kind of question. The early turns of the conversation are paid for 30 times by the time we hit turn 30.
Most teams never cap conversation length, because the UX penalty is visible and the cost penalty is invisible until the monthly bill.
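The compounding is a few lines to model. This toy version counts a fixed 150 tokens per turn and ignores model outputs, so its numbers run smaller than the table's, but the shape of the curve is the same:

```python
# Minimal model of multi-turn billing: every turn re-sends the system
# prompt plus the whole transcript so far. Turn sizes are assumptions.
def billed_input_per_turn(system_tokens: int, turn_tokens: list[int]) -> list[int]:
    """Input tokens billed at each turn: system prompt + all history + this turn."""
    billed, history = [], 0
    for t in turn_tokens:
        history += t
        billed.append(system_tokens + history)
    return billed

per_turn = billed_input_per_turn(2000, [150] * 10)  # ten 150-token turns
print(per_turn[0], per_turn[-1], sum(per_turn))     # 2150 3500 28250
```

Even in this stripped-down model, ten turns bill 28,250 input tokens to carry 1,500 tokens of actual conversation.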
3. The Invisible Reasoning Tokens
This one got us the worst.
Claude's extended thinking and OpenAI's o1-style reasoning models produce tokens that you never see but you always pay for. These are the model's internal scratchpad, the "thinking" it does before writing the answer.
For a complex prompt with extended thinking enabled, the reasoning tokens can be 3-5× the visible output tokens. You ask a 200-token question, the model produces 400 tokens of answer, and you've just paid for 2,000 tokens of reasoning you never read.
```python
# What your dashboard shows
input_tokens = 200
output_tokens = 400
# "Cost: $0.0066"

# What you actually paid
input_tokens = 200
reasoning_tokens = 1800  # you can't see these
output_tokens = 400
# "Actual cost: $0.0336"
```
Reasoning tokens are billed at the output rate ($15/M for Sonnet-class models). They compound with everything else in this list. Flip on extended thinking without measuring the delta, and your bill can quietly double or worse.
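The delta is easy to compute, assuming Sonnet-class list prices ($3/M in, $15/M out) and that reasoning tokens bill at the output rate:

```python
# Cost of one call, with and without invisible reasoning tokens.
IN_RATE, OUT_RATE = 3 / 1e6, 15 / 1e6  # Sonnet-class list prices, $/token

def call_cost(input_toks: int, output_toks: int, reasoning_toks: int = 0) -> float:
    # reasoning tokens bill at the output rate
    return input_toks * IN_RATE + (output_toks + reasoning_toks) * OUT_RATE

dashboard = call_cost(200, 400)        # what the token counts imply
actual = call_cost(200, 400, 1800)     # with the hidden scratchpad
print(f"${dashboard:.4f} vs ${actual:.4f}")  # $0.0066 vs $0.0336
```

A 5× gap on a single call, and none of it visible in the output you read.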
4. Prompt Caching That Doesn't Cache
Prompt caching is the single biggest cost lever most teams don't use properly. Anthropic's cache reads are 10% of normal input price. A 90% discount, ignored.
Why? Three reasons we see constantly:
- Cache misses from drift. Change one character in the system prompt (a version bump, a timestamp, a user-specific field inserted at the top) and the cache invalidates. Teams don't realize their cache hit rate is 20% instead of the 95% they assumed.
- Too-short TTL. The cache expires in 5 minutes. A chatbot with sporadic usage never gets a hit.
- Wrong cache breakpoints. You can place cache markers manually, but most teams don't, so the SDK picks defaults that aren't optimal for their workload.
A simple fix, putting user-specific fields at the end of the prompt instead of the top, can take you from 20% to 90% cache hit rate. That's a 4-5× cost reduction on its own, and it requires zero code rewriting beyond prompt ordering.
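Here's roughly what the reordering looks like with Anthropic-style cache breakpoints. The payload shape follows the Messages API's `cache_control` field; the model id, prompt text, and helper are placeholders, and this sketch only builds the request body rather than sending it:

```python
# Stable content (persona, instructions, few-shot examples) goes BEFORE
# the cache breakpoint; per-user fields go after it, so they can never
# invalidate the cached prefix.
STABLE_SYSTEM = "You are a support agent. <persona, instructions, few-shot examples>"

def build_request(user_name: str, user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            # Stable prefix: byte-identical on every call -> cache reads hit.
            {"type": "text", "text": STABLE_SYSTEM,
             "cache_control": {"type": "ephemeral"}},
            # Volatile suffix: user-specific data AFTER the breakpoint.
            {"type": "text", "text": f"Current user: {user_name}"},
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("Ada", "Where is my order?")
# The cached prefix is identical across users; only the suffix varies.
print(req["system"][0]["text"] == build_request("Bob", "Refund?")["system"][0]["text"])
```

The point of the structure: nothing above the `cache_control` marker ever changes between calls, so the expensive prefix is written once and read at the discounted rate thereafter.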
5. The Retry Tax
LLM APIs fail. Rate limits, 529s, timeouts, content filter flags. Most SDKs retry automatically with exponential backoff.
Every retry is billed.
A call that succeeds on the third attempt costs you 3× what a clean call costs. Most teams don't track retry rate per endpoint. In one case we diagnosed, a team's "occasional Anthropic outages" were actually 8% of their calls retrying twice, a silent 16% cost overhead they never saw in logs, only in the bill.
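The overhead arithmetic is worth automating. A sketch, with the 8%-retrying-twice case from above as the assumed input:

```python
# How a modest retry rate becomes a silent cost multiplier.
def retry_cost_multiplier(attempt_counts: list[int]) -> float:
    """Total billed attempts divided by the number of logical calls."""
    return sum(attempt_counts) / len(attempt_counts)

# 8% of calls succeed on the third attempt, the rest succeed first try
calls = [3] * 8 + [1] * 92
print(retry_cost_multiplier(calls))  # 1.16 -> a 16% overhead
```

Track attempt counts per logical call, not just error logs, and this number falls out for free.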
6. The Tool Use Explosion
Function calling, "tool use" in Claude parlance, is where costs really get weird. Each tool call is a round trip: the model generates a tool request, you execute it, you send the result back as a new message, the model sees it and generates the next step.
Every round trip ships the full conversation again. An agent that makes 5 tool calls to answer one question pays for its entire context 5 times, plus tool results.
```
User question (200 tok)
  → LLM thinks + requests tool A (200 tok input, 400 tok output)
  → Tool A result appended (300 tok)
  → LLM reads full context + requests tool B (700 tok input, 200 tok output)
  → Tool B result appended (500 tok)
  → LLM reads full context + writes answer (1,400 tok input, 300 tok output)
```
What the user sees: one question, one answer. What you paid for: 2,300 tokens of input + 900 tokens of output on a question that was originally 200 tokens long.
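Re-running the arithmetic on the trace above:

```python
# Each tuple is (input tokens, output tokens) for one model call,
# using the example numbers from the trace.
round_trips = [
    (200, 400),    # question in, tool A request out
    (700, 200),    # context + tool A result in, tool B request out
    (1400, 300),   # full context in, final answer out
]
total_in = sum(i for i, _ in round_trips)
total_out = sum(o for _, o in round_trips)
print(total_in, total_out)  # 2300 900
```

Three model calls to answer one 200-token question, with the context re-billed on every trip.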
Doing the Real Math
Let's apply all six multipliers to a realistic scenario: a customer support chatbot, Claude Sonnet 4.6, 10,000 conversations a day.
Naive calculation:
- Average conversation: 4 turns, 500 tokens each → 2,000 input tokens, 400 output
- Daily cost: 10,000 × (2,000 × $3/M + 400 × $15/M) = $120/day
Actual cost with all multipliers:
- System prompt: 1,500 tokens × 4 turns = 6,000 repeated tokens
- Conversation accumulation: +40% input tokens by turn 4
- Cache hit rate: 30% (unoptimized) → blended input rate ≈ $2.19/M ($0.30/M on cache reads, $3/M on misses)
- Retry rate: 5% of calls retry once
- Tool use (2 tools avg per conversation): +60% input tokens
Revised daily cost: $312/day. That's 2.6× the naive estimate.
Monthly delta: $5,760 more than budgeted. For a team that thought they had this figured out.
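One way to stack the multipliers in code. The individual factors are the estimates above, but how they compose (multiplicative on input, retries on the total) is my assumption, so treat the output as illustrative rather than a reproduction of the $312 figure:

```python
# Stacking the multipliers. Rates are Sonnet-class list prices.
IN_RATE, OUT_RATE = 3 / 1e6, 15 / 1e6

def daily_cost(convs, conv_in, out_toks, sys_toks=0, accumulation=1.0,
               tool_use=1.0, cache_blend=1.0, retry=1.0):
    # Conversation tokens compound and expand with tool use; the
    # system-prompt portion bills at a cache-blended fraction of list price.
    eff_in = conv_in * accumulation * tool_use + sys_toks * cache_blend
    return convs * retry * (eff_in * IN_RATE + out_toks * OUT_RATE)

naive = daily_cost(10_000, 2_000, 400)
real = daily_cost(10_000, 2_000, 400,
                  sys_toks=6_000,     # 1,500-token system prompt x 4 turns
                  accumulation=1.4,   # +40% input by turn 4
                  tool_use=1.6,       # +60% from tool round trips
                  cache_blend=0.73,   # 30% hit rate at a 90% discount
                  retry=1.05)         # 5% of calls retry once
print(round(naive), round(real))     # 120 342
```

Under these composition assumptions the model lands at ~$342/day, the same ballpark as the $312 estimate; either way, the naive $120 figure is off by well over 2×.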
How To Actually Measure This
You can't fix what you don't measure. Three instrumentations I'd argue every LLM-heavy app should have:
1. Effective cost per conversation, not per token. Divide your total spend by the number of user-visible conversations completed. That's your real number.
2. Cache hit rate, broken down by endpoint. If it's below 80% on a high-traffic endpoint, you have a 3-5× cost leak that's easy to fix with prompt reordering.
3. Retry rate by error code. 529s and 429s are different problems, and they often indicate you're on the wrong tier, not that the provider is flaky.
If you're not tracking these, your bill is being decided by patterns you can't see.
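All three metrics fall out of a per-call log. The log schema below is an assumption about your own instrumentation, not any provider's export format:

```python
from collections import Counter

# Toy per-call log: one dict per API call, with made-up example values.
calls = [
    {"conv": 1, "endpoint": "chat",   "cost_usd": 0.020, "cache_hit": True,  "retries": 0, "error": None},
    {"conv": 1, "endpoint": "chat",   "cost_usd": 0.031, "cache_hit": False, "retries": 1, "error": "529"},
    {"conv": 2, "endpoint": "chat",   "cost_usd": 0.024, "cache_hit": True,  "retries": 0, "error": None},
    {"conv": 2, "endpoint": "search", "cost_usd": 0.012, "cache_hit": False, "retries": 2, "error": "429"},
]

# 1. Effective cost per conversation, not per token
conversations = len({c["conv"] for c in calls})
cost_per_conv = sum(c["cost_usd"] for c in calls) / conversations

# 2. Cache hit rate (break this down by endpoint in a real system)
cache_hit_rate = sum(c["cache_hit"] for c in calls) / len(calls)

# 3. Retry counts by error code
retries_by_code = Counter(c["error"] for c in calls if c["retries"])

print(round(cost_per_conv, 4), cache_hit_rate, dict(retries_by_code))
```

Twenty lines of aggregation, and the three numbers that actually explain your bill are on a dashboard.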
The Routing Angle
Once you can see the real cost per conversation, the next question becomes: was the model I used actually required?
This is where smart routing pays off. Most conversations (greetings, short factual questions, simple lookups) don't need the top-tier model. A router that sends those to Haiku while keeping complex reasoning on Sonnet/Opus can cut the effective bill by 50-70% without the user noticing.
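A toy heuristic router, purely illustrative: the cue words, the length threshold, and the tier names are assumptions, not a production policy (real routers typically use a small classifier model instead).

```python
# Route cheap traffic to a cheap tier; keep reasoning-heavy prompts on
# the big model. Thresholds and cues are illustrative assumptions.
def pick_model(message: str) -> str:
    words = message.split()
    reasoning_cues = ("why", "explain", "compare", "analyze", "debug")
    if len(words) > 30 or any(w.lower().strip("?,.") in reasoning_cues for w in words):
        return "sonnet"   # complex / reasoning-heavy
    return "haiku"        # greetings, short lookups

print(pick_model("Hello!"))                        # haiku
print(pick_model("Explain why the build fails"))   # sonnet
```

Even a crude split like this moves the bulk of traffic to a model that costs a fraction of the flagship rate.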
The hidden math isn't inevitable. It's just hidden.
Next in the series: why bringing your own API keys (BYOK) is the cleanest fix for the pricing opacity problem.