The Hidden Math of LLM Pricing
Why Your Bill is 3x What You Think
Providers advertise $3/M tokens. You pay $8/M effective. Here's where the delta hides: system prompts, reasoning tokens, retry loops, failed caching. And here's how to measure it.
Here's a number that should bother you: the sticker price on LLM API pricing pages is almost never what you actually pay per useful output.
Claude Sonnet 4.6 is listed at $3 per million input tokens. Last month, we instrumented a mid-sized RAG app and measured the effective rate: $8.40 per million input tokens, once total spend is divided across the output users actually saw. That's 2.8× the sticker price. And this was a well-written app by a team that thought they knew what they were doing.
If you're building with LLMs and haven't run this number on your own stack, your bill is almost certainly bigger than you think. Here's exactly where the money leaks.
The Sticker Price Lie
When a provider tells you "$3 per million input tokens," that number is accurate, but only for the raw API call. The problem is that "tokens consumed" and "tokens that produced user value" are two wildly different numbers.
Six multipliers compound quietly between them:
- System prompt repetition
- Context accumulation in multi-turn conversations
- Reasoning tokens you can't see
- Failed prompt caching
- Retry loops on transient errors
- Tool use feedback expansion
Each one looks small in isolation. Stack them together and you get 2-5× your "expected" bill. Let's break them down.
1. The System Prompt Tax
Every single call you make ships your system prompt again. If your system prompt is 2,000 tokens (realistic, one carefully-engineered persona plus instructions plus few-shot examples), and your user message is 50 tokens, 97.5% of your input is the system prompt.
The user sent 50 tokens. You're billed for 2,050.
A chat app with a 2K system prompt and an average 100-token user turn spends 95% of its input budget on the system prompt. Every turn. Forever.
Prompt caching solves this, but only if you use it correctly (more on that in section 4).
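The overhead is easy to compute for your own prompts. A minimal sketch (the token counts are the illustrative figures from above, not measurements):

```python
# Back-of-envelope: what share of billed input is just the system prompt?
def system_prompt_fraction(system_tokens: int, user_tokens: int) -> float:
    """Fraction of billed input tokens consumed by the repeated system prompt."""
    return system_tokens / (system_tokens + user_tokens)

# 2,000-token system prompt, 50-token user message
print(round(system_prompt_fraction(2000, 50), 3))   # 0.976
# same prompt, 100-token average chat turn
print(round(system_prompt_fraction(2000, 100), 3))  # 0.952
```

Run it against your real prompt sizes; the fraction is usually worse than people guess.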
2. The Conversation Accumulator
Multi-turn chat is a compound interest problem. Turn 1 costs X. Turn 2 costs X + (turn 1). Turn 3 costs X + (turn 1) + (turn 2). By turn 10 of a modest conversation, you're paying for the same early tokens ten times.
Real numbers from one of our chatbots:
| Turn | Cumulative input tokens | Input tokens billed |
|---|---|---|
| 1 | 2,150 | 2,150 |
| 5 | 8,400 | 8,400 |
| 10 | 19,200 | 19,200 |
| 20 | 52,000 | 52,000 |
| 30 | 98,000 | 98,000 |
Turn 30 costs 45× what turn 1 did. Same user. Same kind of question. The early turns of the conversation are paid for 30 times by the time we hit turn 30.
Most teams never cap conversation length, because the UX penalty is visible and the cost penalty is invisible until the monthly bill.
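The compounding is a few lines to model. This toy version counts a fixed 150 tokens per turn and ignores model outputs, so its numbers run smaller than the table's, but the shape of the curve is the same:

```python
# Minimal model of multi-turn billing: every turn re-sends the system
# prompt plus the whole transcript so far. Turn sizes are assumptions.
def billed_input_per_turn(system_tokens: int, turn_tokens: list[int]) -> list[int]:
    """Input tokens billed at each turn: system prompt + all history + this turn."""
    billed, history = [], 0
    for t in turn_tokens:
        history += t
        billed.append(system_tokens + history)
    return billed

per_turn = billed_input_per_turn(2000, [150] * 10)  # ten 150-token turns
print(per_turn[0], per_turn[-1], sum(per_turn))     # 2150 3500 28250
```

Even in this stripped-down model, ten turns bill 28,250 input tokens to carry 1,500 tokens of actual conversation.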
3. The Invisible Reasoning Tokens
This one got us the worst.
Claude's extended thinking and OpenAI's o1-style reasoning models produce tokens that you never see but you always pay for. These are the model's internal scratchpad, the "thinking" it does before writing the answer.
For a complex prompt with extended thinking enabled, the reasoning tokens can be 3-5× the visible output tokens. You ask a 200-token question, the model produces 400 tokens of answer, and you've just paid for 2,000 tokens of reasoning you never read.
```python
# What your dashboard shows
input_tokens = 200
output_tokens = 400
# "Cost: $0.0066"

# What you actually paid
input_tokens = 200
reasoning_tokens = 1800  # you can't see these
output_tokens = 400
# "Actual cost: $0.0336"
```
Reasoning tokens are billed at the output rate ($15/M for Sonnet-class models). They compound with everything else in this list. Flip on extended thinking without measuring the delta, and your bill can quietly double or worse.
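The delta is easy to compute, assuming Sonnet-class list prices ($3/M in, $15/M out) and that reasoning tokens bill at the output rate:

```python
# Cost of one call, with and without invisible reasoning tokens.
IN_RATE, OUT_RATE = 3 / 1e6, 15 / 1e6  # Sonnet-class list prices, $/token

def call_cost(input_toks: int, output_toks: int, reasoning_toks: int = 0) -> float:
    # reasoning tokens bill at the output rate
    return input_toks * IN_RATE + (output_toks + reasoning_toks) * OUT_RATE

dashboard = call_cost(200, 400)        # what the token counts imply
actual = call_cost(200, 400, 1800)     # with the hidden scratchpad
print(f"${dashboard:.4f} vs ${actual:.4f}")  # $0.0066 vs $0.0336
```

A 5× gap on a single call, and none of it visible in the output you read.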
4. Prompt Caching That Doesn't Cache
Prompt caching is the single biggest cost lever most teams don't use properly. Anthropic's cache reads are 10% of normal input price. A 90% discount, ignored.
Why? Three reasons we see constantly:
- Cache misses from drift. Change one character in the system prompt (a version bump, a timestamp, a user-specific field inserted at the top) and the cache invalidates. Teams don't realize their cache hit rate is 20% instead of the 95% they assumed.
- Too-short TTL. The cache expires in 5 minutes. A chatbot with sporadic usage never gets a hit.
- Wrong cache breakpoints. You can place cache markers manually, but most teams don't, so the SDK picks defaults that aren't optimal for their workload.
A simple fix, putting user-specific fields at the end of the prompt instead of the top, can take you from 20% to 90% cache hit rate. That's a 4-5× cost reduction on its own, and it requires zero code rewriting beyond prompt ordering.
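Here's roughly what the reordering looks like with Anthropic-style cache breakpoints. The payload shape follows the Messages API's `cache_control` field; the model id, prompt text, and helper are placeholders, and this sketch only builds the request body rather than sending it:

```python
# Stable content (persona, instructions, few-shot examples) goes BEFORE
# the cache breakpoint; per-user fields go after it, so they can never
# invalidate the cached prefix.
STABLE_SYSTEM = "You are a support agent. <persona, instructions, few-shot examples>"

def build_request(user_name: str, user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            # Stable prefix: byte-identical on every call -> cache reads hit.
            {"type": "text", "text": STABLE_SYSTEM,
             "cache_control": {"type": "ephemeral"}},
            # Volatile suffix: user-specific data AFTER the breakpoint.
            {"type": "text", "text": f"Current user: {user_name}"},
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("Ada", "Where is my order?")
# The cached prefix is identical across users; only the suffix varies.
print(req["system"][0]["text"] == build_request("Bob", "Refund?")["system"][0]["text"])
```

The point of the structure: nothing above the `cache_control` marker ever changes between calls, so the expensive prefix is written once and read at the discounted rate thereafter.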
5. The Retry Tax
LLM APIs fail. Rate limits, 529s, timeouts, content filter flags. Most SDKs retry automatically with exponential backoff.
Every retry is billed.
A call that succeeds on the third attempt costs you 3× what a clean call costs. Most teams don't track retry rate per endpoint. In one case we diagnosed, a team's "occasional Anthropic outages" were actually 8% of their calls retrying twice, a silent 16% cost overhead they never saw in logs, only in the bill.
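The overhead arithmetic is worth automating. A sketch, with the 8%-retrying-twice case from above as the assumed input:

```python
# How a modest retry rate becomes a silent cost multiplier.
def retry_cost_multiplier(attempt_counts: list[int]) -> float:
    """Total billed attempts divided by the number of logical calls."""
    return sum(attempt_counts) / len(attempt_counts)

# 8% of calls succeed on the third attempt, the rest succeed first try
calls = [3] * 8 + [1] * 92
print(retry_cost_multiplier(calls))  # 1.16 -> a 16% overhead
```

Track attempt counts per logical call, not just error logs, and this number falls out for free.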
6. The Tool Use Explosion
Function calling, "tool use" in Claude parlance, is where costs really get weird. Each tool call is a round trip: the model generates a tool request, you execute it, you send the result back as a new message, the model sees it and generates the next step.
Every round trip ships the full conversation again. An agent that makes 5 tool calls to answer one question pays for its entire context 5 times, plus tool results.
```
User question (200 tok)
  → LLM thinks + requests tool A (200 tok input, 400 tok output)
  → Tool A result appended (300 tok)
  → LLM reads full context + requests tool B (700 tok input, 200 tok output)
  → Tool B result appended (500 tok)
  → LLM reads full context + writes answer (1,400 tok input, 300 tok output)
```
What the user sees: one question, one answer. What you paid for: 2,300 tokens of input + 900 tokens of output on a question that was originally 200 tokens long.
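Re-running the arithmetic on the trace above:

```python
# Each tuple is (input tokens, output tokens) for one model call,
# using the example numbers from the trace.
round_trips = [
    (200, 400),    # question in, tool A request out
    (700, 200),    # context + tool A result in, tool B request out
    (1400, 300),   # full context in, final answer out
]
total_in = sum(i for i, _ in round_trips)
total_out = sum(o for _, o in round_trips)
print(total_in, total_out)  # 2300 900
```

Three model calls to answer one 200-token question, with the context re-billed on every trip.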
Doing the Real Math
Let's apply all six multipliers to a realistic scenario: a customer support chatbot, Claude Sonnet 4.6, 10,000 conversations a day.
Naive calculation:
- Average conversation: 4 turns, 500 tokens each → 2,000 input tokens, 400 output
- Daily cost: 10,000 × (2,000 × $3/M + 400 × $15/M) = $120/day
Actual cost with all multipliers:
- System prompt: 1,500 tokens × 4 turns = 6,000 repeated tokens
- Conversation accumulation: +40% input tokens by turn 4
- Cache hit rate: 30% (unoptimized) → blended input rate ≈ $2.19/M ($0.30/M on cache reads, $3/M on misses)
- Retry rate: 5% of calls retry once
- Tool use (2 tools avg per conversation): +60% input tokens
Revised daily cost: $312/day. That's 2.6× the naive estimate.
Monthly delta: $5,760 more than budgeted. For a team that thought they had this figured out.
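One way to stack the multipliers in code. The individual factors are the estimates above, but how they compose (multiplicative on input, retries on the total) is my assumption, so treat the output as illustrative rather than a reproduction of the $312 figure:

```python
# Stacking the multipliers. Rates are Sonnet-class list prices.
IN_RATE, OUT_RATE = 3 / 1e6, 15 / 1e6

def daily_cost(convs, conv_in, out_toks, sys_toks=0, accumulation=1.0,
               tool_use=1.0, cache_blend=1.0, retry=1.0):
    # Conversation tokens compound and expand with tool use; the
    # system-prompt portion bills at a cache-blended fraction of list price.
    eff_in = conv_in * accumulation * tool_use + sys_toks * cache_blend
    return convs * retry * (eff_in * IN_RATE + out_toks * OUT_RATE)

naive = daily_cost(10_000, 2_000, 400)
real = daily_cost(10_000, 2_000, 400,
                  sys_toks=6_000,     # 1,500-token system prompt x 4 turns
                  accumulation=1.4,   # +40% input by turn 4
                  tool_use=1.6,       # +60% from tool round trips
                  cache_blend=0.73,   # 30% hit rate at a 90% discount
                  retry=1.05)         # 5% of calls retry once
print(round(naive), round(real))     # 120 342
```

Under these composition assumptions the model lands at ~$342/day, the same ballpark as the $312 estimate; either way, the naive $120 figure is off by well over 2×.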
How To Actually Measure This
You can't fix what you don't measure. Three instrumentations I'd argue every LLM-heavy app should have:
1. Effective cost per conversation, not per token. Divide your total spend by the number of user-visible conversations completed. That's your real number.
2. Cache hit rate, broken down by endpoint. If it's below 80% on a high-traffic endpoint, you have a 3-5× cost leak that's easy to fix with prompt reordering.
3. Retry rate by error code. 529s and 429s are different problems, and they often indicate you're on the wrong tier, not that the provider is flaky.
If you're not tracking these, your bill is being decided by patterns you can't see.
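All three metrics fall out of a per-call log. The log schema below is an assumption about your own instrumentation, not any provider's export format:

```python
from collections import Counter

# Toy per-call log: one dict per API call, with made-up example values.
calls = [
    {"conv": 1, "endpoint": "chat",   "cost_usd": 0.020, "cache_hit": True,  "retries": 0, "error": None},
    {"conv": 1, "endpoint": "chat",   "cost_usd": 0.031, "cache_hit": False, "retries": 1, "error": "529"},
    {"conv": 2, "endpoint": "chat",   "cost_usd": 0.024, "cache_hit": True,  "retries": 0, "error": None},
    {"conv": 2, "endpoint": "search", "cost_usd": 0.012, "cache_hit": False, "retries": 2, "error": "429"},
]

# 1. Effective cost per conversation, not per token
conversations = len({c["conv"] for c in calls})
cost_per_conv = sum(c["cost_usd"] for c in calls) / conversations

# 2. Cache hit rate (break this down by endpoint in a real system)
cache_hit_rate = sum(c["cache_hit"] for c in calls) / len(calls)

# 3. Retry counts by error code
retries_by_code = Counter(c["error"] for c in calls if c["retries"])

print(round(cost_per_conv, 4), cache_hit_rate, dict(retries_by_code))
```

Twenty lines of aggregation, and the three numbers that actually explain your bill are on a dashboard.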
The Routing Angle
Once you can see the real cost per conversation, the next question becomes: was the model I used actually required?
This is where smart routing pays off. Most conversations (greetings, short factual questions, simple lookups) don't need the top-tier model. A router that sends those to Haiku while keeping complex reasoning on Sonnet/Opus can cut the effective bill by 50-70% without the user noticing.
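A toy heuristic router, purely illustrative: the cue words, the length threshold, and the tier names are assumptions, not a production policy (real routers typically use a small classifier model instead).

```python
# Route cheap traffic to a cheap tier; keep reasoning-heavy prompts on
# the big model. Thresholds and cues are illustrative assumptions.
def pick_model(message: str) -> str:
    words = message.split()
    reasoning_cues = ("why", "explain", "compare", "analyze", "debug")
    if len(words) > 30 or any(w.lower().strip("?,.") in reasoning_cues for w in words):
        return "sonnet"   # complex / reasoning-heavy
    return "haiku"        # greetings, short lookups

print(pick_model("Hello!"))                        # haiku
print(pick_model("Explain why the build fails"))   # sonnet
```

Even a crude split like this moves the bulk of traffic to a model that costs a fraction of the flagship rate.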
The hidden math isn't inevitable. It's just hidden.
Next in the series: why bringing your own API keys (BYOK) is the cleanest fix for the pricing opacity problem.