April 2026 · 7 min read · Johan Bretonneau

What Prompt Caching Actually Costs
And Why Your Hit Rate Is Probably 20%

Prompt caching is a 90% discount on repeated context, the single biggest cost lever in the Anthropic API. But most teams run with a 20% hit rate and never realize it. Here's how to measure yours and fix it.

If your LLM bill has Anthropic on it and you haven't manually instrumented your prompt cache hit rate, you are almost certainly leaving 3-5× cost reduction on the table. That's not a guess. It's what we see on every team we audit.

Prompt caching is the single biggest cost lever in the Anthropic API. Used correctly, it takes your repeated input tokens from full price to 10% of full price, a 90% discount on the portion of your prompt that doesn't change. Used incorrectly, as most teams use it, you pay the sticker price and wonder why your bill is so big.

Here's how it actually works, why your hit rate is worse than you think, and the specific changes that take hit rate from 20% to 90%.

How Prompt Caching Actually Works

When you mark a section of your prompt as cacheable, Anthropic hashes that section and stores its internal representation server-side for 5 minutes. If the exact same prefix shows up again within 5 minutes, the model doesn't re-process those tokens from scratch; it loads the cached state and continues from there.

Pricing changes dramatically:

  • Cache write: 1.25× normal input price (a one-time 25% premium)
  • Cache read: 0.1× normal input price (a 90% discount)
  • Cache TTL: 5 minutes (can be extended to 1 hour for a fee)

So the first call pays slightly more, and every subsequent call within 5 minutes pays 10% of input cost on the cached portion. If you're making many calls with the same system prompt, the math is brutal in your favor.
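A quick back-of-envelope sketch of that trade-off, assuming Sonnet-class list pricing of $3/M input tokens (the function and constants are illustrative, not an SDK API):

```python
BASE = 3.00 / 1_000_000  # $ per input token (assumed Sonnet-class list price)

def cost(tokens: int, calls: int) -> tuple[float, float]:
    """Cost of a stable prefix across `calls` requests: uncached vs cached."""
    uncached = tokens * BASE * calls
    # First call writes the cache at 1.25x; every later call reads at 0.1x.
    cached = tokens * BASE * (1.25 + 0.1 * (calls - 1))
    return uncached, cached

u, c = cost(2_000, 10)  # 2,000-token system prompt, 10 calls within 5 minutes
print(f"uncached ${u:.4f}  cached ${c:.4f}")  # → uncached $0.0600  cached $0.0129
```

At just ten calls, the cached path already costs about a fifth of the uncached one; the 25% write premium is recovered after the second call.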

Except there's a catch: cache hits happen only when the prefix is byte-for-byte identical. That's where most teams trip up.

Why Your Hit Rate Is Worse Than You Think

Three patterns break caching silently:

1. You Put Dynamic Fields at the Top

The single most common mistake. Your system prompt probably looks like:

You are a helpful assistant for Acme Corp.
Current user: user_abc123
Current date: 2026-04-23 14:17:02
Session: sess_9f8a...

Your job is to help users with:
- Account questions
- Billing
- Technical issues
[... 1,500 more tokens of instructions ...]

Every field above the instructions is dynamic. User ID, timestamp, session ID. The moment any of those changes, the hash changes, the cache invalidates, and you pay full price for the next 2,000 tokens.

Fix: put dynamic fields at the end of the message, or in a separate user message. Put the stable instructions first.

Your job is to help users with:
- Account questions
[... 1,500 tokens of stable instructions ...]

---
Current user: user_abc123
Current date: 2026-04-23 14:17:02
Session: sess_9f8a...

Now the first 1,500 tokens cache. On the second request with the same instructions, you pay $0.30/M instead of $3/M on those tokens.

2. You Version-Bump Your System Prompt Casually

Teams iterate on prompts constantly. Every new version invalidates every cached prefix. If you ship a prompt change three times a day, your cache never lives long enough to be useful.

Fix: batch prompt changes into weekly releases. Between releases, the prefix is stable and can cache. If you're in fast-iteration mode, accept that caching won't help, but then at least turn it off so you don't pay the write premium unnecessarily.

3. You Skipped Cache Breakpoints

Anthropic's API lets you mark up to four cache breakpoints per request. If you set none, nothing caches at all: caching is opt-in, block by block. And you probably have more static content than just the system prompt: few-shot examples, tool definitions, a knowledge base snippet. All of that can cache too, if you mark it.

response = client.messages.create(
    model=MODEL,
    max_tokens=1024,
    # cache_control goes on content blocks; the system prompt is a
    # top-level parameter, not a message with role "system"
    system=[{"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": [
        {"type": "text", "text": KNOWLEDGE_BASE, "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": FEW_SHOTS, "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": current_question},
    ]}],
)

Three cache breakpoints, all of them stable across requests. Only current_question isn't cached. Your effective per-request input cost collapses.

Measuring Your Actual Hit Rate

Anthropic returns cache statistics in every API response. They look like this:

{
  "usage": {
    "input_tokens": 52,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 2100,
    "output_tokens": 180
  }
}

Three numbers you should be tracking:

  • input_tokens, uncached input
  • cache_creation_input_tokens, paid 1.25× to write the cache
  • cache_read_input_tokens, paid 0.1× to read the cache

Your hit rate is:

hit_rate = cache_read_tokens / (cache_read_tokens + cache_creation_tokens + uncached_input)

If you're not logging these three fields per request, you are flying blind.
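A minimal helper for computing the hit rate from a response's usage block (the function name is ours; the field names are the ones Anthropic returns):

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache for one API response."""
    read = usage.get("cache_read_input_tokens", 0)
    write = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + write + uncached
    return read / total if total else 0.0

usage = {"input_tokens": 52, "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 2100, "output_tokens": 180}
print(f"{cache_hit_rate(usage):.1%}")  # → 97.6%
```

Log this per request and aggregate it per endpoint; a fleet-wide average hides the one route that's missing every time.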

A healthy hit rate on a chatbot or agent with a stable system prompt should be 85-95%. If yours is below 70%, one of the three patterns above is biting you. If it's below 30%, all three probably are.

What a 90% Hit Rate Looks Like Dollar-wise

Let's run the math on a concrete case: a customer support bot, Sonnet 4.6, 5,000 conversations a day, 4 turns per conversation, 2,000-token system prompt, 500-token few-shot block, 300-token user message per turn.

Without caching:

  • Input tokens per call: 2,800 (system + few-shots + user)
  • Daily input cost: 5,000 × 4 × 2,800 × $3/M = $168/day

With caching at 30% hit rate (typical):

  • Cached portion: 0.3 × 2,500 tokens (system + few-shots) × $0.30/M = $0.00023 per call
  • Uncached portion: 0.7 × 2,500 + 300 = 2,050 tokens × $3/M = $0.00615 per call
  • Daily: 5,000 × 4 × $0.00638 = $127/day

With caching at 90% hit rate (what you should have):

  • Cached portion: 0.9 × 2,500 × $0.30/M = $0.000675 per call
  • Uncached portion: 0.1 × 2,500 + 300 = 550 tokens × $3/M = $0.00165 per call
  • Daily: 5,000 × 4 × $0.00232 = $46/day

That's $81/day saved, or $2,430/month, just by going from 30% to 90% cache hit rate. No code rewrite beyond prompt reordering and adding breakpoints.
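The same arithmetic as a sketch, using the token counts and prices above (like the figures in the article, it ignores the small 1.25× write premium on misses):

```python
PRICE_IN = 3.00 / 1e6       # $/token, base input (assumed Sonnet-class price)
PRICE_CACHED = 0.30 / 1e6   # $/token, cache read (0.1x)

CACHEABLE = 2_500           # system prompt + few-shot block, tokens
DYNAMIC = 300               # per-turn user message, tokens
CALLS = 5_000 * 4           # conversations/day x turns per conversation

def daily_cost(hit_rate: float) -> float:
    """Daily input cost at a given cache hit rate on the cacheable prefix."""
    per_call = (hit_rate * CACHEABLE * PRICE_CACHED
                + ((1 - hit_rate) * CACHEABLE + DYNAMIC) * PRICE_IN)
    return per_call * CALLS

for hr in (0.0, 0.3, 0.9):
    print(f"{hr:.0%} hit rate: ${daily_cost(hr):.0f}/day")
```

Swapping in your own token counts and traffic takes a minute and tells you what fixing your hit rate is actually worth.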

The Gotchas Nobody Warns You About

A few weird edge cases we've run into:

System prompt size matters. Anthropic caches only if the marked section is at least 1,024 tokens (the minimum is higher on Haiku-class models). Small system prompts are never cached even if you mark them. If your system prompt is 800 tokens, pad it to 1,100 with stable instructions, or accept it won't cache.

Tool definitions are cacheable but easy to break. If you generate tool schemas dynamically (common with MCP or plugins), the JSON ordering matters for hashing. Sort your keys alphabetically or you'll get cache misses from key-order drift.
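One way to guard against key-order drift is to canonicalize schemas before sending them (an illustrative helper, not part of any SDK):

```python
import json

def canonical(schema: dict) -> str:
    """Serialize a tool schema deterministically so dict insertion
    order can't change the bytes, and therefore the cache prefix."""
    return json.dumps(schema, sort_keys=True, separators=(",", ":"))

# Same schema, built with different key insertion order:
a = {"name": "search", "input_schema": {"type": "object",
     "properties": {"q": {"type": "string"}}}}
b = {"input_schema": {"properties": {"q": {"type": "string"}},
     "type": "object"}, "name": "search"}

assert canonical(a) == canonical(b)
```

Run every dynamically generated schema through a step like this before it goes into the request, so identical tools always serialize identically.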

The 5-minute TTL is real. Sporadic traffic (a chatbot that fires once every 10 minutes) will miss every cache. For low-volume endpoints, consider the 1-hour TTL: the write costs 2× base input price instead of 1.25×, but the entry lives 12× longer, with breakeven at roughly 6 cache reads per hour.

Cache is per-region. If Anthropic fails over your calls to a different region mid-session, you lose the cache. Not frequent, but it explains mysterious cache-miss bursts when you see provider incidents.

What To Do Monday Morning

Three actions, in order of ROI:

  1. Log the cache stats on every response. If you're not seeing the three fields in your metrics today, you can't optimize blind. This is a 15-minute fix.
  2. Reorder your system prompt to put dynamic fields last. This alone often takes hit rate from 30% to 75%.
  3. Mark explicit cache breakpoints on few-shot examples, tool defs, and any large stable context blocks.

If you do those three things, your Anthropic bill likely drops 40-60% this week, with zero user-facing change.



This is the deep-dive companion to The Hidden Math of LLM Pricing, where cache failure is item #4 on the list of six hidden cost multipliers.
