Tokens Are the Wrong Unit
A Case for Per-Conversation Pricing
Every LLM provider prices by tokens, yet almost no customer knows what a token costs for their specific app. Why per-token pricing is the worst unit economics model in modern infrastructure, and what should replace it.
Imagine you walk into a restaurant and the menu lists prices by gram of food. Not by dish. Not by portion. By gram.
You don't know how much a steak weighs. You don't know how much the sauce adds. The waiter can't tell you how much dinner will cost until after you've eaten it, when they weigh your leftovers and do the subtraction.
That's what LLM pricing looks like. Tokens are grams. You're the diner. And the bill is always a surprise.
The Unit Problem
Every major LLM provider prices by tokens: $3 per million input tokens, $15 per million output tokens, variable reasoning token surcharges. This is how the API works, so it's how the pricing works.
The problem is that nobody building an app cares about tokens. What they care about:
- How much does a conversation cost?
- How much does a user cost per month?
- How much does running my product cost per transaction?
Converting token pricing into these answers requires math that depends on:
- The exact shape of your prompts (different apps = wildly different per-token shapes)
- Your cache hit rate (which varies weekly based on prompt changes)
- Your retry rate (which varies by provider health)
- Your tool use patterns (which expand token counts non-linearly)
- Your extended thinking usage (invisible reasoning tokens you still pay for)
None of these are predictable at signup. You discover them over months of running production traffic. By the time you can answer "how much does a conversation cost?", you've already paid for the answer.
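To make the math concrete, here is a back-of-envelope sketch of why conversation cost is so hard to predict from token rates alone. The prices ($3/M input, $15/M output) echo the example above; every shape number (turns, tokens per turn, cache hit rate, cache discount) is an illustrative assumption, not a measurement of any real app.

```python
def conversation_cost(turns, input_per_turn, output_per_turn,
                      cache_hit_rate=0.0, cached_discount=0.9,
                      input_price=3.0, output_price=15.0):
    """Estimate USD cost of one conversation under per-token pricing.

    Each turn re-sends the accumulated history as input, which is why
    cost grows superlinearly with turn count -- one of the reasons
    "price per conversation" is hard to eyeball from "price per token".
    """
    cost = 0.0
    history = 0  # tokens of accumulated context re-sent each turn
    for _ in range(turns):
        inp = history + input_per_turn
        # Cached input tokens are billed at a steep (assumed) discount.
        billed = inp * (1 - cache_hit_rate) \
               + inp * cache_hit_rate * (1 - cached_discount)
        cost += billed / 1e6 * input_price
        cost += output_per_turn / 1e6 * output_price
        history = inp + output_per_turn
    return cost

# Same app, two plausible traffic shapes -- cost differs by ~25x:
short = conversation_cost(turns=3, input_per_turn=400, output_per_turn=250)
long = conversation_cost(turns=30, input_per_turn=400, output_per_turn=250,
                         cache_hit_rate=0.6)
print(round(short, 4), round(long, 4))
```

The point of the sketch is the shape of the function, not the numbers: turn count enters quadratically through the re-sent history, and the cache term multiplies everything, so small changes in traffic mix move the per-conversation cost far more than any token price change does.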
Why Tokens Persist as the Unit
The honest reason: it's convenient for the provider.
Tokens are how the model actually processes text. Tokens are what the GPUs bill compute against. From Anthropic's or OpenAI's perspective, pricing by token is the natural abstraction. They're passing through their raw input unit.
But raw input units are rarely the right pricing unit for a customer. Electricity providers don't bill by electron count. Cloud providers bill by instance-hour, not by CPU cycle. Stripe charges per transaction, not per byte of API request. In every case, the industry matured toward a unit that corresponded to customer value, not to the provider's internal operations.
LLM pricing is stuck at the "electron count" stage. It'll mature. The question is whether customers wait or push for it.
What Per-Conversation Pricing Would Look Like
Imagine instead: "$0.04 per completed conversation, for typical customer-support traffic."
Suddenly:
- You can budget. 10,000 conversations × $0.04 = $400. Done.
- You can compare providers on apples-to-apples units.
- You can build your own product's pricing on top of this (charge your customer per conversation, mark up by whatever margin you like).
- You can detect anomalies. If a conversation suddenly costs $0.40, something is wrong with that conversation, and you see it immediately.
"Ah," you say, "but conversations vary wildly: a three-turn chat costs nothing like a 30-turn chat." Right. So you tier it. The SaaS playbook for this is thirty years old:
- Simple conversation ($0.02): <4 turns, <5K tokens total
- Standard ($0.05): <10 turns, <20K tokens
- Complex ($0.15): <30 turns, <80K tokens
- Enterprise ($0.50): anything beyond
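The tiering above is simple enough to sketch in a few lines. The thresholds and prices mirror the illustrative tiers in the list; a real implementation would derive the cutoffs from your own traffic distribution.

```python
# Tier table: (name, max_turns, max_tokens, price_usd).
# Values are the illustrative tiers from the text, not real rates.
TIERS = [
    ("simple",    4,  5_000, 0.02),
    ("standard", 10, 20_000, 0.05),
    ("complex",  30, 80_000, 0.15),
]
ENTERPRISE = ("enterprise", 0.50)  # anything beyond the last tier

def price_conversation(turns, total_tokens):
    """Classify a completed conversation into a tier and return (name, price)."""
    for name, max_turns, max_tokens, price in TIERS:
        if turns < max_turns and total_tokens < max_tokens:
            return name, price
    return ENTERPRISE

print(price_conversation(3, 4_200))    # → ('simple', 0.02)
print(price_conversation(12, 35_000))  # → ('complex', 0.15)
```

Note the classifier runs after the conversation completes, exactly like Stripe classifying a transaction after authorization: the customer sees a fixed menu price, and the variance lives inside the tier boundaries.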
Stripe does not charge you by byte. Stripe classifies transactions by type and prices accordingly. This is solved.
The Pushback
When I talk about this with technical founders, three objections come up repeatedly:
"Different conversations have wildly different costs." Yes, and that's why you have tiers. Stripe charges a different rate on international vs domestic cards. AWS charges a different rate on compute-optimized vs general-purpose instances. Variability is exactly what tiers exist to absorb.
"Token pricing is more transparent." This one is backwards. Token pricing is transparent to the provider, not to you. You can see exactly how many tokens the provider processed, but you can't easily translate that to "how much did this user session cost." Per-conversation pricing is opaque at the token level but transparent at the level you actually care about.
"My app is a weird edge case and wouldn't fit the tiers." Maybe. So negotiate a custom tier. Every infrastructure vendor has enterprise deals for unusual workloads. That's normal.
Where This Is Already Happening
A few providers are edging toward this model. Replicate charges per generation for image models. Perplexity has tiered API pricing for search queries. Some of the newer code-assistant products (Cursor, for example) effectively price by seat or session, bundling compute underneath.
The pattern is always: the commodity API prices by raw unit, the layer on top prices by customer-meaningful unit. Which is why the infrastructure layer exists.
The BYOK Angle
This is also why BYOK + flat-fee infrastructure is such a good match. The provider still charges you by token (because that's their model). But your middleware can abstract that into a per-conversation or per-seat fee that actually makes sense for your product.
Your infrastructure layer takes on the prediction burden. It says, "most of your conversations fit tier B, which we'll bill at $0.05 each. We'll eat the variance; you get predictability." That's a real service. That's what a flat-fee subscription is buying you.
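A minimal sketch of what that middleware does, under the assumptions in the text: the provider bills the customer per token (BYOK), while the layer meters raw usage per conversation and surfaces one flat tier-B fee. All class names, rates, and the $0.05 fee are hypothetical.

```python
from collections import defaultdict

FLAT_FEE = 0.05  # assumed tier-B price the layer commits to per conversation

class ConversationMeter:
    """Toy BYOK middleware: token metering in, flat per-conversation fee out."""

    def __init__(self, input_price=3.0, output_price=15.0):
        self.input_price = input_price    # provider's $/M input tokens
        self.output_price = output_price  # provider's $/M output tokens
        self.raw_cost = defaultdict(float)  # conversation_id -> provider USD

    def record(self, conversation_id, input_tokens, output_tokens):
        """Accumulate the raw token cost the provider charges for one call."""
        self.raw_cost[conversation_id] += (
            input_tokens / 1e6 * self.input_price
            + output_tokens / 1e6 * self.output_price
        )

    def invoice(self):
        """What the customer sees: one flat, predictable fee per conversation."""
        return {cid: FLAT_FEE for cid in self.raw_cost}

    def variance_absorbed(self):
        """What the layer eats (or pockets) versus raw token cost."""
        return sum(FLAT_FEE - cost for cost in self.raw_cost.values())
```

The design point is that the prediction burden lives entirely in `variance_absorbed()`: if the layer has classified its tiers well, that number hovers near zero across a large book of conversations, and the customer never has to think about tokens at all.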
Providers don't offer this today because a flat per-conversation price means absorbing the variance between fixed revenue and variable compute cost, a hedge that sits far more naturally in a layer on top.
The Future
I'd bet that in three years, serious LLM APIs will offer per-conversation or per-session pricing as a primary option, with per-token as a fallback for batch and pipeline users. Infrastructure layers will increasingly offer it by default. The grams-of-food billing model will feel as primitive then as pay-per-minute long-distance feels now.
Until then, the unit you care about (conversations, users, sessions) has to be reconstructed from the unit you're billed in. That reconstruction is a full-time job, and it's the job a BYOK infrastructure layer does for you.
Tokens are the provider's unit. Conversations are yours. Until pricing converges to the customer unit, you need a layer that does the conversion.
Related: The Hidden Math of LLM Pricing shows why the conversion from token-cost to conversation-cost is harder than it looks.