May 2026 · 3 min read · Johan Bretonneau

Latency Routing vs Cost Routing vs Quality Routing
When to Pick Which

Three routing strategies, three different optimization targets. Here's exactly when to route by latency, by cost, or by quality - and the one mistake that makes all three worse.

There are three things you can route by: latency, cost, or quality. Most systems route by cost because cost is easy to measure. Latency is harder. Quality is hardest. And the one you should actually prioritize depends entirely on the workload - which is why defaulting to cost is usually wrong.

The Three Routing Strategies

Route by cost

Pick the cheapest provider that can handle the request above a quality floor.

When it works: Background jobs, batch processing, async pipelines. Any workload where the user isn't waiting for an answer right now. Document indexing, content moderation queues, nightly report generation.

When it fails: Any user-facing interface where waiting feels like a bug. At 900ms TTFT (Gemini Pro on a bad day), a chatbot feels broken even if the answer is correct.

The math: Routing everything by cost sends simple queries to roughly Gemini Flash ($0.19/1M blended) and complex ones to GPT-4o ($6.88/1M). Average savings vs. Claude 3.5 Sonnet everywhere: ~65%.
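A minimal sketch of the cost rule, assuming each provider carries a blended price and a quality score. The prices are the figures quoted in this article; the `quality` scores and the `Provider` and `route_by_cost` names are hypothetical placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    cost_per_1m: float  # blended USD per 1M tokens
    quality: int        # hypothetical 0-100 score, for illustration only


# Prices come from this article; quality scores are made-up placeholders.
PROVIDERS = [
    Provider("gemini-flash", 0.19, 70),
    Provider("gpt-4o", 6.88, 88),
    Provider("claude-3.5-sonnet", 9.75, 92),
]


def route_by_cost(providers, quality_floor):
    """Return the cheapest provider whose quality meets the floor."""
    eligible = [p for p in providers if p.quality >= quality_floor]
    if not eligible:
        raise ValueError(f"no provider meets quality floor {quality_floor}")
    return min(eligible, key=lambda p: p.cost_per_1m)


simple = route_by_cost(PROVIDERS, quality_floor=60)   # picks gemini-flash
complex_ = route_by_cost(PROVIDERS, quality_floor=80)  # picks gpt-4o
```

The floor does the real work here: lower it and everything collapses onto the cheapest model, raise it and cost routing quietly turns into quality routing.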

Route by latency

Pick the fastest provider that meets a quality floor, regardless of cost.

When it works: Real-time chat, voice interfaces, autocomplete, live code suggestions. Any UX where the perceived speed is the product.

The p99 matters more than the median. Groq at 65ms median and 280ms p99 beats GPT-4o at 620ms median and 2,100ms p99. Even when Groq occasionally spikes to 400ms, that is still faster than GPT-4o's median, so on latency-critical paths Groq's worst day beats GPT-4o's typical one.

The cost: Groq Llama 3.1 70B at $0.70/1M blended isn't the cheapest, but it's not expensive. The cost delta vs. cost routing is usually worth it for any interaction where the user is actively waiting.
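One way to encode "p99 first, median second" is to sort on the tail before the median. This is a sketch under assumptions: the latency numbers are the ones quoted above, while the `quality` scores, the optional `p99_budget_ms` parameter, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyProfile:
    name: str
    median_ms: float
    p99_ms: float
    quality: int  # hypothetical 0-100 score, for illustration only


# Latency figures are the ones quoted in this article.
PROFILES = [
    LatencyProfile("groq-llama-3.1-70b", 65, 280, 72),
    LatencyProfile("gpt-4o", 620, 2100, 88),
]


def route_by_latency(profiles, quality_floor, p99_budget_ms=None):
    """Pick the provider with the lowest p99 (median breaks ties) above the floor."""
    eligible = [p for p in profiles if p.quality >= quality_floor]
    if p99_budget_ms is not None:
        # Optionally reject anything whose tail latency blows the budget.
        eligible = [p for p in eligible if p.p99_ms <= p99_budget_ms]
    if not eligible:
        raise ValueError("no provider satisfies the quality floor and p99 budget")
    return min(eligible, key=lambda p: (p.p99_ms, p.median_ms))


fastest = route_by_latency(PROFILES, quality_floor=70)  # picks groq-llama-3.1-70b
```

Raising instead of silently falling back is a deliberate choice in this sketch: if nothing fits the budget, that is a configuration problem worth seeing.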

Route by quality

Pick the highest-scoring provider that fits within a cost cap.

When it works: Anything with a human reviewer in the loop. Legal drafts, medical notes, complex reasoning tasks where errors have real consequences. The user isn't waiting synchronously - they'll review the output later, and a 20% quality improvement is worth a 3× price increase.

The trap: Quality routing without a cap becomes "route everything to Claude 3.5 Sonnet," which is where most teams accidentally end up. Quality routing only makes sense once you've defined the cap.
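A sketch of quality routing with the cap made explicit, so the function cannot run without one. The prices are the blended figures quoted in this article; the quality scores and the `route_by_quality` name are hypothetical.

```python
# name -> (blended USD per 1M tokens, hypothetical 0-100 quality score)
CANDIDATES = {
    "claude-3.5-sonnet": (9.75, 92),
    "gpt-4o": (6.88, 88),
    "groq-llama-3.1-70b": (0.70, 72),
    "gemini-flash": (0.19, 70),
}


def route_by_quality(candidates, cost_cap_per_1m):
    """Return the highest-quality provider priced at or below the cap."""
    affordable = {
        name: quality
        for name, (cost, quality) in candidates.items()
        if cost <= cost_cap_per_1m
    }
    if not affordable:
        raise ValueError(f"no provider fits under ${cost_cap_per_1m}/1M")
    return max(affordable, key=affordable.get)


print(route_by_quality(CANDIDATES, cost_cap_per_1m=10.0))  # claude-3.5-sonnet
print(route_by_quality(CANDIDATES, cost_cap_per_1m=7.0))   # gpt-4o
```

Because `cost_cap_per_1m` is a required argument, the "no cap" failure mode has to be chosen on purpose rather than reached by accident.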

The Real-World Routing Decision Tree

In practice, you're not choosing one strategy. You're stacking them:

1. Does this request need to complete in < 300ms for good UX?
   → YES: route by latency (Groq > Haiku > GPT-4o-mini)
   → NO: continue

2. Is this a background/async task?
   → YES: route by cost (Gemini Flash > GPT-4o-mini > Haiku)
   → NO: continue

3. Will a human review this output before it affects anything important?
   → YES: route by quality (Claude 3.5 Sonnet > GPT-4o > Gemini Pro)
   → NO: route by cost with a quality floor of 75+

Most teams skip steps 1 and 2, jump straight to the quality branch of step 3, and send everything to Claude 3.5 Sonnet. The result: paying quality prices for latency-sensitive tasks where Groq would perform identically from the user's perspective, and for async tasks where Gemini Flash is 50× cheaper.
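The three questions above translate almost directly into a function. This is a sketch: the parameter names and the `Strategy` enum are hypothetical, and the 300ms threshold is the one from step 1.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    LATENCY = "latency"
    COST = "cost"
    QUALITY = "quality"
    COST_WITH_FLOOR = "cost_with_quality_floor"


def choose_strategy(
    latency_budget_ms: Optional[float],
    is_async: bool,
    human_reviewed: bool,
) -> Strategy:
    """Walk the three questions in order; the first YES wins."""
    # Step 1: does the request need to finish in under 300ms?
    if latency_budget_ms is not None and latency_budget_ms < 300:
        return Strategy.LATENCY
    # Step 2: background or async work routes by cost.
    if is_async:
        return Strategy.COST
    # Step 3: human-reviewed output routes by quality.
    if human_reviewed:
        return Strategy.QUALITY
    # Everything else: cost routing with a quality floor.
    return Strategy.COST_WITH_FLOOR


print(choose_strategy(200, is_async=False, human_reviewed=False))  # Strategy.LATENCY
print(choose_strategy(None, is_async=True, human_reviewed=True))   # Strategy.COST
```

The order of the checks is the point: a request that is both async and human-reviewed still routes by cost, because step 2 is answered before step 3.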

The Mistake That Makes All Three Worse

Routing after the request is already in flight.

If your routing decision happens at the proxy layer after the client has connected and is waiting, the time spent deciding is added to what the user experiences, and you've already paid the connection-setup cost before your "fast" provider sees a byte. The real latency win from Groq comes when the routing decision happens before the request reaches the provider - not mid-flight.

This means your routing layer needs to classify the request in < 1ms, before choosing a provider. Classification via a tiny local model (or a rule-based classifier on request metadata) is the right pattern. Classification via a second LLM call to "decide which model to use" is slower than just using Claude 3.5 Sonnet for everything.

HiWay2LLM's CORTEX layer classifies in < 0.4ms using a combination of token count heuristics and task-type rules derived from request patterns. No model call in the hot path.
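To make the "no model call in the hot path" pattern concrete, here is a minimal rule-based classifier. It is an illustration only and not CORTEX's implementation: the task-type names, the 4-characters-per-token estimate, the 200-token cutoff, and the tier labels are all assumptions.

```python
import time

# Hypothetical task-type labels that a caller would attach to each request.
BACKGROUND_TASKS = frozenset({"indexing", "batch_eval", "nightly_report"})
REVIEWED_TASKS = frozenset({"agent_planning", "legal_draft"})


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English.
    return max(1, len(text) // 4)


def classify(prompt: str, task_type: str) -> str:
    """Return a routing tier using set lookups and one length check."""
    if task_type in BACKGROUND_TASKS:
        return "cost"
    if task_type in REVIEWED_TASKS:
        return "quality"
    if task_type == "chat" and estimate_tokens(prompt) < 200:
        return "latency"
    return "cost_with_floor"


# Measure the classification time on your own hardware before trusting it.
start = time.perf_counter()
tier = classify("What's our refund policy?", "chat")
elapsed_ms = (time.perf_counter() - start) * 1000
print(tier, f"{elapsed_ms:.4f} ms")
```

Nothing in `classify` touches the network or loads a model, so its cost is a few dictionary-style lookups; the timing lines are there so the sub-millisecond claim can be checked rather than assumed.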

Blended Routing: What Actually Ships

In production, most teams end up with a blended strategy:

  • Latency tier for interactive paths (chat, search, autocomplete) → Groq or Haiku
  • Cost tier for background paths (indexing, batch eval, nightly jobs) → Gemini Flash
  • Quality tier for agent planning steps and human-reviewed outputs → Claude 3.5 Sonnet

The routing rule doesn't need to be smart. It needs to be consistent. A simple map of endpoint → tier beats a fancy ML classifier for predictability and debuggability.
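A sketch of what such a map can look like. The endpoint paths and provider identifiers are hypothetical, and Groq is picked for the latency tier where Haiku would be an equally valid choice.

```python
# Hypothetical endpoint paths mapped to the three tiers described above.
TIER_BY_ENDPOINT = {
    "/chat": "latency",
    "/search": "latency",
    "/autocomplete": "latency",
    "/index": "cost",
    "/batch-eval": "cost",
    "/nightly": "cost",
    "/agent/plan": "quality",
    "/review": "quality",
}

PROVIDER_BY_TIER = {
    "latency": "groq",
    "cost": "gemini-flash",
    "quality": "claude-3.5-sonnet",
}


def provider_for(endpoint: str) -> str:
    """Look up the provider for an endpoint; unmapped endpoints fail loudly."""
    try:
        tier = TIER_BY_ENDPOINT[endpoint]
    except KeyError:
        raise KeyError(f"no tier mapped for endpoint {endpoint!r}") from None
    return PROVIDER_BY_TIER[tier]


print(provider_for("/chat"))        # groq
print(provider_for("/agent/plan"))  # claude-3.5-sonnet
```

Failing on an unmapped endpoint, rather than defaulting to the quality tier, keeps new endpoints from silently landing on the most expensive provider.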

The result in our benchmarks: blended routing against these three tiers costs $2.40/1M on average vs. $9.75 for Claude 3.5 Sonnet everywhere - a 75% reduction with no perceptible quality loss on 90% of requests.
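The 75% figure follows directly from the two averages, and a weighted average shows how a traffic mix turns per-tier prices into one blended number. In this sketch the two averages and the per-tier prices come from this article, but the traffic shares are purely hypothetical and are not the benchmark's actual mix.

```python
def percent_reduction(baseline: float, routed: float) -> float:
    """Percentage saved when moving from the baseline cost to the routed cost."""
    return (baseline - routed) / baseline * 100


def blended_cost(mix) -> float:
    """Weighted average cost for (traffic_share, cost_per_1m) pairs."""
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost for share, cost in mix)


# Figures quoted above: $9.75/1M everywhere vs. $2.40/1M blended.
print(f"{percent_reduction(9.75, 2.40):.1f}%")  # 75.4%

# Hypothetical traffic mix, for illustration only.
example_mix = [
    (0.5, 0.70),  # latency tier, Groq Llama 3.1 70B
    (0.3, 0.19),  # cost tier, Gemini Flash
    (0.2, 9.75),  # quality tier, Claude 3.5 Sonnet
]
print(f"${blended_cost(example_mix):.2f}/1M")
```

The weighted average makes the sensitivity obvious: the quality tier's share dominates the blended number, so that is the share worth measuring first.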

One Configuration That Works

Here's a starting point that works for most SaaS AI products:

| Request type | Provider | Rationale |
|---|---|---|
| Chat messages < 200 tokens | Groq Llama 3.1 70B | Sub-100ms, quality sufficient |
| Chat messages > 200 tokens | GPT-4o-mini | Quality jump for medium complexity |
| Agent planning / reasoning | Claude 3.5 Sonnet | Quality matters, not latency |
| Document extraction | Gemini 2.0 Flash | Cheap, context-window OK |
| Long document analysis | Gemini 2.0 Pro | 1M context window |
| Code generation | Claude 3.5 Sonnet | Best-in-class for code |

This isn't the optimal configuration for every product. But it's better than the default, and it's auditable - you can see exactly why each request costs what it costs.
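To keep that auditability in code, the router can return the rationale alongside the provider. This is a sketch: the request-type keys and the `Route` type are hypothetical, and since the configuration leaves a message of exactly 200 tokens unspecified, this version treats it as the longer case.

```python
from typing import NamedTuple


class Route(NamedTuple):
    provider: str
    rationale: str


# Non-chat rows of the configuration, keyed by hypothetical request types.
STATIC_ROUTES = {
    "agent_planning": Route("Claude 3.5 Sonnet", "Quality matters, not latency"),
    "document_extraction": Route("Gemini 2.0 Flash", "Cheap, context-window OK"),
    "long_document_analysis": Route("Gemini 2.0 Pro", "1M context window"),
    "code_generation": Route("Claude 3.5 Sonnet", "Best-in-class for code"),
}


def route(request_type: str, tokens: int = 0) -> Route:
    """Return the provider and the reason it was chosen."""
    if request_type == "chat":
        # Assumption: exactly 200 tokens falls on the GPT-4o-mini side.
        if tokens < 200:
            return Route("Groq Llama 3.1 70B", "Sub-100ms, quality sufficient")
        return Route("GPT-4o-mini", "Quality jump for medium complexity")
    if request_type not in STATIC_ROUTES:
        raise ValueError(f"unknown request type: {request_type!r}")
    return STATIC_ROUTES[request_type]


decision = route("chat", tokens=120)
print(f"{decision.provider}: {decision.rationale}")
```

Logging the returned `rationale` with each request gives a per-request answer to "why did this cost what it cost" without reverse-engineering the routing logic.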
