May 2026 · 6 min read · Johan Bretonneau

Your LLM Gateway Doesn't Know You're Running an Agent
How stateless routing silently breaks multi-turn agents

Every LLM gateway routes each request in isolation. For a one-shot app, that's fine. For a multi-turn agent, it's a disaster waiting to happen. Here's how session-aware routing changes everything.

There's a class of bug that only shows up in production, with real workloads, after you've already shipped. It doesn't throw an exception. It doesn't return a 500. It just quietly makes your agent worse, and more expensive, over time.

The bug is this: your LLM gateway is stateless, and your agent isn't.

The Problem With Stateless Routing

When a request hits an LLM gateway, the gateway makes a routing decision: which model, which provider, based on cost and load and latency. Then it forwards the request and forgets about it. The next request starts from zero.
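To make "forgets about it" concrete, here is a minimal sketch of a stateless routing decision. The `Candidate` fields, the scoring weights, and the numbers are all assumptions made up for illustration, not HiWay2LLM's actual scoring. The point is only that the function sees current conditions and nothing about earlier turns.

```typescript
// Hypothetical candidate shape; every field and weight here is illustrative.
interface Candidate {
  model: string;
  capability: number;     // 0-10, higher is better (made-up scale)
  costUsdPerMTok: number; // illustrative price, not a real quote
  load: number;           // 0 (idle) to 1 (saturated)
  latencyMs: number;      // recent latency estimate
}

// Higher score wins. Cost, load, and latency all count against a candidate.
function score(c: Candidate): number {
  return c.capability * 2 - c.costUsdPerMTok - c.load * 20 - c.latencyMs / 500;
}

// Stateless: the decision depends only on the candidates as they look right now.
function routeStateless(candidates: Candidate[]): string {
  if (candidates.length === 0) throw new Error("no candidates to route to");
  return candidates.reduce((best, c) => (score(c) > score(best) ? c : best)).model;
}

const haiku: Candidate = {
  model: "claude-haiku-3-5", capability: 6, costUsdPerMTok: 0.8, load: 0.1, latencyMs: 500,
};
const sonnetQuiet: Candidate = {
  model: "claude-sonnet-3-7", capability: 9, costUsdPerMTok: 3, load: 0.2, latencyMs: 1000,
};
// Same model a few turns later, during a traffic spike.
const sonnetBusy: Candidate = { ...sonnetQuiet, load: 0.9, latencyMs: 2500 };

const earlyTurn = routeStateless([sonnetQuiet, haiku]); // "claude-sonnet-3-7"
const laterTurn = routeStateless([sonnetBusy, haiku]);  // "claude-haiku-3-5"
```

Both calls could belong to the same conversation. The router has no way to know that, so a load spike alone is enough to change the answer.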

That model works perfectly for 90% of LLM use cases. A user sends a message, you get a response, done. Each call is independent. Stateless routing is fast, simple, and correct.

But agents are not independent calls. An agent is a conversation, a sequence of turns that build on each other. The context window at turn 5 contains everything from turns 1 through 4. The model's behavior at turn 5 depends on how it interpreted turns 1 through 4.

When the gateway routes turn 1 to claude-sonnet-3-7 and then routes turn 5 to claude-haiku-3-5 because the cluster is under load, you haven't just switched models. You've introduced a different intelligence into the middle of an ongoing reasoning process.

The agent was building a plan. Now a different model is reading that plan and deciding what to do next, a model with different strengths, different tendencies, different failure modes. The conversation context was written by one model and is now being interpreted by another.

What Actually Happens

Let me make this concrete. Say you're running a research agent: it fetches data, reasons over it, calls tools, builds up a response across multiple turns.

Turn 1. The gateway sees a fresh request, no conversation history. It routes to the best available model for the task, let's say claude-sonnet-3-7.

Turn 2. Sonnet generates a plan. It outlines 4 steps, makes some intermediate tool calls, and asks a clarifying question. The response is coherent and on-track.

Turn 3. A wave of traffic hits your gateway. Load balancing kicks in, and the cheaper tier becomes the smart routing choice. Your turn 3 goes to claude-haiku-3-5.

Turn 4. Haiku reads the accumulated context, a 12K-token conversation written by Sonnet, full of nuanced reasoning chains and partially completed steps. Haiku does its best, but its interpretation is different. It misses a subtlety in step 3. It gives a shorter, flatter response.

Turn 5. Sonnet is back (load has dropped). It reads turn 4's response and tries to continue, but turn 4 deviated. Sonnet tries to reconcile. The agent loops. Costs spike.

No error thrown. No 500. Just an agent quietly losing coherence across model switches it never knew were happening.

The Fix: One Header

The simplest possible fix is also the right one: session locking via a conversation ID.

X-Conversation-ID: <your-session-id>

Send that header with every request in an agent's conversation. The gateway recognizes it and remembers: turn 1 went to model X, so turns 2 through N go to model X as well. The session is pinned.

No code changes to your agent logic. No schema migrations. No new SDK to learn. One HTTP header on every call.

In HiWay2LLM, we added this as part of the AGENT Router Profile. When you use the AGENT profile, the gateway enforces the session lock automatically: once a model is selected for a conversation ID, it stays selected for the lifetime of that conversation. It also enforces a quality floor, preventing cost-saving downgrades mid-task even under load.
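On the gateway side, the core of a session lock can be very small. The sketch below is an assumption about the general shape, not HiWay2LLM's implementation: `SessionPinningRouter` and `pickModel` are hypothetical names, the pin lives in an in-memory map, and expiry of old conversations is left out.

```typescript
// Hypothetical sketch of session pinning; not the gateway's real code.
class SessionPinningRouter {
  // conversation ID -> model chosen on that conversation's first turn
  private pins = new Map<string, string>();
  private pickModel: () => string;

  // `pickModel` stands in for whatever stateless routing logic already exists.
  constructor(pickModel: () => string) {
    this.pickModel = pickModel;
  }

  route(conversationId?: string): string {
    // No conversation ID: behave like a plain stateless router.
    if (conversationId === undefined) return this.pickModel();

    const pinned = this.pins.get(conversationId);
    if (pinned !== undefined) return pinned; // turns 2..N reuse the first choice

    const model = this.pickModel(); // turn 1: route normally, then remember
    this.pins.set(conversationId, model);
    return model;
  }
}

// Simulate a load spike partway through a conversation.
let busy = false;
const router = new SessionPinningRouter(() =>
  busy ? "claude-haiku-3-5" : "claude-sonnet-3-7"
);

const firstTurn = router.route("session-a");   // "claude-sonnet-3-7", now pinned
busy = true;
const lockedTurn = router.route("session-a");  // still "claude-sonnet-3-7"
const newSession = router.route("session-b");  // "claude-haiku-3-5"
```

A production version would need shared storage across gateway instances and a policy for when a pinned model becomes unavailable. Neither is covered here.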

Router Profiles: Matching Behavior to Workload

One insight we arrived at early: different workloads need fundamentally different routing behavior. A one-size-fits-all router is a compromise that's wrong for everyone.

We ended up with four core profiles:

CHAT. Interactive, latency-sensitive, lightweight. A human is waiting for the response. Optimize for perceived speed, prefer models with fast first-token latency, accept some variance in output depth. No session locking needed because each exchange is short-lived.

AGENT. Long-running, multi-turn, correctness-critical. A machine is orchestrating a complex task. Session lock is mandatory. Quality floor enforced. No model downgrade mid-session regardless of load.

BATCH. Async processing, not time-sensitive. A job queue is working through a large backlog. Latency doesn't matter; throughput and cost do. Maximize provider-side efficiency, spread load.

SAVINGS. Cost-first routing. You've set a hard budget constraint and you want every dollar squeezed out. The router finds the cheapest model that can handle the task, no quality floor, no frills.
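One way to picture the four profiles is as a small policy table. This is a sketch with hypothetical type and field names (`ProfilePolicy`, `optimizeFor`, and so on), not the gateway's real configuration schema. Fields are left unset wherever the descriptions above don't say what a profile does.

```typescript
type RouterProfile = "CHAT" | "AGENT" | "BATCH" | "SAVINGS";

// Hypothetical policy shape. Optional fields stay undefined when the
// profile descriptions above are silent on that behavior.
interface ProfilePolicy {
  optimizeFor: string;
  sessionLock?: boolean;
  qualityFloor?: boolean;
}

const PROFILE_POLICIES: Record<RouterProfile, ProfilePolicy> = {
  CHAT: { optimizeFor: "first-token latency", sessionLock: false },
  AGENT: { optimizeFor: "consistency", sessionLock: true, qualityFloor: true },
  BATCH: { optimizeFor: "throughput and cost" },
  SAVINGS: { optimizeFor: "lowest cost", qualityFloor: false },
};

// Exact-match parsing of a profile header value. How the real gateway treats
// casing or unknown values is not specified here, so this returns undefined.
function parseProfile(headerValue: string | undefined): RouterProfile | undefined {
  if (headerValue === undefined) return undefined;
  const value = headerValue.trim();
  return Object.prototype.hasOwnProperty.call(PROFILE_POLICIES, value)
    ? (value as RouterProfile)
    : undefined;
}
```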

The header to set a profile is just as simple:

X-Router-Profile: AGENT

You pick the profile once, at the point where you configure the agent (alongside its system prompt or at agent initialization), and the header goes out on every call. The gateway handles everything else.
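A sketch of what "pick it once" can look like in application code: wrap the chat function at agent initialization so every later call carries the headers. The `ChatRequest` and `ChatFn` types and the `withAgentSession` helper are hypothetical, modeled on the `hiway.chat({ messages, headers })` call shape shown at the end of this article rather than on a documented SDK type.

```typescript
// Hypothetical request shape, modeled on the call shown later in this article.
interface ChatRequest {
  messages: { role: string; content: string }[];
  headers?: Record<string, string>;
}
type ChatFn<T> = (req: ChatRequest) => Promise<T>;

// Returns a chat function that always sends the session and profile headers.
// Caller-supplied headers are kept; the two routing headers take precedence.
function withAgentSession<T>(chat: ChatFn<T>, conversationId: string): ChatFn<T> {
  return (req) =>
    chat({
      ...req,
      headers: {
        ...req.headers,
        "X-Conversation-ID": conversationId,
        "X-Router-Profile": "AGENT",
      },
    });
}
```

With a wrapper like this, the agent loop calls the wrapped function exactly as it called the original, and nothing inside the loop has to know about routing.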

Divergence Detection: The Layer Underneath

Session locking solves the model-switch problem. But there's a related failure mode that locking alone doesn't fix: the agent that stays on the same model but starts looping anyway.

This happens when the model gets into a state where it keeps repeating similar actions: re-running a tool call, re-asking the same question, or backtracking to a step it already completed. The model is consistent, but the agent is stuck.

We call this divergence: a conversation pattern that cycles without making progress. When the AGENT profile detects it, it can inject a corrective signal. Not a hard stop, not a kill-switch, but a small nudge: a system message that the gateway injects into the next turn to reorient the model.

We're deliberately not publishing the detection logic here. What matters is the observable behavior: agents using the AGENT profile diverge significantly less often than the same agents running without it. When they do diverge, recovery is automatic.
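To be clear about what follows: this is not HiWay2LLM's detection logic, which stays unpublished. It is a deliberately naive illustration of the general idea, covering only the simplest symptom named above (the same tool call repeated). The function names, the window and threshold defaults, and the nudge wording are all made up for the example.

```typescript
interface ToolCall {
  name: string;
  args: unknown;
}
interface Message {
  role: string;
  content: string;
}

// Naive heuristic: true when an identical tool call (same name, same
// JSON-serialized args) appears at least `threshold` times within the
// last `window` calls. JSON key order matters, so this is only approximate.
function looksStuck(calls: ToolCall[], window = 6, threshold = 3): boolean {
  const counts = new Map<string, number>();
  for (const call of calls.slice(-window)) {
    const key = `${call.name}:${JSON.stringify(call.args)}`;
    const n = (counts.get(key) ?? 0) + 1;
    if (n >= threshold) return true;
    counts.set(key, n);
  }
  return false;
}

// Appends a corrective system message without mutating the input.
// The wording is a placeholder, not the gateway's actual nudge.
function withNudge(messages: Message[]): Message[] {
  return [
    ...messages,
    {
      role: "system",
      content:
        "You appear to be repeating a previous step. Summarize progress so far and choose a different next action.",
    },
  ];
}
```

Even a heuristic this crude shows why a nudge is gentler than a kill-switch: the conversation continues, with one extra message in the next turn.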

Why We Built This

The honest building-in-public answer: we needed it for our own agents at Mytm.ai.

We run agents for several internal workflows: research pipelines, content processes, document analysis. Early on, before we'd instrumented anything properly, we started noticing that some of our agents were producing subtly worse output on certain runs. Debugging was painful. The agents weren't crashing; they were just… drifting.

When we dug into the request logs, the pattern was obvious: the conversations where quality dropped were the ones where the model had switched mid-session. Not always, but consistently enough to matter.

The fix, X-Conversation-ID and a locked routing policy, was the simplest possible solution. We built it for ourselves first, shipped it internally, and watched the drift problem go away. Then we made it a first-class feature in HiWay2LLM because we knew every team running agents would hit the same wall.

The header was the obvious answer. The hard part was recognizing the problem existed at all.

In Practice

If you're running agents today through any LLM gateway, check whether it supports any form of conversation-level session affinity. Most don't. They were built for one-shot request routing, and that's a legitimate architectural choice for that use case.

But if your agents are doing anything complex (multi-turn reasoning, tool use, iterative planning), you need session awareness at the gateway layer, not just inside your application code.
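If you're not sure whether your current setup has this problem, your request logs can tell you. The sketch below assumes each log entry records a conversation identifier, a turn number, and the model that served the turn. The `RequestLog` shape is hypothetical and your gateway's log format will differ. It groups entries by conversation and reports those served by more than one model.

```typescript
// Hypothetical log entry shape; adapt the field names to your own logs.
interface RequestLog {
  conversationId: string;
  turn: number;
  model: string;
}

// Returns conversation ID -> models in turn order, but only for
// conversations that were served by more than one distinct model.
function findModelSwitches(logs: RequestLog[]): Map<string, string[]> {
  const byConversation = new Map<string, RequestLog[]>();
  for (const log of logs) {
    const entries = byConversation.get(log.conversationId) ?? [];
    entries.push(log);
    byConversation.set(log.conversationId, entries);
  }

  const switched = new Map<string, string[]>();
  byConversation.forEach((entries, id) => {
    const models = entries
      .slice()
      .sort((a, b) => a.turn - b.turn)
      .map((e) => e.model);
    if (new Set(models).size > 1) switched.set(id, models);
  });
  return switched;
}
```

An empty result on a representative sample of agent traffic suggests your conversations are already staying on one model. A non-empty one gives you specific conversations to compare against your quality complaints.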

The application-level workaround is to always pass model=claude-sonnet-3-7 (or whatever) explicitly and bypass the router's model-selection logic. That works, but you lose the benefits of intelligent routing: automatic fallback on provider outages, cost optimization between tasks, load balancing across your API keys.

The right solution is a gateway that understands the difference between a request and a conversation, and routes accordingly.

// Before: stateless, every turn re-routes independently
const response = await hiway.chat({
  messages: conversationHistory,
});

// After: session-aware, model locked for the conversation lifetime
const response = await hiway.chat({
  messages: conversationHistory,
  headers: {
    "X-Conversation-ID": sessionId,
    "X-Router-Profile": "AGENT",
  },
});

Two headers. No other changes. Your agent now has a consistent model identity for the entire conversation.

That's the whole fix. The surprising part isn't how complex it is. The surprising part is how few gateways offer it.



