May 2026 · 6 min read · Johan Bretonneau

What 1,000 Agent Sessions Taught Us About LLM Routing
Building in public: the Agent Sessions panel

We built a live session monitor and 30-day analytics panel for agentic traffic. Here's what the data revealed, and why turns-per-session is the metric that actually matters.

You deploy an agent. You hit run. And then, nothing. It's alive somewhere, consuming tokens, making decisions, calling tools. You have logs. Maybe a token counter. But you have no idea how many turns that session actually made, whether the model changed mid-run, whether it diverged and recovered, or whether it's been stuck in a tool loop for the past 20 minutes.

This isn't a niche edge case. It's the default state of running agents in production.

So we built something to fix it.

The Blind Spot

When you operate a shared LLM gateway, one that handles traffic across multiple workspaces, dozens of API keys, and a mix of chat and agentic use cases, you develop a very specific kind of anxiety. You can see requests going in and responses coming out. You can see token counts. But an agent session isn't a request. It's a series of requests, bound together by context, intent, and tool calls, and none of the standard observability primitives tell you how that session is going.

The question that kept coming up internally was simple: is this session behaving normally, or is something wrong?

You can't answer that question from individual request logs. You can answer it if you track the session as a whole.

What We Built

Over the past few weeks, we shipped the Agent Sessions panel inside the HiWay2LLM dashboard. It has two layers.

The first is a live session monitor: sessions currently active, their turn count in progress, their divergence status, and whether any active session has crossed into territory that typically signals trouble. It refreshes in real time. When something looks off, you see it immediately instead of finding out from the next day's cost report.

The second is 30-day analytics across all agentic traffic for your workspace. This is where the interesting stuff lives.

The Metrics That Changed How We Think

Turns per session

This sounds obvious until you realize how rarely anyone tracks it.

Tokens measure volume. Turns measure behavior. An agent that does 4 turns of 10,000 tokens each is doing something completely different from an agent that does 40 turns of 1,000 tokens each, even if the total token count is the same.

Turns per session tells you how deeply your agents are reasoning before they converge on an answer. It tells you when a task is genuinely complex versus when an agent is spinning. It tells you whether your prompting strategy is tight or whether the model is wandering.

Once we started surfacing average turns per session alongside token costs, our read of agent performance changed completely. We stopped asking "how much does this agent cost?" and started asking "how many turns does it need, and why?"
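To make the tokens-versus-turns distinction concrete, here is a minimal sketch of how per-request logs could be rolled up into per-session turn counts. The `RequestLog` shape and its `session_id` field are assumptions made for illustration, not the panel's actual schema, and each request is treated as one turn.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestLog:
    session_id: str  # hypothetical field tying a request to its agent session
    tokens: int      # tokens consumed by this single request (one turn)


def summarize_sessions(logs):
    """Group per-request logs by session; return {session_id: (turns, tokens)}."""
    turns = defaultdict(int)
    tokens = defaultdict(int)
    for log in logs:
        turns[log.session_id] += 1
        tokens[log.session_id] += log.tokens
    return {sid: (turns[sid], tokens[sid]) for sid in turns}


# Two sessions with the same token total but very different behavior.
logs = (
    [RequestLog("deep", 10_000) for _ in range(4)]
    + [RequestLog("chatty", 1_000) for _ in range(40)]
)
summary = summarize_sessions(logs)
avg_turns = mean(t for t, _ in summary.values())
```

Both sessions total 40,000 tokens, so a token counter cannot tell them apart. The turn counts (4 versus 40) can.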

Divergence rate and recovery rate

An agent diverges when it departs from expected behavior: context drift, circular tool calls, reasoning that stops converging. Divergence happens. The interesting question isn't whether it happens, but whether the agent recovers.

An agent that diverges and recovers in two turns is healthy. An agent that diverges and loops for 30 turns is a problem. These look identical in a raw token count. They look completely different in session analytics.

We track both the divergence rate (how often sessions go off track) and the recovery rate (of those, how many come back). A workspace with a 12% divergence rate and a 90% recovery rate is in great shape. A workspace with a 5% divergence rate and a 40% recovery rate has a structural problem: a low failure rate, but when things go wrong, they really go wrong.

Watching these two rates together over time tells you more about your agent's reliability than any individual session trace.
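One way to see why the second workspace is worse is to multiply the two rates out into the share of all sessions that diverge and never recover. The sketch below does that. The `SessionOutcome` type and the `unrecovered_share` helper are illustrative assumptions, not how the panel necessarily computes anything.

```python
from dataclasses import dataclass


@dataclass
class SessionOutcome:
    diverged: bool
    recovered: bool = False  # only meaningful when diverged is True


def divergence_and_recovery(sessions):
    """Return (divergence_rate, recovery_rate) as fractions in [0, 1].

    recovery_rate is None when no session diverged (nothing to recover from).
    """
    total = len(sessions)
    diverged = [s for s in sessions if s.diverged]
    divergence_rate = len(diverged) / total if total else 0.0
    recovery_rate = (
        sum(s.recovered for s in diverged) / len(diverged) if diverged else None
    )
    return divergence_rate, recovery_rate


def unrecovered_share(divergence_rate, recovery_rate):
    """Fraction of all sessions that diverge and never come back."""
    return divergence_rate * (1.0 - recovery_rate)


# The two workspaces described above.
healthy = unrecovered_share(0.12, 0.90)     # about 1.2% of all sessions
structural = unrecovered_share(0.05, 0.40)  # about 3% of all sessions
```

Despite its lower divergence rate, the second workspace ends up with roughly two and a half times as many sessions that never recover.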

Model affinity

This one surprised us most. We expected workspaces to use different models for different tasks: Opus for heavy reasoning, Haiku for simple ops, maybe a mix depending on time of day or request complexity. That's the rational, cost-optimized behavior.

That's not what we saw.

Once a workspace finds a model that works for its use case, it stays there. The distribution across sessions tends to consolidate toward one or two models handling more than 80% of the traffic. This happens gradually, often without any explicit decision: engineers try a model, it works, they stop experimenting, and the affinity locks in.

Model affinity is a useful signal because it tells you whether your routing strategy is working as intended. If you're supposed to be using the AGENT profile with automatic model selection but your sessions are all hitting the same endpoint, something in your configuration isn't matching your intent.
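A rough way to quantify affinity is the share of sessions handled by the top one or two models. The sketch below assumes you already have one model label per session. The function name and the sample data are made up for illustration and are not measurements from the panel.

```python
from collections import Counter


def top_model_share(session_models, k=2):
    """Share of sessions handled by the k most-used models (0.0 if no data)."""
    counts = Counter(session_models)
    if not counts:
        return 0.0
    top = sum(n for _, n in counts.most_common(k))
    return top / sum(counts.values())


# Illustrative data only: one model label per session.
models = ["opus"] * 6 + ["haiku"] * 3 + ["other"] * 1
share = top_model_share(models)  # 9 of 10 sessions go to the top two models
```

A share above roughly 0.8 for `k=2` would match the consolidation pattern described above. If automatic model selection is meant to be on, a share that high is worth a second look at the configuration.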

The Unexpected Signal: Turn Count as a Loop Detector

Here's the thing nobody talks about: the most reliable early signal of a tool loop isn't token count. It's not latency. It's not error rate.

It's an abnormally high turn count per session.

When a session's turn count climbs significantly past the baseline you observe in your normal traffic, it's almost always an agent that's stuck. The model keeps calling a tool, getting a result it doesn't know how to handle, and calling the tool again. Each call adds another turn. The session never terminates.

We learned this by looking at the data. Sessions that had been flagged manually as "something went wrong" clustered at turn counts far above the workspace's median. They weren't necessarily token-heavy; some were actually token-light because each tool call was cheap. But the number of back-and-forth cycles was the tell.

So we built an alert layer on top of it. When a live session's turn count crosses well above the observed normal range for that workspace, it surfaces in the monitor. Not as a hard block (Guardian handles that), but as a flag: this session is behaving differently from your baseline. Worth checking.

The goal isn't to stop every long session. Some legitimate tasks genuinely take many turns. The goal is to separate "complex task" from "stuck agent," and the fastest way to do that is to compare against what normal looks like for your specific workspace, not a global threshold.
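As a sketch of what a per-workspace baseline could look like, the snippet below flags a live session when its turn count exceeds the workspace median by several median absolute deviations (MAD). The MAD rule, the multiplier `k`, and the `min_spread` floor are assumptions chosen for illustration. The post does not say how the monitor defines "well above normal".

```python
from statistics import median


def turn_count_threshold(baseline_turns, k=5.0, min_spread=3.0):
    """Flag threshold derived from one workspace's own history.

    Computes median + k * MAD. min_spread (hypothetical) keeps the threshold
    usable when the history is nearly constant and the MAD is close to zero.
    """
    med = median(baseline_turns)
    mad = median(abs(t - med) for t in baseline_turns)
    return med + k * max(mad, min_spread)


def is_flagged(live_turns, baseline_turns):
    """True when a live session sits well above this workspace's baseline."""
    return live_turns > turn_count_threshold(baseline_turns)


# Illustrative history of completed sessions for one workspace.
history = [4, 5, 6, 5, 7, 6, 5, 8, 6, 5]
stuck = is_flagged(40, history)   # far above this workspace's normal range
normal = is_flagged(9, history)   # long, but still within range
```

Because the threshold comes from the workspace's own history, a workspace whose legitimate tasks routinely run 30 turns would not be flagged at 40, while this one would.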

What It Changes

Analytics without action is just noise. The reason we built this panel is that the data feeds directly into routing recommendations.

When we see a workspace averaging 7 turns per session on a CHAT profile, we can surface a prompt: switching to the AGENT profile would improve session coherence and reduce divergence rate. The analytics are watching. The routing logic can respond.
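A recommendation rule of this kind can be very small. The sketch below is an assumption about its shape, not the product's actual logic: the `min_avg_turns` cutoff is hypothetical, and only the CHAT and AGENT profile names come from the text above.

```python
from statistics import mean


def recommend_profile(current_profile, turns_per_session, min_avg_turns=5.0):
    """Return a suggestion string, or None when no change is suggested."""
    if not turns_per_session:
        return None
    avg = mean(turns_per_session)
    # Hypothetical rule: multi-turn traffic on CHAT looks agentic.
    if current_profile == "CHAT" and avg >= min_avg_turns:
        return f"Averaging {avg:.1f} turns per session on CHAT: consider AGENT."
    return None


suggestion = recommend_profile("CHAT", [6, 7, 8])
```

Returning `None` for the no-op case keeps the rule easy to chain with other recommendations.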

The loop closes: observe sessions, learn patterns, improve routing, measure again. That's the infrastructure we didn't have six months ago.

This also changes how we think about the product itself. Every metric we surface in analytics is a metric we can optimize for at the routing layer. Model affinity tells us where to apply intelligent model selection. Recovery rate tells us where to tighten divergence handling. Turn count distribution tells us how to set appropriate session-level guardrails.

None of this was visible when all we had was per-request logs.

Building in Public

One thing we didn't expect to learn: the model affinity pattern is essentially universal. We went in thinking agentic workspaces would show diverse model usage as they optimized for different tasks. What we found is that most workspaces are not optimizing at that granularity: they find something that works and they stick with it.

That's not a criticism. It's a workflow reality. Engineers are busy. If a model works, you don't keep experimenting. The insight for us is that the value of intelligent routing is higher than we thought, precisely because most teams aren't doing it manually. The infrastructure needs to do it for them.

The second thing we didn't expect: how fast the turn-count signal shows up in the data. We assumed we'd need to accumulate a lot of sessions before the distribution became meaningful. In practice, even modest traffic volumes produce a clear baseline quickly. The signal is stable.

That's what 1,000 agent sessions taught us. Not a list of obvious metrics. The shape of how agents actually behave when you watch them carefully, and what to do when the shape looks wrong.
