May 2026 · 5 min read · Johan Bretonneau

The Silent Burn: A Zombie Agent Ran for 4 Days Before I Noticed
Why a loud failure is easier to catch than a quiet one

An LLM agent I'd forgotten about retried the same call 44 times in 96 hours, all silent, all unnoticed. Here's the autopsy, the cost in alternate timelines, and why silent burn is harder to catch than the 3AM kind.

[Figure: Token burn rate in a zombie retry loop. Cumulative spend ($) over time, with the alert threshold shown in red.]

A PowerShell window I'd minimized 5 days ago. An orange line near the top. A timestamp from a hostname I no longer use, because I'd renamed my dev machine since.

That's how I found out a local LLM agent had been running in a tight retry loop for four straight days. Forty-four identical timeouts, fired by the agent's own 30-minute heartbeat, into a 32K context that always overflowed. The median gap between failures was 1800 seconds, exact. Not a human. A loop.

If that loop had been pointed at a paid provider instead of a local model, I'd be writing a different post. With a bigger number.

The autopsy

The agent was running on my own machine, calling a local 32B model through Ollama. Every 30 minutes a heartbeat fired. Every 30 minutes the heartbeat shipped a prompt that maxed out the context window. Every 30 minutes the model timed out before it could generate a meaningful response. The agent logged an internal error and gave up. Then waited 30 minutes. Then did it again.

Here's the failure pattern, parsed out of the logs:

| Metric | Value |
| --- | --- |
| Total timeouts | 44 |
| Time span | 96 hours (4 days) |
| Median gap between events | 1800 s, exact |
| Successful generations | 0 |
| Tokens shipped to the model | ~1.4M (input prefill, no output) |
| Cron jobs configured | 0 |
| Active client connections | 0 |
| Tasks tracked in the run database | 0 |
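
For anyone wanting to run the same check on their own logs, here is a minimal sketch of how the gap statistics could be computed. It assumes the failure timestamps have already been extracted as ISO-8601 strings; the function name `gap_stats` and the sample data are hypothetical, not taken from the agent's actual log format.

```python
from datetime import datetime
from statistics import median


def gap_stats(timestamps: list[str]) -> dict:
    """Summarize the spacing between failure events (ISO-8601 timestamps)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two events to measure a gap")
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    # Seconds between each consecutive pair of events.
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return {
        "events": len(times),
        "span_hours": (times[-1] - times[0]).total_seconds() / 3600,
        "median_gap_s": median(gaps),
    }


# Hypothetical sample: four timeouts on a 30-minute heartbeat.
sample = [
    "2026-05-01T00:00:00",
    "2026-05-01T00:30:00",
    "2026-05-01T01:00:00",
    "2026-05-01T01:30:00",
]
stats = gap_stats(sample)
print(stats)
```

A median that lands on a round number of seconds, with almost no spread around it, is the signature of a timer rather than a person.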

Zero connections, zero successful runs, zero tasks. The agent was performing a private ritual to itself. Forty-four times.

What it would have cost on an API

The same 44 calls - same prompt size, same retry pattern - at typical 2026 prices, no optimization:

| Backend tier | Input cost (~1.4M tokens) | Cost over 4 days |
| --- | --- | --- |
| Cheapest small (~$1/M input) | $1.40 | ~$1.43 |
| Mid-tier (~$3/M input) | $4.20 | ~$4.28 |
| Top-tier (~$15/M input) | $21.00 | ~$21.38 |
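
The middle column of that table is plain multiplication, and it is worth seeing how small the hourly figure is. The sketch below reproduces the input-side numbers only, using the ~1.4M token total from the autopsy and the rounded per-million prices quoted above; the variable names are illustrative, and the slightly higher 4-day totals in the last column are not modeled here.

```python
TOTAL_INPUT_TOKENS = 1_400_000  # ~1.4M input tokens over the whole loop
HOURS = 96                      # 4 days

# Rounded input prices in dollars per million tokens, as quoted in the table.
PRICE_PER_MILLION = {"small": 1.0, "mid": 3.0, "top": 15.0}


def input_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of sending `tokens` input tokens at the given price."""
    return tokens * price_per_million / 1_000_000


costs = {
    tier: round(input_cost(TOTAL_INPUT_TOKENS, price), 2)
    for tier, price in PRICE_PER_MILLION.items()
}
hourly = {tier: cost / HOURS for tier, cost in costs.items()}

for tier in costs:
    print(f"{tier}: ${costs[tier]:.2f} total, ${hourly[tier]:.3f}/hour")
```

On the mid tier that works out to roughly four cents an hour, which is far below any spend threshold a team would realistically set.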

Now turn on native prompt caching at the provider level:

| Backend | No cache | Native prompt cache (~80% hit) |
| --- | --- | --- |
| Mid-tier | $4.28 | ~$1.20 |
| Top-tier | $21.38 | ~$5.80 |

Now turn on semantic caching at the gateway level. This is where the math gets interesting. The 44 retries were near-identical: the same heartbeat shipping the same prompt every 30 minutes. A semantic cache in front of the provider returns the first response to all subsequent matching calls, without ever round-tripping to the model:

| Setup | Mid-tier cost over 4 days |
| --- | --- |
| Direct API, no cache | $4.28 |
| Direct API + native prompt cache | $1.20 |
| Gateway semantic cache (1 real call + 43 cached) | ~$0.15 |

That's a 96% reduction versus the direct, naive setup. On a workload that, by definition, was a tight loop on the same prompt - exactly the workload semantic caching was designed for.

Now picture that loop running not for 4 days against a free local model, but for 4 months against a paid one. The 96% becomes the difference between a $20 mistake and a $500 mistake.

Loud burn is easy. Silent burn is the real problem.

We've written before about the night an agent burned $200 in 3 hours. That's the loud kind. It trips the provider's threshold alert, your phone buzzes, you SSH in, you kill it, you eat the cost.

The silent kind is harder.

Silent burn never trips a threshold. The hourly burn rate stays low - a few cents an hour at most. No alert ever fires. The provider sees normal traffic. Your dashboards (if you only watch totals) show a flat line. The only signal that something is wrong is the shape of the requests: same prompt, same approximate token count, same timeout, on a perfect 30-minute cadence, for weeks.

A human looking at one or two of those calls in isolation would not flag them. But:

  • No human looks at calls in isolation in production. Nobody is reading the 14,000th request of the day.
  • Daily and weekly summaries average it away. A loop that costs $0.40/day disappears into the variance of the legitimate traffic next to it.
  • The "anomaly" is the absence of variation. Standard anomaly detection looks for spikes. Silent burn is the opposite: the same boring drumbeat, forever.

The 3AM agent that burned $200 woke me up. The 4-day zombie did not, and would not have, ever.

What the infra layer can do that your code can't

The fix isn't more application-level retry logic. The fix is a layer that sees the shape of the traffic and reacts to suspicious shapes, not just suspicious totals.

Three patterns work, and they stack:

Semantic deduplication. Identify the user-intent fragment of the prompt and compare it against recent calls. If the same intent repeats suspiciously, the gateway short-circuits to the cached response, rejects the call, or raises an alert. The hard part is getting the intent isolation right: fingerprint too much of the prompt (timestamps, request IDs) and near-identical retries stop matching; fingerprint too little and unrelated requests collide into false hits. Get it right and the cache hit rate on retry-heavy workloads goes from "occasional" to "almost everything".
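
To make the idea concrete, here is a deliberately simplified sketch. It swaps real semantic matching (which would typically compare embeddings) for a much cruder stand-in: mask the volatile fragments, normalize whitespace and case, and hash what is left. The regex, the `fingerprint` function, and the `DedupCache` class are all hypothetical names for illustration, and the assumption that volatile fragments look like ISO timestamps or UUIDs will not hold for every agent.

```python
import hashlib
import re

# Assumption: the parts of a prompt that change between retries look like
# ISO timestamps or UUIDs. Real prompts may need different masking rules.
_VOLATILE = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"
    r"|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def fingerprint(prompt: str) -> str:
    """Hash the prompt after masking volatile fragments and collapsing whitespace."""
    masked = _VOLATILE.sub("<volatile>", prompt)
    normalized = " ".join(masked.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DedupCache:
    """Return the stored response for any prompt whose fingerprint was seen before."""

    def __init__(self, call_model):
        self._call_model = call_model          # function: prompt -> response
        self._responses: dict[str, str] = {}
        self.real_calls = 0

    def complete(self, prompt: str) -> str:
        key = fingerprint(prompt)
        if key not in self._responses:
            self.real_calls += 1
            self._responses[key] = self._call_model(prompt)
        return self._responses[key]


def fake_model(prompt: str) -> str:
    """Stand-in for a paid provider call."""
    return "ok"


cache = DedupCache(fake_model)
for i in range(44):
    # Same heartbeat prompt each time; only the embedded timestamp changes.
    cache.complete(f"Heartbeat at 2026-05-01T{i % 24:02d}:00:00: check pending tasks")
print(cache.real_calls)
```

This sketch has no expiry or size limit, so a production version would need both, plus a policy for what to do when the first call itself fails.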

Loop-detection alerts. Independent of the cache, fire an alert when the same fingerprint repeats in a regular cadence. On my agent, identical hashes at fixed intervals would have alerted very early - within an hour or two of starting, not 4 days in.
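
A cadence check of this kind can be sketched in a few lines. The version below flags a fingerprint once it has been seen a minimum number of times and the gaps between sightings are nearly constant, measured as standard deviation divided by mean. The class name `LoopDetector` and the thresholds (`min_repeats=4`, `max_jitter=0.05`) are assumptions chosen for illustration, not tuned values.

```python
from collections import defaultdict
from statistics import mean, pstdev


class LoopDetector:
    """Flag fingerprints that repeat on a suspiciously regular cadence."""

    def __init__(self, min_repeats: int = 4, max_jitter: float = 0.05):
        self.min_repeats = min_repeats  # sightings required before judging
        self.max_jitter = max_jitter    # allowed stdev/mean of the gaps
        self._seen: dict[str, list[float]] = defaultdict(list)

    def observe(self, fingerprint: str, timestamp: float) -> bool:
        """Record one request (timestamps in seconds, in arrival order).

        Returns True when the fingerprint looks like a timer-driven loop.
        """
        times = self._seen[fingerprint]
        times.append(timestamp)
        if len(times) < self.min_repeats:
            return False
        gaps = [b - a for a, b in zip(times, times[1:])]
        avg = mean(gaps)
        if avg <= 0:
            return False
        # Near-zero relative spread means the gaps are almost identical.
        return pstdev(gaps) / avg <= self.max_jitter


detector = LoopDetector()
# Same fingerprint every 1800 seconds, like the heartbeat in this post.
alerts = [detector.observe("heartbeat-hash", i * 1800.0) for i in range(5)]
print(alerts)
```

With these assumed thresholds the first alert fires on the fourth sighting, 90 minutes after the loop starts. The sketch keeps every timestamp forever; a real gateway would bound the window per fingerprint.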

Scheduled silent-tier audits. A weekly cron that emails you the top 10 most-repeated request fingerprints across all your keys. Loud abuse appears in the spike chart. Silent abuse appears here. Most teams never look at the bottom of the distribution because nothing is on fire there.
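
The core of such an audit is a frequency count. Here is a minimal sketch, assuming the gateway already logs one fingerprint per request; the function name `top_repeated` and the sample log are hypothetical, and the scheduling and email delivery are left out.

```python
from collections import Counter
from typing import Iterable


def top_repeated(
    fingerprints: Iterable[str], n: int = 10, min_count: int = 2
) -> list[tuple[str, int]]:
    """Return up to `n` fingerprints seen at least `min_count` times, most frequent first."""
    counts = Counter(fingerprints)
    return [(fp, count) for fp, count in counts.most_common(n) if count >= min_count]


# Hypothetical week of traffic: one heartbeat loop buried among ordinary requests.
week_log = ["heartbeat-hash"] * 44 + ["query-a", "query-b", "query-a"]
report = top_repeated(week_log)
for fp, count in report:
    print(f"{count:>5}  {fp}")
```

Sorting by repeat count rather than by cost is the design choice that matters here: a cheap loop never reaches the top of a spend report, but it does reach the top of this one.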

None of this needs to live in your application. It belongs in the infrastructure layer, in front of every request, with global visibility across keys and clients.

Closing

I closed the PowerShell window after I killed the process. The agent is paused, not deleted; he'll come back when I have a real job for him. But I added a note to the runbook: if this thing starts again, it gets a fingerprint check and an alert on every match, before it's allowed to call anything.

Loud burn buys you a $200 lesson and a phone call at 3 AM. Silent burn doesn't even buy you that. It just runs.

If you're shipping LLM agents to production in 2026, the loud kind is the one that gets blogged about. The silent kind is the one you should actually fear.


Related reading: We Watched an AI Agent Burn $200 at 3AM, Prompt Caching Hit Rate.
