We Watched an AI Agent Burn $200 at 3AM
Here's Guardian
A RAG agent stuck in a retry loop, a context window ballooning past 200K tokens, and the moment we realized no LLM provider alerts you in time. Here's the anti-loop system we built to stop this from ever happening again.
My phone buzzed at 3:14 AM. It was the Anthropic billing dashboard. I'd set a threshold alert at $100/day just to feel responsible. The alert said I'd just crossed it.
I was already at $143. By the time I opened my laptop, I was at $167. By the time I killed the process, I was at $201.
One agent. Three hours. Two hundred dollars.
Here's what happened, why no existing LLM safety net caught it, and what we built so it never happens again, for us or for anyone using HiWay2LLM.
The Timeline
2:47 AM. A scheduled job fires on our internal research agent. It's a RAG pipeline that reads a batch of documents, extracts insights, and writes a summary to a database. Normal nightly run. Expected cost: about $3.
2:49 AM. The first document it loads is malformed: an 800-page PDF that OCR'd into mostly garbage, with one specific character pattern that sent the agent's reasoning into a loop. The agent decides it needs to "re-read the document" to understand it.
2:52 AM. It's re-reading the document. Every re-read ships the entire 800-page context to Claude Opus. Each call: ~180,000 tokens. At $15/M input, that's $2.70 per call. The agent is now calling once every 40 seconds.
3:01 AM. The agent hits a tool use error. Its tool result comes back empty. The agent "decides" to retry. The retry ships the full context again. Plus the previous tool result. Plus the new tool result. Context is now at 210,000 tokens per call.
3:14 AM. My phone buzzes. I've spent $143 of my own money on an agent re-reading garbage OCR output in a loop. The Anthropic alert email is, helpfully, 9 minutes old by the time I see it.
3:16 AM. I SSH in and kill the process manually. Final damage: $201.
Three hours of sleep, gone. Two hundred bucks, gone. And the scariest part? If my phone had been on silent, this would have run until 9 AM when I woke up. At that rate, that's another $600.
The Pattern Underneath
This is not a one-off. Once you work on LLM apps at any scale, you see the same class of problem over and over:
- A function-calling agent loops on a tool error because it "doesn't understand" why the tool failed.
- A chat application's context keeps growing because nobody capped conversation length.
- A batch job retries on a transient 429, but each retry is now fatter than the last.
- A development instance is left running overnight on a shared API key, with no one watching.
- A health check (the one that burned us $40/day in our very first post) fires every 30 minutes against Opus.
The common thread is not "the code had a bug." The common thread is that no layer between your code and the provider is looking for suspicious patterns in real time. The provider sees individual API calls. Your application sees individual agent loops. Nobody sees the whole picture.
Why Existing Solutions Don't Work
You'd think provider-side billing alerts would handle this. They don't, and here's why:
They're threshold-based, not rate-based. Anthropic sent me an email when I crossed $100. Fine. But by the time I read the email, I was at $167. At a $50/hour burn rate, thresholds don't help; you need rate alerts, and providers don't offer them.
They're email-only. You can't wire a billing alert to a kill-switch. At best, they notify a human. If the human isn't awake, the burn continues.
They're coarse-grained. You get one threshold per account. You can't say "alert me if any single conversation exceeds $5" or "kill any request with context over 180K tokens."
They're post-hoc. The billing system aggregates usage in 15-minute windows, not in real time. By the time it decides you've crossed a threshold, you're already 15 minutes of burn past it.
So if you want to catch runaway spend in real time, you have to build it yourself. Or put it in the infrastructure layer.
What Guardian Does
Guardian is the system we built after that 3AM incident. It sits in the HiWay2LLM proxy and inspects every single request before it hits the upstream provider. It looks for four specific categories of trouble:
1. Loop Detection via Request Fingerprinting. Every request gets a fingerprint: a hash of the user message, system prompt, and conversation tail (the last 3 turns). If the same fingerprint shows up more than N times within a window, we block it.
In our 3AM incident, the agent was sending the same "re-read the document and extract insights" prompt with a slightly longer context each time. The user message was identical. Fingerprinting catches it on call #3.
Toggleable. Default: 5 identical fingerprints in 5 minutes = block.
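The counting side can be sketched as a small sliding-window tracker. This is illustrative, not Guardian's actual internals; the class and field names are assumptions, and only the defaults (5 repeats, 5 minutes) come from the post:

```typescript
// Sliding-window loop detector: block once the same fingerprint
// has been seen `maxRepeats` times inside `windowMs`.
class LoopDetector {
  private seen = new Map<string, number[]>(); // fingerprint -> timestamps (ms)

  constructor(
    private maxRepeats = 5,        // identical fingerprints allowed...
    private windowMs = 5 * 60_000, // ...within this window (5 minutes)
  ) {}

  shouldBlock(fingerprint: string, now = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Drop timestamps that have aged out of the window, then record this call
    const recent = (this.seen.get(fingerprint) ?? []).filter(t => t > cutoff);
    recent.push(now);
    this.seen.set(fingerprint, recent);
    return recent.length >= this.maxRepeats;
  }
}
```

State here is per-process; in a real proxy this map would live in shared storage (e.g. Redis) so every proxy instance sees the same counts.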
2. Context Size Throttling. We count tokens on the way in (using the upstream provider's tokenizer so we're accurate). Three thresholds:
- At 50K tokens: log a warning to the user's dashboard.
- At 100K tokens: throttle, require exponential backoff between calls.
- At 200K tokens: hard block, require manual acknowledgment to proceed.
Our agent was sending 210K-token contexts. Guardian would have blocked at 200K, before the first expensive call ever reached Anthropic.
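The tier check itself is a few lines once you have an input token count. The thresholds below are the ones from the list above; the function shape and action names are illustrative, not Guardian's API:

```typescript
type ContextAction = "allow" | "warn" | "throttle" | "block";

// Map an input token count to the three-tier policy described above.
function contextAction(inputTokens: number): ContextAction {
  if (inputTokens >= 200_000) return "block";    // hard block, manual ack to proceed
  if (inputTokens >= 100_000) return "throttle"; // require exponential backoff
  if (inputTokens >= 50_000) return "warn";      // log a dashboard warning
  return "allow";
}
```

The hard part is upstream of this function: counting tokens accurately per provider before the request leaves the proxy.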
3. Zombie Agent Detection. If an API key is firing requests outside configured business hours, with no human-interaction signal (no web UI session tokens, no interactive headers), Guardian flags it. You opt in per key: "this key is for nightly batch, it's allowed to fire at 3AM" or "this key is for my chatbot, if it's firing at 3AM something's wrong."
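A minimal sketch of that per-key check. The policy field names are assumptions, and this version ignores windows that wrap past midnight:

```typescript
interface KeyPolicy {
  allowedHours: [number, number]; // e.g. [8, 20] for business hours
  batchAllowed: boolean;          // nightly batch keys opt out of the check
}

// A request looks "zombie" if the key isn't a batch key, there's no
// human-interaction signal, and the call falls outside allowed hours.
function isZombie(policy: KeyPolicy, hourOfDay: number, hasHumanSignal: boolean): boolean {
  if (policy.batchAllowed || hasHumanSignal) return false;
  const [start, end] = policy.allowedHours;
  return hourOfDay < start || hourOfDay >= end;
}
```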
4. Cost Spike Alerting. We track rolling hourly burn rate per key. If the current hour is 3× the trailing 24-hour average, Guardian can:
- Send a webhook (Slack, PagerDuty, your own endpoint).
- Auto-throttle the key to N requests per minute.
- Hard-kill all further requests until you re-enable.
All three actions are independently toggleable. You pick how paranoid you want to be.
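The spike condition is simple arithmetic once you track spend per key. This sketch assumes you already have the current hour's spend and the trailing 24-hour total; function and parameter names are illustrative:

```typescript
// Flag when the current hour's burn is `multiplier` times the
// trailing 24-hour hourly average (default 3x, as described above).
function isCostSpike(
  currentHourSpend: number,
  trailing24hSpend: number,
  multiplier = 3,
): boolean {
  const hourlyAverage = trailing24hSpend / 24;
  return currentHourSpend >= multiplier * hourlyAverage;
}
```

A quiet key is the tricky case: with a near-zero trailing average, almost any activity trips the 3× rule, so a production version would also want an absolute floor before alerting.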
A Peek at the Fingerprinting
The hash function is simpler than you'd expect. Here's roughly what we do:
```typescript
import { createHash } from "crypto";

interface Message { role: string; content: string; }
interface LLMRequest { model: string; messages: Message[]; }

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

function fingerprintRequest(req: LLMRequest): string {
  // The conversation tail: the last three turns carry the current "intent"
  const lastThreeTurns = req.messages.slice(-3);
  const systemPrompt = req.messages.find(m => m.role === "system")?.content ?? "";
  const normalized = {
    // Trim and slice to fixed lengths so incidental noise
    // (whitespace, ever-growing context) doesn't change the hash
    system: systemPrompt.trim().slice(0, 500),
    turns: lastThreeTurns.map(m => ({
      role: m.role,
      content: m.content.trim().slice(0, 1000),
    })),
    model: req.model,
  };
  return sha256(JSON.stringify(normalized)).slice(0, 16);
}
```
Two things that matter:
- We don't fingerprint the full context. An agent's conversation accumulates tokens over time, so a full-context fingerprint would never match twice. We fingerprint the intent (last user turn + system + tail), which is stable even when the conversation grows.
- We normalize aggressively. Trim whitespace, slice to fixed length. Otherwise a timestamp or session ID would invalidate the fingerprint and we'd miss the loop.
This is a 40-line function. That's the whole thing. The work was figuring out which parts to fingerprint, not the crypto.
Why It's Best Handled at the Infra Layer
Protection like this only works reliably when it lives in front of every request, with visibility into the whole traffic shape — not stitched into each app's retry path. A shared router is the natural place for it:
- Deduplication needs global state across every client.
- Burn-rate math only makes sense when you see all the spend.
- Kill-switches have to interrupt in-flight calls, not just throttle the next one.
- Thresholds evolve as model prices shift; you don't want that tuning scattered across apps.
You can absolutely build a narrow version inside a single app. Teams do, and it works until the traffic shape changes. Moving the concern one layer down means every app on HiWay2LLM inherits it without writing a line.
The Fix That Stuck
After deploying Guardian internally, we replayed the 3AM incident against the new proxy in staging. Guardian blocked call #3. Total damage in the simulated replay: $8.10 instead of $201.
We haven't had a runaway agent incident since. Not one. In the first month, Guardian blocked:
- 3 health-check loops (total would-be cost: $340)
- 12 context bloat events over 150K tokens
- 1 zombie dev-env agent that started retrying at 2 AM on a Saturday
That's real money saved on real patterns, in one month, inside one company. The moment you have it running, you realize every team should have this. No provider offers it. No reseller SaaS offers it. You can either build it or have it.
If you're running LLM calls in production, you need this layer. Not eventually, now. The 3AM call you don't want to get is the 3AM call you didn't know was possible.
Related reading: How We Cut Our LLM Costs by 85%, The Hidden Math of LLM Pricing.