We Watched an AI Agent Burn $200 at 3AM
Here's Guardian
A RAG agent stuck in a retry loop, a context window ballooning past 200K tokens, and the moment we realized no LLM provider alerts you in time. Here's the anti-loop system we built to stop this from ever happening again.
My phone buzzed at 3:14 AM. It was the Anthropic billing dashboard. I'd set a threshold alert at $100/day just to feel responsible. The alert said I'd just crossed it.
I was already at $143. By the time I opened my laptop, I was at $167. By the time I killed the process, I was at $201.
One agent. Three hours. Two hundred dollars.
Here's what happened, why no existing LLM safety net caught it, and what we built so it never happens again, for us or for anyone using HiWay2LLM.
The Timeline
2:47 AM. A scheduled job fires on our internal research agent. It's a RAG pipeline that reads a batch of documents, extracts insights, and writes a summary to a database. Normal nightly run. Expected cost: about $3.
2:49 AM. The first document it loads is malformed: an 800-page PDF that OCR'd into mostly garbage, with one specific character pattern that sent the agent's reasoning into a loop. The agent decides it needs to "re-read the document" to understand it.
2:52 AM. It's re-reading the document. Every re-read ships the entire 800-page context to Claude Opus. Each call: ~180,000 tokens. At $15/M input, that's $2.70 per call. The agent is now calling once every 40 seconds.
3:01 AM. The agent hits a tool use error. Its tool result comes back empty. The agent "decides" to retry. The retry ships the full context again. Plus the previous tool result. Plus the new tool result. Context is now at 210,000 tokens per call.
3:14 AM. My phone buzzes. I've spent $143 of my own money on an agent re-reading garbage OCR output in a loop. The Anthropic alert email is, helpfully, 9 minutes old by the time I see it.
3:16 AM. I SSH in and kill the process manually. Final damage: $201.
Three hours of sleep, gone. Two hundred bucks, gone. And the scariest part? If my phone had been on silent, this would have run until 9 AM when I woke up. At that rate, that's another $600.
The Pattern Underneath
This is not a one-off. Once you work on LLM apps at any scale, you see the same class of problem over and over:
- A function-calling agent loops on a tool error because it "doesn't understand" why the tool failed.
- A chat application's context keeps growing because nobody capped conversation length.
- A batch job retries on a transient 429, but each retry is now fatter than the last.
- A development instance is left running overnight on a shared API key, with no one watching.
- A health check (the one that burned us $40/day in our very first post) fires every 30 minutes against Opus.
The common thread is not "the code had a bug." The common thread is that no layer between your code and the provider is looking for suspicious patterns in real time. The provider sees individual API calls. Your application sees individual agent loops. Nobody sees the whole picture.
Why Existing Solutions Don't Work
You'd think provider-side billing alerts would handle this. They don't, and here's why:
They're threshold-based, not rate-based. Anthropic sent me an email when I crossed $100. Fine. But by the time I read the email, I was at $167. At a $50/hour burn rate, thresholds don't help; you need rate alerts, and providers don't offer them.
They're email-only. You can't wire a billing alert to a kill-switch. At best, they notify a human. If the human isn't awake, the burn continues.
They're coarse-grained. You get one threshold per account. You can't say "alert me if any single conversation exceeds $5" or "kill any request with context over 180K tokens."
They're post-hoc. The billing system aggregates usage in 15-minute windows, not in real time. By the time it decides you've crossed a threshold, you're already 15 minutes of burn past it.
So if you want to catch runaway spend in real time, you have to build it yourself. Or put it in the infrastructure layer.
What Guardian Does
Guardian is the system we built after that 3AM incident. It sits in the HiWay2LLM proxy and inspects every single request before it hits the upstream provider. It looks for four specific categories of trouble:
1. Loop Detection via Request Fingerprinting. Every request gets a fingerprint: a hash of the user message, system prompt, and conversation tail (the last 3 turns). If the same fingerprint shows up more than N times within a window, we block it.
In our 3AM incident, the agent was sending the same "re-read the document and extract insights" prompt with a slightly longer context each time. The user message was identical. Fingerprinting catches it on call #3.
Toggleable. Default: 5 identical fingerprints in 5 minutes = block.
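The counting side can be sketched as a small sliding-window tracker. This is illustrative, not Guardian's actual internals; the class and field names are assumptions, and only the defaults (5 repeats, 5 minutes) come from the post:

```typescript
// Sliding-window loop detector: block once the same fingerprint
// has been seen `maxRepeats` times inside `windowMs`.
class LoopDetector {
  private seen = new Map<string, number[]>(); // fingerprint -> timestamps (ms)

  constructor(
    private maxRepeats = 5,        // identical fingerprints allowed...
    private windowMs = 5 * 60_000, // ...within this window (5 minutes)
  ) {}

  shouldBlock(fingerprint: string, now = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Drop timestamps that have aged out of the window, then record this call
    const recent = (this.seen.get(fingerprint) ?? []).filter(t => t > cutoff);
    recent.push(now);
    this.seen.set(fingerprint, recent);
    return recent.length >= this.maxRepeats;
  }
}
```

State here is per-process; in a real proxy this map would live in shared storage (e.g. Redis) so every proxy instance sees the same counts.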
2. Context Size Throttling. We count tokens on the way in (using the upstream provider's tokenizer so we're accurate). Three thresholds:
- At 50K tokens: log a warning to the user's dashboard.
- At 100K tokens: throttle, require exponential backoff between calls.
- At 200K tokens: hard block, require manual acknowledgment to proceed.
Our agent was sending 210K-token contexts. Guardian would have blocked at 200K, before the first expensive call ever reached Anthropic.
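The tier check itself is a few lines once you have an input token count. The thresholds below are the ones from the list above; the function shape and action names are illustrative, not Guardian's API:

```typescript
type ContextAction = "allow" | "warn" | "throttle" | "block";

// Map an input token count to the three-tier policy described above.
function contextAction(inputTokens: number): ContextAction {
  if (inputTokens >= 200_000) return "block";    // hard block, manual ack to proceed
  if (inputTokens >= 100_000) return "throttle"; // require exponential backoff
  if (inputTokens >= 50_000) return "warn";      // log a dashboard warning
  return "allow";
}
```

The hard part is upstream of this function: counting tokens accurately per provider before the request leaves the proxy.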
3. Zombie Agent Detection. If an API key is firing requests outside configured business hours, with no human-interaction signal (no web UI session tokens, no interactive headers), Guardian flags it. You opt in per key: "this key is for nightly batch, it's allowed to fire at 3AM" or "this key is for my chatbot, if it's firing at 3AM something's wrong."
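A minimal sketch of that per-key check. The policy field names are assumptions, and this version ignores windows that wrap past midnight:

```typescript
interface KeyPolicy {
  allowedHours: [number, number]; // e.g. [8, 20] for business hours
  batchAllowed: boolean;          // nightly batch keys opt out of the check
}

// A request looks "zombie" if the key isn't a batch key, there's no
// human-interaction signal, and the call falls outside allowed hours.
function isZombie(policy: KeyPolicy, hourOfDay: number, hasHumanSignal: boolean): boolean {
  if (policy.batchAllowed || hasHumanSignal) return false;
  const [start, end] = policy.allowedHours;
  return hourOfDay < start || hourOfDay >= end;
}
```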
4. Cost Spike Alerting. We track rolling hourly burn rate per key. If the current hour is 3× the trailing 24-hour average, Guardian can:
- Send a webhook (Slack, PagerDuty, your own endpoint).
- Auto-throttle the key to N requests per minute.
- Hard-kill all further requests until you re-enable.
All three actions are independently toggleable. You pick how paranoid you want to be.
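The spike condition is simple arithmetic once you track spend per key. This sketch assumes you already have the current hour's spend and the trailing 24-hour total; function and parameter names are illustrative:

```typescript
// Flag when the current hour's burn is `multiplier` times the
// trailing 24-hour hourly average (default 3x, as described above).
function isCostSpike(
  currentHourSpend: number,
  trailing24hSpend: number,
  multiplier = 3,
): boolean {
  const hourlyAverage = trailing24hSpend / 24;
  return currentHourSpend >= multiplier * hourlyAverage;
}
```

A quiet key is the tricky case: with a near-zero trailing average, almost any activity trips the 3× rule, so a production version would also want an absolute floor before alerting.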
A Peek at the Fingerprinting
The hash function is simpler than you'd expect. Here's roughly what we do:
```typescript
import { createHash } from "crypto";

interface Message { role: string; content: string; }
interface LLMRequest { model: string; messages: Message[]; }

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

function fingerprintRequest(req: LLMRequest): string {
  // The conversation tail: the last three turns carry the current "intent"
  const lastThreeTurns = req.messages.slice(-3);
  const systemPrompt = req.messages.find(m => m.role === "system")?.content ?? "";
  const normalized = {
    // Trim and slice to fixed lengths so incidental noise
    // (whitespace, ever-growing context) doesn't change the hash
    system: systemPrompt.trim().slice(0, 500),
    turns: lastThreeTurns.map(m => ({
      role: m.role,
      content: m.content.trim().slice(0, 1000),
    })),
    model: req.model,
  };
  return sha256(JSON.stringify(normalized)).slice(0, 16);
}
```
Two things that matter:
- We don't fingerprint the full context. An agent's conversation accumulates tokens over time, so a full-context fingerprint would never match twice. We fingerprint the intent (last user turn + system + tail), which is stable even when the conversation grows.
- We normalize aggressively. Trim whitespace, slice to fixed length. Otherwise a timestamp or session ID would invalidate the fingerprint and we'd miss the loop.
This is a 40-line function. That's the whole thing. The work was figuring out which parts to fingerprint, not the crypto.
Why It's Best Handled at the Infra Layer
Protection like this only works reliably when it lives in front of every request, with visibility into the whole traffic shape — not stitched into each app's retry path. A shared router is the natural place for it:
- Deduplication needs global state across every client.
- Burn-rate math only makes sense when you see all the spend.
- Kill-switches have to interrupt in-flight calls, not just throttle the next one.
- Thresholds evolve as model prices shift; you don't want that tuning scattered across apps.
You can absolutely build a narrow version inside a single app. Teams do, and it works until the traffic shape changes. Moving the concern one layer down means every app on HiWay2LLM inherits it without writing a line.
The Fix That Stuck
After deploying Guardian internally, we replayed the 3AM incident against the new proxy in staging. Guardian blocked call #3. Total damage in the simulated replay: $8.10 instead of $201.
We haven't had a runaway agent incident since. Not one. In the first month, Guardian blocked:
- 3 health-check loops (total would-be cost: $340)
- 12 context bloat events over 150K tokens
- 1 zombie dev-env agent that started retrying at 2 AM on a Saturday
That's real money saved on real patterns, in one month, inside one company. The moment you have it running, you realize every team should have this. No provider offers it. No reseller SaaS offers it. You can either build it or have it.
If you're running LLM calls in production, you need this layer. Not eventually, now. The 3AM call you don't want to get is the 3AM call you didn't know was possible.
Related reading: How We Cut Our LLM Costs by 85%, The Hidden Math of LLM Pricing.