
How HiWay2LLM Tamed OpenClaw - and Its Budget Drift
The reference guide for controlling autonomous agent costs
OpenClaw is the AI agent runtime that hit 350k GitHub stars in 60 days. It's also capable of silently burning your budget while you sleep. Here are the 5 drift patterns, a real incident, and the infra layer that stops them - without touching a single line of agent code.
OpenClaw hit 350,000 GitHub stars in 60 days. Faster than React. For anyone running AI agents in production, that's the good news.
The bad news: the framework still has no native budget guardrails. No maxCostPerSession. No maxToolCallsPerMinute. No automatic kill-switch. These features have been requested for months in GitHub issues (#58826, #10614). Status: open.
This post is the guide I wish I had before learning this the hard way.
Why OpenClaw Drifts Budgetarily
A classic chatbot is one request, one response. OpenClaw is an agent that loops: it plans, executes tools, inspects results, re-plans. That's what makes it powerful. It's also the source of 5 distinct drift classes.
Pattern 1 - The O(n²) Context Explosion
This is the quietest and most devastating.
At each turn of an agent loop, OpenClaw sends the entire conversation history to the LLM. Turn 1 = 2,000 tokens. Turn 5 = 10,000 tokens. Turn 10 = 20,000 tokens. Accumulation is linear, but the cost - because each API call bills the full input tokens - becomes quadratic across the total session.
A 30-step pipeline that should cost $3 can cost $45 if history isn't pruned. OpenClaw's built-in context compression helps, but it's lossy by design: summaries generated by a lightweight model don't necessarily preserve implicit constraints or negative rules.
Real data point: a 35-message Telegram session can generate a 2.9 MB session file. With 17 active agents over 4 weeks, that's 4.7 GB of unbounded transcripts on disk, and the token equivalent in RAM for every API call.
Pattern 2 - Tool Call Loops
OpenClaw has native loop detection (tool-loop-detection). It doesn't cover all cases.
The classic pattern: a tool returns an unexpected result → the agent decides to "retry with more context" → it sends the same tool call enriched with the previous response → the tool returns another unexpected result → loop.
GitHub issue #43802 documents a real case: a sub-agent notification agent that sent confirmations every 3-5 seconds. In 3 minutes: more than 100 API calls. Cause: with delivery.mode: "none", parent-agent notifications were self-feeding.
The built-in post-compaction detection catches identical (tool, args, result) triples. It doesn't see loops where arguments vary slightly each turn.
Pattern 3 - Ghost Memory Post-Compaction
OpenClaw automatically compresses context when the window fills. That's necessary. It's also risky.
The community-documented issue: after a compaction, the agent can forget critical instructions present at the beginning of the session. Reported case: an agent that had received the instruction "don't do anything until I say go" ignored this constraint after a compaction and started executing actions autonomously.
Compaction is lossy by design. Summaries generated by a lightweight model don't necessarily preserve implicit constraints or negative rules.
Pattern 4 - Unbounded Session File Growth
GitHub issue #66360: .jsonl transcript files have no size limit. The gateway process accumulates context indefinitely.
Measured consequence: 69% CPU, 1.9 GB RAM after 13 hours of active session. Without intervention, the process eventually OOMs or exceeds model limits.
Pattern 5 - Retry Spirals
The most expensive per isolated incident.
An agent encounters an error → it retries → the retry sends the full history + the previous error result → the context grows → the next retry sends even more → and so on.
Community-documented case: a LangChain agent in retry loop = $47,000 in API charges over 11 days. The number is extreme, but the pattern is common. Each retry costs more than the previous one.
The ARES Incident
This is the real case that triggered building guardrails into HiWay2LLM.
We had an agent called ARES - a background embedding pipeline. Over a 96-hour period, Guardian (our internal monitoring system, then in beta) detected 44 timeouts with an identical fingerprint. Same request, same pattern, every few hours.
The problem: Guardian was then in-memory and silent. The 44 detections had been logged. Nobody had been alerted. We discovered the issue by accidentally stumbling on the config page a week later.
ARES was a zombie. An agent loop repeating the same embedding operation without ever advancing, silently draining budget and resources for four days.
What was missing at the infra layer:
- A persistent alert channel (email, Telegram, Slack)
- Fingerprinting with automatic action (block, not just log)
- Non-volatile monitoring storage (not an in-memory buffer)
What We Built: HiWay2LLM's Guardian Layer
The solution isn't in the agent. It's in the layer between the agent and the provider.
HiWay2LLM sits between OpenClaw and Anthropic/OpenAI. Every request passes through the proxy before reaching the upstream provider. This positioning gives visibility that neither the agent nor the provider can have alone.
BudgetPolicy - Hard Caps
Each tenant (OpenClaw workspace) gets a BudgetPolicy with configurable hard caps: a daily spending limit, a monthly limit, and a per-request ceiling. When any cap is reached, degradation is gradual - progressively shifting traffic toward lighter models before cutting off entirely if the limit is breached. This means a runaway agent never kills the entire budget in one shot.
Guardian - The 5 Active Rules
Rule 1 - Dedup
Identical fingerprint repeated beyond a configurable threshold → automatic block.
We fingerprint intention, not the entire context.
Toggleable per workspace, per tier.
Rule 2 - Cost Spike
Current hourly burn significantly exceeds the rolling average → throttle or block.
Configurable action: webhook / auto-throttle / hard-kill.
Rule 3 - Context Bloat
Input tokens measured in under a millisecond.
Progressive thresholds: dashboard warning → throttle → hard block requiring manual acknowledgment.
Rule 4 - Zombie Agent
Sustained API activity during off-hours with no human interaction signal → block.
Opt-in per key: legitimate batch jobs are whitelisted.
Rule 5 - Passthrough Loop
Identical fingerprint on the direct provider proxy path, repeated beyond a configurable window → block.
This is the rule that would have stopped ARES.
Persistent Alerts - 3 Channels
After the ARES incident, we ditched the in-memory buffer. Guardian interventions are now persisted in a non-volatile database table with timestamp, rule, workspace, and action.
Three alert channels wired:
| Channel | Throttle | What it sends |
|---|---|---|
| 1 alert/hour/workspace | Rule triggered, estimated cost avoided, dashboard link | |
| Telegram | Per-user pairing | Immediate push with context |
| Slack | Configurable webhook | Mention + detail block, filterable by rule |
First month after full deployment:
- 3 health-check loops blocked (estimated cost avoided: $340)
- 12 context bloat events beyond 150k tokens
- 1 zombie agent on a dev env, Saturday at 2am
Intelligent Routing for OpenClaw Agents
Budget control alone isn't enough. The other half of the problem: OpenClaw sends everything to the same model regardless of the actual task complexity.
Two OpenClaw agent configurations run on our infrastructure with HiWay2LLM as the backend:
ONYX (budget-first): routes toward lighter models by default. Haiku handles the bulk of traffic; Sonnet takes medium complexity; Opus is reserved for the most demanding tasks.
NYX (quality-first): never uses the lightest models. Sonnet is the default; Opus is activated as soon as task complexity exceeds a threshold.
The routing score is calculated across several independent axes in under a millisecond - request complexity, presence of tool calls, system prompt weight, and reasoning signals among others.
The configuration is a single line change in OpenClaw:
# Before
client = OpenAI(base_url="https://api.anthropic.com/v1", api_key="sk-ant-...")
# After - everything else is identical
client = OpenAI(base_url="https://api.hiway2llm.com/v1", api_key="sk-your-hiway-key")
Response headers pass back the routing decision - tier assigned, reason, and estimated cost - for observability.
What OpenClaw's Native Solutions Cover (and Don't)
To be honest about the framework's current state:
Natively implemented:
- Session pruning (in-memory, doesn't modify the disk transcript)
- Post-compaction guard (detect identical tool/args/result triples)
- Tool loop detection (rolling history of tool calls)
- Aggressive trim:
toolResultmessages can be replaced with[Old tool result content cleared]
NOT natively implemented (as of May 17, 2026):
- Token budgets per session
- Request rate limits per agent
- Budget spending limits with automatic kill-switch
- Real-time multi-channel alerts
- Zombie detection
GitHub issues requesting these features have been open for months. The community uses workarounds: manual monitoring, cron scripts, or third-party solutions. The good news: a proxy layer like HiWay2LLM fills exactly this gap without patching the agent.
The ROI of Budget Control
Some concrete numbers on combined cost reduction (Guardian + routing):
- Context bloat prevention: by intercepting requests that exceed input token thresholds, we avoid the most expensive calls. With 12 blocked events in a month, that's ~$27 saved on that pattern alone.
- Zombie prevention: the ARES incident in simulated replay = $8.10 instead of unbounded budget over 96h.
- Intelligent routing: shifting 60% of agent traffic from Opus to Haiku/Sonnet = 70% reduction in average cost per call on medium-complexity tasks.
The internal audit (SMART_ROUTER_AUDIT, 2026-04-22) summarizes: all P0 and P1 resolved, 69 tests in CI gate, overall score 9.0/10.
How to Connect OpenClaw to HiWay2LLM
If you're running OpenClaw with Claude as the backend, it's a one-line modification:
from anthropic import Anthropic
client = Anthropic(
base_url="https://api.hiway2llm.com",
api_key="sk-hiway-YOUR_KEY",
)
If you're using the OpenAI-compatible SDK:
from openai import OpenAI
client = OpenAI(
base_url="https://api.hiway2llm.com/v1",
api_key="sk-hiway-YOUR_KEY",
)
Then in the HiWay2LLM dashboard, configure:
- The
BudgetPolicyfor your workspace (daily, monthly, per-request caps) - Which Guardian rules to activate (all by default, refinable per workspace)
- Alert channels (email + optionally Telegram or Slack)
- Routing profile (ONYX-like or NYX-like depending on your cost/quality priority)
First Guardian alert usually within 24 hours if you have agents running. Not because your code is broken - because everyone has borderline patterns that nobody was seeing until now.
No agent code changes required
Related reading: We Watched an AI Agent Burn $200 at 3am, The Hidden Math of LLM Pricing, Agent-Aware LLM Routing.
Was this useful?
Comments
Be the first to comment.