Semantic cache
Skip identical and near-identical requests entirely - zero tokens, instant replies.
A large share of agent traffic is repetitive. The semantic cache recognises when a new request is effectively the same as a recent one and replays the stored response. No upstream call, no tokens consumed, ~20 ms total latency.
Included on Scale and Enterprise
Semantic cache is available on Scale and Enterprise plans. The Free plan does not include it.
How similarity works
- Every incoming request is fingerprinted locally (no external call).
- We look up the nearest stored entry within your workspace namespace.
- If the request is close enough to that entry and the request parameters match, we replay the cached response.
- Cache entries expire after 24 hours by default (configurable per workspace).
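The steps above can be sketched in a few lines of Python. This is an illustrative model only, not the service's implementation: the hashed bag-of-words `fingerprint`, the `0.95` threshold, the `SemanticCache` class, and its method names are all assumptions made for the example. Only the per-workspace lookup, the parameter match, and the 24-hour default expiry come from the description above.

```python
import hashlib
import math
import time
from dataclasses import dataclass

DIM = 256                 # hypothetical fingerprint size
THRESHOLD = 0.95          # hypothetical similarity cut-off (not documented)
TTL_SECONDS = 24 * 3600   # default expiry described above


def fingerprint(text: str) -> list[float]:
    """Toy local fingerprint: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two unit vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Entry:
    vector: list[float]
    params: tuple
    response: str
    stored_at: float


class SemanticCache:
    def __init__(self, threshold: float = THRESHOLD, ttl: float = TTL_SECONDS):
        self.threshold = threshold
        self.ttl = ttl
        self._by_workspace: dict[str, list[Entry]] = {}

    def put(self, workspace, prompt, params, response, now=None):
        now = time.time() if now is None else now
        entry = Entry(fingerprint(prompt), tuple(sorted(params.items())), response, now)
        self._by_workspace.setdefault(workspace, []).append(entry)

    def get(self, workspace, prompt, params, now=None):
        now = time.time() if now is None else now
        # Drop expired entries, then search only this workspace's namespace.
        live = [e for e in self._by_workspace.get(workspace, [])
                if now - e.stored_at < self.ttl]
        self._by_workspace[workspace] = live
        if not live:
            return None
        query = fingerprint(prompt)
        best = max(live, key=lambda e: cosine(query, e.vector))
        score = cosine(query, best.vector)
        # Replay only if the nearest entry is close enough AND parameters match.
        if score >= self.threshold and best.params == tuple(sorted(params.items())):
            return {"cache_hit": True,
                    "cache_similarity": round(score, 3),
                    "response": best.response}
        return None
```

In this sketch a miss returns `None`, at which point a caller would forward the request upstream and `put` the result. A linear scan is used for brevity; a real deployment would more likely use an approximate nearest-neighbour index.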
What to look for in the response
json
{
  "_hiway": {
    "cache_hit": true,
    "cache_similarity": 0.971,
    "routed_model": "cache",
    "routed_tier": "cache"
  }
}
Use PII masking upstream of cache
If your prompts include user-specific identifiers (email, phone, account ID), similarity matching on the raw text can leak them: a near-identical request may be served a response that was cached for a different user. Enable PII masking, which runs before embedding and cache hashing.
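To illustrate why masking has to happen before fingerprinting, here is a minimal sketch of a masking step. The regular expressions, the placeholder tokens, and the `ACC-` account ID format are hypothetical and chosen only for the example; the product's actual masking rules are not described here.

```python
import re

# Hypothetical patterns and placeholders; real masking rules will differ.
PATTERNS = [
    ("<EMAIL>", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("<ACCOUNT_ID>", re.compile(r"\bACC-\d+\b")),   # assumed ID format
    ("<PHONE>", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def mask_pii(text: str) -> str:
    """Replace identifiers with fixed placeholders before any embedding or hashing."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


masked_a = mask_pii("Reset the password for alice@example.com")
masked_b = mask_pii("Reset the password for bob@example.org")
# Both prompts reduce to the same masked text, so no identifier
# ends up in the fingerprint or in the cache key.
assert masked_a == masked_b == "Reset the password for <EMAIL>"
```

Because the placeholder replaces the identifier itself, whatever is embedded or hashed afterwards carries no user-specific value.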