Semantic cache (Scale+)
Qdrant-backed, cosine ≥ 0.92. Skip identical and near-identical requests entirely.
A large share of agent traffic is repetitive. Semantic cache embeds and hashes every prompt + system + tool schema, stores the full response, and replays the cached answer when a later request scores cosine similarity ≥ 0.92 against an existing entry. No upstream call, no tokens billed, ~20 ms total latency.
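A minimal sketch of the key material, under assumptions: `cache_key_text` and `exact_hash` are hypothetical helpers, not the real API, and the production service embeds this text with a local model rather than only hashing it. The point is that prompt, system message, and tool schema are canonicalized together before fingerprinting:

```python
import hashlib
import json

def cache_key_text(prompt, system, tool_schema):
    """Canonical text that gets embedded and hashed: prompt + system +
    tool schema. Field names here are illustrative assumptions."""
    return json.dumps(
        {"prompt": prompt, "system": system, "tools": tool_schema},
        sort_keys=True,            # stable ordering so identical requests serialize alike
        separators=(",", ":"),
    )

def exact_hash(key_text):
    """Exact-match fingerprint, the Build/Free-tier deduplication behavior."""
    return hashlib.sha256(key_text.encode("utf-8")).hexdigest()

a = cache_key_text("What is 2+2?", "You are terse.", [{"name": "calc"}])
b = cache_key_text("What is 2+2?", "You are terse.", [{"name": "calc"}])
assert exact_hash(a) == exact_hash(b)  # identical requests collide by design
```

Canonical serialization (sorted keys, fixed separators) matters: without it, two byte-different but logically identical requests would miss the exact-match path.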
Scale plan and above
Semantic cache is available on Scale, Business and Enterprise. Build and Free get exact-match deduplication through Guardian, but not the Qdrant-backed fuzzy cache.
How similarity works
- Every incoming request is embedded with a small local model (no external call).
- A Qdrant lookup finds the nearest cached entry in the workspace namespace.
- If cosine similarity ≥ 0.92 AND the tool schema / temperature / model hints match, we replay the cached response.
- Cache entries expire after 24 hours by default (configurable per workspace).
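The steps above can be sketched as a nearest-neighbor lookup followed by a metadata gate. This is a simplified stand-in, not the real implementation: `entries` plays the role of the Qdrant workspace namespace, and the `meta` dict stands in for the tool schema / temperature / model hints:

```python
import math

SIM_THRESHOLD = 0.92  # cosine floor from the docs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(a * a for a in v)))

def lookup(query_vec, query_meta, entries):
    """Replay the nearest cached response only if cosine >= 0.92 AND the
    request metadata (schema/temperature/model hints) matches exactly."""
    best, best_sim = None, -1.0
    for entry in entries:            # brute-force stand-in for a Qdrant search
        sim = cosine(query_vec, entry["vector"])
        if sim > best_sim:
            best, best_sim = entry, sim
    if best is None or best_sim < SIM_THRESHOLD:
        return None                  # cache miss: call upstream
    if best["meta"] != query_meta:   # similarity alone is not enough
        return None
    return best["response"], best_sim

hit = lookup(
    [0.99, 0.05],
    {"model": "m", "temperature": 0.0},
    [{"vector": [1.0, 0.0], "meta": {"model": "m", "temperature": 0.0}, "response": "4"}],
)
assert hit is not None  # cosine well above the 0.92 floor
```

Note the two-stage check: a near-identical prompt with a different tool schema or temperature still misses, which is what keeps replayed answers safe to use.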
What to look for in the response
```json
{
  "_hiway": {
    "cache_hit": true,
    "cache_similarity": 0.971,
    "routed_model": "cache",
    "routed_tier": "cache"
  }
}
```

Use PII masking upstream of cache
If your prompts include user-specific identifiers (email, phone, account ID), the raw embedding may leak them through similarity matches. Enable PII masking, which runs before embedding and cache hashing.
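To illustrate why masking must run first, here is a hypothetical redaction pass; the patterns and placeholder tokens are assumptions for the sketch and do not reflect the product's actual masking rules:

```python
import re

# Illustrative patterns only; real PII masking is more thorough.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
    (re.compile(r"\bacct[-_ ]?\d+\b", re.IGNORECASE), "<ACCOUNT_ID>"),
]

def mask_pii(text):
    """Run before embedding and cache hashing so raw identifiers never
    reach the vector store."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

masked = mask_pii("Reset password for jane@example.com, acct_99231")
# Identifiers are replaced with placeholders, so two users' otherwise
# identical prompts now embed to (and hit) the same cached entry.
```

A useful side effect: masking raises hit rates, because requests that differ only in the identifier collapse onto one cache entry.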