Semantic cache (Scale+)

Qdrant-backed, cosine ≥ 0.92. Skip identical and near-identical requests entirely.

A large share of agent traffic is repetitive. The semantic cache embeds each request (prompt, system message, and tool schema), stores the full response, and replays the cached answer when a later request lands within cosine similarity ≥ 0.92 of an existing entry. No upstream call, no tokens billed, ~20 ms total latency.

Scale plan and above

Semantic cache is available on Scale, Business, and Enterprise. Build and Free get exact-match deduplication through Guardian, but not the Qdrant-backed fuzzy cache.

How similarity works

  • Every incoming request is embedded with a small local model (no external call).
  • A Qdrant lookup finds the nearest cached entry in the workspace namespace.
  • If cosine similarity ≥ 0.92 AND the tool schema / temperature / model hints match, we replay the cached response.
  • Cache entries expire after 24 hours by default (configurable per workspace).
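The replay decision described above can be sketched roughly as follows. Everything here is illustrative, not the actual implementation: `CacheEntry`, `should_replay`, and the field names are assumptions, and the real lookup happens inside Qdrant rather than in client code.

```python
from dataclasses import dataclass, field

THRESHOLD = 0.92  # cosine similarity floor from the docs


@dataclass
class CacheEntry:
    """Hypothetical shape of a cached entry in the workspace namespace."""
    embedding: list[float]
    tool_schema_hash: str
    temperature: float
    model_hint: str
    response: dict = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)


def should_replay(request_emb: list[float], schema_hash: str,
                  temperature: float, model_hint: str,
                  entry: CacheEntry) -> bool:
    # Similarity alone is not enough: the tool schema, temperature,
    # and model hints must also match before the response is replayed.
    if entry.tool_schema_hash != schema_hash:
        return False
    if entry.temperature != temperature or entry.model_hint != model_hint:
        return False
    return cosine(request_emb, entry.embedding) >= THRESHOLD
```

The important design point is the AND: a near-identical prompt with a different tool schema or temperature must not reuse the old answer, so the metadata check runs before the similarity check.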

What to look for in the response

```json
{
  "_hiway": {
    "cache_hit":        true,
    "cache_similarity": 0.971,
    "routed_model":     "cache",
    "routed_tier":      "cache"
  }
}
```
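Client code can branch on this metadata, for example to log hit rates. The helper below is a sketch that assumes the `_hiway` block has exactly the shape shown above:

```python
def is_cache_hit(response: dict) -> bool:
    """Return True when the response was replayed from the semantic cache."""
    meta = response.get("_hiway", {})
    return bool(meta.get("cache_hit")) and meta.get("routed_tier") == "cache"


resp = {
    "_hiway": {
        "cache_hit": True,
        "cache_similarity": 0.971,
        "routed_model": "cache",
        "routed_tier": "cache",
    }
}
hit = is_cache_hit(resp)
```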

Use PII masking upstream of the cache

If your prompts include user-specific identifiers (email, phone, account ID), the raw embedding may leak them through similarity. Enable PII masking — it runs before embedding and cache hashing.
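As a rough illustration of what masking before embedding looks like, here is a minimal sketch. The regex rules are hypothetical stand-ins; Guardian's actual masker covers far more identifier types.

```python
import re

# Hypothetical masking rules applied before embedding and cache hashing.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]


def mask_pii(prompt: str) -> str:
    """Replace user-specific identifiers with stable placeholder tokens."""
    for pattern, token in PATTERNS:
        prompt = pattern.sub(token, prompt)
    return prompt
```

Because the placeholders are stable tokens, two prompts that differ only in the user's email still embed to nearly identical vectors, which keeps the cache effective without storing the identifiers themselves.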