Semantic cache

Qdrant-backed, cosine ≥ 0.92. Skip identical and near-identical requests entirely.

A large share of agent traffic is repetitive. Semantic cache hashes the embedding of every request (prompt, system message, and tool schema), stores the full response, and replays the cached answer when a later request lands within cosine similarity ≥ 0.92 of an existing entry. A cache hit makes no upstream call, consumes no tokens, and returns in about 20 ms total.

Included on Scale and Enterprise

Semantic cache is available on Scale and Enterprise plans. The Free plan does not include it.

How similarity works

  • Every incoming request is embedded with a small local model (no external call).
  • A Qdrant lookup finds the nearest cached entry in the workspace namespace.
  • If cosine similarity is ≥ 0.92 AND the tool schema, temperature, and model hints all match, the cached response is replayed.
  • Cache entries expire after 24 hours by default (configurable per workspace).
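
The decision rule in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: an in-memory list stands in for the Qdrant collection, and the names `CacheEntry`, `cosine_similarity`, and `lookup` are hypothetical. Only the 0.92 threshold, the metadata match, and the 24-hour default expiry come from the text.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.92        # threshold stated in the docs
DEFAULT_TTL_SECONDS = 24 * 3600    # 24-hour default expiry


@dataclass
class CacheEntry:
    """Hypothetical cache record; field names are assumptions."""
    embedding: Sequence[float]
    response: str
    tool_schema: str
    temperature: float
    model_hint: str
    created_at: float = field(default_factory=time.time)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(
    entries: Sequence[CacheEntry],
    embedding: Sequence[float],
    tool_schema: str,
    temperature: float,
    model_hint: str,
    now: Optional[float] = None,
    ttl: float = DEFAULT_TTL_SECONDS,
) -> Optional[Tuple[CacheEntry, float]]:
    """Return (entry, similarity) on a cache hit, otherwise None."""
    now = time.time() if now is None else now
    best, best_score = None, -1.0
    # Linear scan stands in for the nearest-neighbour query against Qdrant.
    for entry in entries:
        if now - entry.created_at > ttl:
            continue  # expired entries are never replayed
        score = cosine_similarity(embedding, entry.embedding)
        if score > best_score:
            best, best_score = entry, score
    if best is None or best_score < SIMILARITY_THRESHOLD:
        return None
    # The nearest entry must also match on schema, temperature, and model hint.
    if (best.tool_schema, best.temperature, best.model_hint) != (
        tool_schema, temperature, model_hint
    ):
        return None
    return best, best_score
```

In this sketch the metadata check runs only against the nearest entry, mirroring the order of the steps above; a real deployment might instead filter on metadata inside the vector query.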

What to look for in the response

json
{
  "_hiway": {
    "cache_hit":        true,
    "cache_similarity": 0.971,
    "routed_model":     "cache",
    "routed_tier":      "cache"
  }
}
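
A client can read these fields to track its hit rate. The sketch below assumes the `_hiway` object sits at the top level of the JSON response body, as in the sample above; the helper name `cache_info` is hypothetical.

```python
import json


def cache_info(response_body: str) -> dict:
    """Extract cache metadata from a raw JSON response body.

    Assumes the `_hiway` object is a top-level key, as in the sample above.
    Missing metadata is treated as a cache miss.
    """
    payload = json.loads(response_body)
    meta = payload.get("_hiway") or {}
    return {
        "hit": bool(meta.get("cache_hit", False)),
        "similarity": meta.get("cache_similarity"),
        "routed_model": meta.get("routed_model"),
    }
```

Treating a missing `_hiway` block as a miss keeps the helper safe to call on responses that were not served through the cache.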

Use PII masking upstream of cache

If your prompts include user-specific identifiers (email, phone, account ID), the raw embedding may leak them through similarity matching. Enable PII masking: it runs before embedding and cache hashing.