Semantic cache (Scale+)

Qdrant-backed, cosine ≥ 0.92. Skip identical and near-identical requests entirely.

A large share of agent traffic is repetitive. The semantic cache embeds each request (prompt, system message, and tool schema), stores the full response, and replays the cached answer when a later request lands within cosine similarity ≥ 0.92 of an existing entry. No upstream call, no tokens billed, ~20 ms total latency.

Scale plan and above

Semantic cache is available on Scale, Business, and Enterprise. Build and Free get exact-match deduplication through Guardian, but not the Qdrant-backed fuzzy cache.

How similarity works

  • Every incoming request is embedded with a small local model (no external call).
  • A Qdrant lookup finds the nearest cached entry in the workspace namespace.
  • If cosine similarity ≥ 0.92 AND the tool schema / temperature / model hints match, we replay the cached response.
  • Cache entries expire after 24 hours by default (configurable per workspace).
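The replay decision described above can be sketched roughly as follows. Everything here is illustrative, not the actual implementation: `CacheEntry`, `should_replay`, and the field names are assumptions, and the real lookup happens inside Qdrant rather than in client code.

```python
from dataclasses import dataclass, field

THRESHOLD = 0.92  # cosine similarity floor from the docs


@dataclass
class CacheEntry:
    """Hypothetical shape of a cached entry in the workspace namespace."""
    embedding: list[float]
    tool_schema_hash: str
    temperature: float
    model_hint: str
    response: dict = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)


def should_replay(request_emb: list[float], schema_hash: str,
                  temperature: float, model_hint: str,
                  entry: CacheEntry) -> bool:
    # Similarity alone is not enough: the tool schema, temperature,
    # and model hints must also match before the response is replayed.
    if entry.tool_schema_hash != schema_hash:
        return False
    if entry.temperature != temperature or entry.model_hint != model_hint:
        return False
    return cosine(request_emb, entry.embedding) >= THRESHOLD
```

The important design point is the AND: a near-identical prompt with a different tool schema or temperature must not reuse the old answer, so the metadata check runs before the similarity check.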

What to look for in the response

```json
{
  "_hiway": {
    "cache_hit":        true,
    "cache_similarity": 0.971,
    "routed_model":     "cache",
    "routed_tier":      "cache"
  }
}
```
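Client code can branch on this metadata, for example to log hit rates. The helper below is a sketch that assumes the `_hiway` block has exactly the shape shown above:

```python
def is_cache_hit(response: dict) -> bool:
    """Return True when the response was replayed from the semantic cache."""
    meta = response.get("_hiway", {})
    return bool(meta.get("cache_hit")) and meta.get("routed_tier") == "cache"


resp = {
    "_hiway": {
        "cache_hit": True,
        "cache_similarity": 0.971,
        "routed_model": "cache",
        "routed_tier": "cache",
    }
}
hit = is_cache_hit(resp)
```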

Use PII masking upstream of the cache

If your prompts include user-specific identifiers (email, phone, account ID), the raw embedding may leak them through similarity. Enable PII masking — it runs before embedding and cache hashing.
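As a rough illustration of what masking before embedding looks like, here is a minimal sketch. The regex rules are hypothetical stand-ins; Guardian's actual masker covers far more identifier types.

```python
import re

# Hypothetical masking rules applied before embedding and cache hashing.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]


def mask_pii(prompt: str) -> str:
    """Replace user-specific identifiers with stable placeholder tokens."""
    for pattern, token in PATTERNS:
        prompt = pattern.sub(token, prompt)
    return prompt
```

Because the placeholders are stable tokens, two prompts that differ only in the user's email still embed to nearly identical vectors, which keeps the cache effective without storing the identifiers themselves.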