Semantic cache

Skip identical and near-identical requests entirely: zero tokens, instant replies.

A large share of agent traffic is repetitive. The semantic cache recognises when a new request is effectively the same as a recent one and replays the stored response. No upstream call, no tokens consumed, ~20 ms total latency.

Included on Scale and Enterprise

Semantic cache is available on Scale and Enterprise plans. The Free plan does not include it.

How similarity works

Every incoming request is fingerprinted locally (no external call).
We look up the nearest stored entry within your workspace namespace.
If that entry is similar enough to the new request and the request parameters match, we replay its cached response.
Cache entries expire after 24 hours by default (configurable per workspace).
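
The lookup flow above can be sketched in a few lines of Python. This is only an illustration, not the actual implementation: the `fingerprint` function is a toy hashed bag-of-words embedding, `SIMILARITY_THRESHOLD` is a made-up value (the real cutoff is not stated here), and the `SemanticCache` class and its method names are hypothetical.

```python
import math
import time
import zlib
from dataclasses import dataclass

SIMILARITY_THRESHOLD = 0.95          # hypothetical cutoff, for illustration only
DEFAULT_TTL_SECONDS = 24 * 60 * 60   # 24-hour default expiry


def fingerprint(text: str, dims: int = 256) -> list[float]:
    """Toy local fingerprint: hashed bag of words, L2-normalised."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class Entry:
    vector: list[float]
    params: dict
    response: str
    stored_at: float


class SemanticCache:
    def __init__(self, ttl: float = DEFAULT_TTL_SECONDS,
                 threshold: float = SIMILARITY_THRESHOLD):
        self.ttl = ttl
        self.threshold = threshold
        self._namespaces: dict[str, list[Entry]] = {}  # one list per workspace

    def put(self, workspace, prompt, params, response, now=None):
        now = time.time() if now is None else now
        entry = Entry(fingerprint(prompt), dict(params), response, now)
        self._namespaces.setdefault(workspace, []).append(entry)

    def get(self, workspace, prompt, params, now=None):
        """Return (response, similarity) on a hit, or None on a miss."""
        now = time.time() if now is None else now
        query = fingerprint(prompt)
        best, best_sim = None, -1.0
        # Nearest non-expired entry inside this workspace's namespace only.
        for entry in self._namespaces.get(workspace, []):
            if now - entry.stored_at > self.ttl:
                continue
            # Vectors are unit length, so the dot product is cosine similarity.
            sim = sum(a * b for a, b in zip(query, entry.vector))
            if sim > best_sim:
                best, best_sim = entry, sim
        # Replay only if the entry is close enough AND the parameters match.
        if best is not None and best_sim >= self.threshold and best.params == params:
            return best.response, best_sim
        return None
```

In this sketch, a request that differs only in letter case hits the cache, while the same prompt sent from another workspace, with different parameters, or after the TTL has elapsed is treated as a miss.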

What to look for in the response

json

{
  "_hiway": {
    "cache_hit":        true,
    "cache_similarity": 0.971,
    "routed_model":     "cache",
    "routed_tier":      "cache"
  }
}
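
A client can inspect these fields to track how often it is served from cache. The helper below is a minimal sketch: it assumes `_hiway` is a top-level object in the JSON response body, as in the example above, and that the fields may be absent on a miss. The function name `cache_info` is hypothetical.

```python
import json


def cache_info(response_body: str) -> dict:
    """Extract cache metadata from a JSON response body.

    Assumes `_hiway` sits at the top level of the body; missing
    fields are treated as a cache miss.
    """
    payload = json.loads(response_body)
    meta = payload.get("_hiway") or {}
    return {
        "hit": bool(meta.get("cache_hit", False)),
        "similarity": meta.get("cache_similarity"),
        "served_from_cache": meta.get("routed_model") == "cache",
    }
```

Logging `hit` and `similarity` per request is one way to estimate how much of your traffic the cache absorbs.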

Use PII masking upstream of cache

If your prompts include user-specific identifiers (email, phone number, account ID), those values are encoded in the raw embedding and may leak through similarity matching. Enable PII masking: it runs before embedding and cache hashing, so both are computed on the masked text.
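
To illustrate the ordering, here is a rough sketch of masking applied before any fingerprinting step. It is not the product's masking logic: the regular expressions are simplistic, the `ACC-` account ID format is invented for the example, and the placeholder tokens are hypothetical.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
ACCOUNT_ID = re.compile(r"\bACC-\d+\b")  # hypothetical account ID format


def mask_pii(text: str) -> str:
    """Replace identifiers with placeholders before embedding or hashing."""
    text = EMAIL.sub("<EMAIL>", text)
    text = ACCOUNT_ID.sub("<ACCOUNT_ID>", text)
    text = PHONE.sub("<PHONE>", text)
    return text


prompt = "Contact jane@example.com or +1 415 555 0100 about ACC-12345"
masked = mask_pii(prompt)
# Only `masked` would be passed on to the embedding and cache-hash steps.
```

Because the masked string is what gets embedded and hashed, the stored fingerprint never contains the original identifiers.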