Semantic cache

Skip identical and near-identical requests entirely - zero tokens, instant replies.

A large share of agent traffic is repetitive. The semantic cache recognises when a new request is effectively the same as a recent one and replays the stored response. No upstream call, no tokens consumed, ~20 ms total latency.

Included on Scale and Enterprise

Semantic cache is available on Scale and Enterprise plans. The Free plan does not include it.

How similarity works

  • Every incoming request is fingerprinted locally (no external call).
  • We look up the nearest stored entry within your workspace namespace.
  • If that nearest entry is similar enough to the new request and the request parameters match, we replay its cached response.
  • Cache entries expire after 24 hours by default (configurable per workspace).
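
The steps above can be sketched as a small in-memory cache. This is an illustration only, not the service's implementation: the hashed bag-of-words `fingerprint`, the `THRESHOLD` value of 0.95, and the `SemanticCache` class are all assumptions made for the example. Only the 24-hour default expiry and the per-workspace namespace come from the description above.

```python
import hashlib
import math
import time
from dataclasses import dataclass

TTL_SECONDS = 24 * 60 * 60  # default expiry described above
THRESHOLD = 0.95            # hypothetical cut-off; the real value is not documented here
DIM = 256                   # size of the toy fingerprint vector


def fingerprint(text: str) -> list:
    """Toy local fingerprint: hashed bag of words, L2-normalised (an assumption)."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list, b: list) -> float:
    # Both vectors are already normalised, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Entry:
    vector: list
    params: tuple
    response: str
    created: float


class SemanticCache:
    def __init__(self) -> None:
        self._by_workspace = {}  # workspace name -> list of Entry

    def put(self, workspace, prompt, params, response, now=None):
        now = time.time() if now is None else now
        entry = Entry(fingerprint(prompt), tuple(sorted(params.items())), response, now)
        self._by_workspace.setdefault(workspace, []).append(entry)

    def get(self, workspace, prompt, params, now=None):
        """Return (response, similarity) on a hit, or None on a miss."""
        now = time.time() if now is None else now
        query = fingerprint(prompt)
        key = tuple(sorted(params.items()))
        best, best_sim = None, 0.0
        for entry in self._by_workspace.get(workspace, []):
            if now - entry.created > TTL_SECONDS:
                continue  # expired entry
            if entry.params != key:
                continue  # request parameters must match
            sim = cosine(query, entry.vector)
            if sim > best_sim:
                best, best_sim = entry, sim
        if best is not None and best_sim >= THRESHOLD:
            return best.response, best_sim
        return None
```

In this sketch a lookup only considers entries from the same workspace, with the same parameters, that are younger than the expiry window; a real deployment would use a proper embedding model and a nearest-neighbour index rather than a linear scan.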

What to look for in the response

json
{
  "_hiway": {
    "cache_hit":        true,
    "cache_similarity": 0.971,
    "routed_model":     "cache",
    "routed_tier":      "cache"
  }
}
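
A client can read these fields to tell cached replies from fresh ones. The helper below is a minimal sketch; `cache_info` is a hypothetical name, and it assumes `_hiway` sits at the top level of the JSON response body as shown above and may be absent or partial on a miss.

```python
import json


def cache_info(body: str) -> dict:
    """Extract cache metadata from a JSON response body (hypothetical helper)."""
    meta = json.loads(body).get("_hiway", {})
    return {
        "hit": bool(meta.get("cache_hit", False)),
        "similarity": meta.get("cache_similarity"),  # None if not reported
        "model": meta.get("routed_model"),
    }


sample = '{"_hiway": {"cache_hit": true, "cache_similarity": 0.971, "routed_model": "cache", "routed_tier": "cache"}}'
info = cache_info(sample)
if info["hit"]:
    print(f"served from cache (similarity {info['similarity']})")
```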

Use PII masking upstream of cache

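If your prompts include user-specific identifiers (email, phone, account ID), those identifiers are embedded along with the rest of the prompt and can leak when a similar request matches the stored entry. Enable PII masking: it runs before embedding and cache hashing, so masked identifiers are not part of the fingerprint.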
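
To illustrate the ordering, here is a rough sketch of masking applied before a prompt is fingerprinted. It is not the product's masking feature: `mask_pii` and both regular expressions are hypothetical, they cover only emails and phone-like numbers, and account IDs would need a pattern specific to your own format.

```python
import re

# Simplified patterns, chosen for the example only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace emails and phone-like numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


prompt = "Reset password for jane.doe@example.com, call +1 415 555 0100"
masked = mask_pii(prompt)  # this masked text is what would be embedded and hashed
```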