Semantic cache short-circuit

When a request matches a cached one, either exactly or by fuzzy semantic similarity, DeepintShield returns the cached answer immediately and skips both the guard checks and the provider call. On templated or chat workloads, this typically reclaims 30–60% of LLM spend.
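The two kinds of match described above can be sketched as a lookup that tries an exact key first and falls back to nearest-neighbour similarity. This is only an illustration: `SemanticCache`, `embed`, and the 0.8 threshold are hypothetical names and values, and the hashed bag-of-words function is a toy stand-in for a real embedding model, not DeepintShield's actual implementation.

```python
import hashlib
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


def embed(text: str, dims: int = 64) -> List[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words counts."""
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    return vec


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SemanticCache:
    threshold: float = 0.8  # assumed cutoff, not a documented default
    exact: Dict[str, str] = field(default_factory=dict)
    entries: List[Tuple[List[float], str]] = field(default_factory=list)

    def store(self, prompt: str, response: str) -> None:
        self.exact[prompt] = response
        self.entries.append((embed(prompt), response))

    def lookup(self, prompt: str) -> Optional[str]:
        # 1. Exact match: the cheapest path.
        if prompt in self.exact:
            return self.exact[prompt]
        # 2. Fuzzy match: best stored prompt by cosine similarity.
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

With this sketch, storing "how do I reset my password" and then looking up "how do I reset my password please" produces a fuzzy hit, because the two prompts differ by a single word. A production system would replace the linear scan with a vector index.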

On by default. No configuration is needed: the cache is checked before any model call, so hits are served as cheaply as possible.

If your workload has a very low hit rate and you’d rather not pay the lookup cost on misses, you can move the lookup to run only after guard checks:

DEEPINTSHIELD_SEMANTIC_LOOKUP_AFTER_GUARDS=true # check the cache later in the request
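To make the effect of the flag concrete, here is a minimal sketch of how the two orderings might look in a request pipeline. Only the environment variable name comes from this page; `handle_request` and its three callables are hypothetical, and the real internals may differ.

```python
import os
from typing import Callable, Optional


def handle_request(
    prompt: str,
    cache_lookup: Callable[[str], Optional[str]],
    run_guards: Callable[[str], None],
    call_provider: Callable[[str], str],
) -> str:
    """Hypothetical pipeline showing where the cache lookup can sit."""
    after_guards = (
        os.environ.get("DEEPINTSHIELD_SEMANTIC_LOOKUP_AFTER_GUARDS", "false").lower()
        == "true"
    )

    if not after_guards:
        # Default: check the cache first; a hit skips guards and the provider.
        cached = cache_lookup(prompt)
        if cached is not None:
            return cached

    run_guards(prompt)  # assumed to raise if the prompt violates a policy

    if after_guards:
        # Flag set: look up only after guards; a hit still skips the provider.
        cached = cache_lookup(prompt)
        if cached is not None:
            return cached

    return call_provider(prompt)
```

In this sketch the flag changes only the position of the lookup. In both orderings a cache hit still avoids the provider call.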

A cached response was already checked against your policies when it was first stored, and the cache key includes the policy version, so changing a policy automatically invalidates affected entries. As a result, the cache never serves a response that wouldn’t pass your current guardrails.
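One way to picture policy-versioned keys is to hash the policy version together with the request. The sketch below assumes a key built from the prompt, model name, and policy version; the field names and the `cache_key` helper are illustrative, not DeepintShield's actual key format.

```python
import hashlib
import json


def cache_key(prompt: str, model: str, policy_version: str) -> str:
    """Derive a cache key that changes whenever the policy version changes."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "policy_version": policy_version},
        sort_keys=True,  # stable field order so equal inputs hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same request under two hypothetical policy versions yields different keys,
# so entries stored under "v1" are never found once the policy moves to "v2".
old_key = cache_key("What is our refund policy?", "example-model", "v1")
new_key = cache_key("What is our refund policy?", "example-model", "v2")
```

Under this scheme nothing has to be deleted when a policy changes: stale entries simply become unreachable and can be left to expire.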

| Workload | Typical hit rate | Cost saved |
| --- | --- | --- |
| FAQ bot, customer support templates | 40–60% | ~50% |
| Internal copilot, repeated dev questions | 25–40% | ~30% |
| Long-form RAG, ad-hoc creative prompts | 5–15% | ~10% |
| Streaming code completion | <5% | Minimal |
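The cost-saved figures roughly track the hit rate. As a back-of-the-envelope sketch, assuming every cache hit is free and replaces one average-cost provider call (both simplifications), the saving is just spend multiplied by hit rate. The function name and the dollar figure are illustrative.

```python
def estimated_savings(monthly_spend: float, hit_rate: float) -> float:
    """Spend avoided if each cache hit replaces one average-cost provider call.

    Assumes hits cost nothing and all requests cost the same on average;
    these are simplifications for illustration only.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return monthly_spend * hit_rate


# Hypothetical FAQ bot: $10,000 per month at a 50% hit rate.
print(estimated_savings(10_000, 0.5))  # 5000.0
```

Real savings depend on how expensive the cached requests were compared with the misses, so treat the result as a first estimate.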

The cache is opt-in per workspace via the Cost Optimization settings. You can also disable it for VKs that must hit the model every time (e.g. fresh research queries).