Semantic cache short-circuit
When a request matches a cached one, either exactly or by fuzzy semantic similarity, DeepintShield returns the cached answer immediately and skips both the guard checks and the provider call. On templated and chat workloads, this typically reclaims 30–60% of LLM spend.
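The exact-then-fuzzy matching described above can be sketched as a two-tier lookup. This is a minimal sketch, not DeepintShield's implementation: it assumes the caller supplies an embedding for each prompt, uses cosine similarity against a hypothetical `threshold`, and `TwoTierCache` with its methods are illustrative names. A production cache would use an approximate-nearest-neighbor index rather than the linear scan shown here.

```python
from __future__ import annotations

import hashlib
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TwoTierCache:
    """Hypothetical cache: exact-match dict first, then a linear semantic scan."""

    def __init__(self, threshold: float = 0.95):
        # Illustrative value only, not a documented DeepintShield default.
        self.threshold = threshold
        self.exact: dict[str, str] = {}
        self.semantic: list[tuple[list[float], str]] = []

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def store(self, prompt: str, embedding: list[float], answer: str) -> None:
        self.exact[self._key(prompt)] = answer
        self.semantic.append((embedding, answer))

    def lookup(self, prompt: str, embedding: list[float]) -> Optional[str]:
        # Tier 1: exact match via a cheap hash lookup.
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit
        # Tier 2: fuzzy match, taking the most similar stored entry.
        best_score, best_answer = -1.0, None
        for stored_embedding, answer in self.semantic:
            score = cosine_similarity(embedding, stored_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_score >= self.threshold:
            return best_answer
        # Miss: the caller falls through to guards and the provider.
        return None
```

The threshold is the main tuning knob in a design like this: set it too low and loosely related prompts share answers; set it too high and the fuzzy tier rarely fires.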
Default behavior
On by default. No configuration is needed: the cache is checked before any model call, so hits are served as cheaply as possible.
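A minimal sketch of where the lookup sits in the request path, assuming hypothetical `cache_lookup`, `run_guards`, and `call_provider` callables (none of these are DeepintShield APIs). The `lookup_after_guards` parameter stands in for the optional `DEEPINTSHIELD_SEMANTIC_LOOKUP_AFTER_GUARDS` setting on this page, and returning the string `"blocked"` for a guard rejection is a placeholder.

```python
from __future__ import annotations

import os
from typing import Callable, Optional


def handle_request(
    prompt: str,
    cache_lookup: Callable[[str], Optional[str]],
    run_guards: Callable[[str], bool],
    call_provider: Callable[[str], str],
    lookup_after_guards: bool = False,
) -> str:
    """Sketch of the cache lookup's position relative to guards and the provider."""
    if not lookup_after_guards:
        # Default ordering: a cache hit skips both guards and the provider call.
        cached = cache_lookup(prompt)
        if cached is not None:
            return cached
        if not run_guards(prompt):
            return "blocked"
        return call_provider(prompt)

    # Alternative ordering: guards run first, so blocked requests never pay for a lookup.
    if not run_guards(prompt):
        return "blocked"
    cached = cache_lookup(prompt)
    if cached is not None:
        return cached
    return call_provider(prompt)


def lookup_after_guards_from_env() -> bool:
    # The variable name comes from this page; treating "true" (any case) as on is an assumption.
    value = os.environ.get("DEEPINTSHIELD_SEMANTIC_LOOKUP_AFTER_GUARDS", "false")
    return value.lower() == "true"
```

With the default ordering, a hit touches only the cache. With `lookup_after_guards=True`, a request that fails the guards never reaches the cache, at the cost of running guards on requests that would have been hits.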
If your workload has a very low hit rate and you’d rather not pay the lookup cost on misses, you can move the lookup to run only after guard checks:
DEEPINTSHIELD_SEMANTIC_LOOKUP_AFTER_GUARDS=true # check the cache later in the request
Why a cache hit is safe
A cached response was already checked against your policies when it was first stored, and the cache key includes the policy version, so changing a policy automatically invalidates the affected entries. As a result, a cache hit never serves a response that would fail your current guardrails.
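A sketch of why folding the policy version into the key gives automatic invalidation. The `cache_key` helper, and hashing the model name alongside the prompt, are assumptions for illustration; the page states only that the key includes the policy version.

```python
import hashlib
import json


def cache_key(prompt: str, model: str, policy_version: str) -> str:
    """Hypothetical key derivation: the policy version is part of the hashed payload."""
    # sort_keys=True makes the serialization stable, so equal inputs give equal keys.
    payload = json.dumps(
        {"prompt": prompt, "model": model, "policy_version": policy_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

An entry stored under `v1` is simply unreachable once the active version becomes `v2`, so the next matching request misses, goes through the current guards, and only then gets a fresh answer cached. A fuzzy index would need the same partitioning, for example one index per policy version.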
Realistic cost reduction
| Workload | Typical hit rate | Cost saved |
|---|---|---|
| FAQ bot, customer support templates | 40–60% | ~50% |
| Internal copilot, repeated dev questions | 25–40% | ~30% |
| Long-form RAG, ad-hoc creative prompts | 5–15% | ~10% |
| Streaming code completion | <5% | Minimal |
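A simple cost model helps when reasoning about whether the lookup pays for itself. This sketch assumes a hit costs nothing beyond the lookup, a miss costs one full model call, and every request pays a lookup whose cost is `lookup_cost_ratio` times a model call. Both the model and the parameter name are assumptions, not measured DeepintShield figures.

```python
def net_savings_fraction(hit_rate: float, lookup_cost_ratio: float) -> float:
    """Fraction of LLM spend saved under a simple per-request cost model.

    Assumptions (illustrative only):
    - every request pays one lookup costing lookup_cost_ratio x one model call
    - hits pay nothing else; misses pay one full model call
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    baseline = 1.0  # normalized cost per request with no cache
    with_cache = lookup_cost_ratio + (1.0 - hit_rate) * baseline
    return (baseline - with_cache) / baseline
```

Under this model the saving reduces to `hit_rate - lookup_cost_ratio`, so when lookups are cheap the cost column roughly tracks the hit-rate column, and a very low hit rate can make the lookup a net cost (for example, a 3% hit rate with a lookup costing 5% of a model call).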
The cache is opt-in per workspace via the Cost Optimization settings. You can also disable it for VKs that must hit the model every time (e.g., fresh research queries).
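A sketch of combining the per-workspace switch with a per-VK bypass list. `WorkspaceCacheSettings`, `bypass_vks`, and `should_use_cache` are hypothetical names, not the actual Cost Optimization settings schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WorkspaceCacheSettings:
    """Hypothetical per-workspace cache settings; field names are illustrative."""

    cache_enabled: bool  # workspace-level switch
    bypass_vks: set[str] = field(default_factory=set)  # VKs that must always hit the model

    def should_use_cache(self, vk: str) -> bool:
        # The cache applies only when the workspace enables it and the VK is not exempt.
        return self.cache_enabled and vk not in self.bypass_vks
```

Keeping the bypass list per VK lets a workspace cache its FAQ traffic while still sending research queries straight to the model.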