Semantic cache short-circuit

When a request matches a cached one, either exactly or by semantic similarity, DeepintShield returns the cached answer immediately and skips both the guard checks and the provider call. On templated and chat workloads, this typically cuts LLM spend by 30–60%.
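To illustrate the two kinds of match, here is a minimal sketch of an exact-then-semantic lookup. It is not DeepintShield's implementation: `TinySemanticCache`, `toy_embed`, the vocabulary, and the 0.9 threshold are all hypothetical, and a real deployment would use a proper embedding model and vector index.

```python
import math
import re
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]

# Hypothetical toy vocabulary; a real system would use an embedding model.
VOCAB = ("reset", "password", "refund", "order", "how", "my")


def toy_embed(text: str) -> List[float]:
    """Count how often each vocabulary word appears (stand-in for an embedding)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no overlap with the vocabulary: treat as dissimilar
    return dot / (norm_a * norm_b)


class TinySemanticCache:
    """Exact match first, then nearest neighbour above a similarity threshold."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self._embed = embed
        self._threshold = threshold  # assumed value, tune per workload
        self._exact: Dict[str, str] = {}
        self._entries: List[Tuple[Vector, str]] = []

    def put(self, prompt: str, response: str) -> None:
        self._exact[prompt] = response
        self._entries.append((self._embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        # 1. Exact match: cheapest possible hit.
        if prompt in self._exact:
            return self._exact[prompt]
        # 2. Semantic match: best cosine similarity at or above the threshold.
        query = self._embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self._entries:
            score = cosine_similarity(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self._threshold else None
```

With this sketch, a prompt that differs only in casing or punctuation misses the exact lookup but is still served by the similarity step, while an unrelated prompt falls through as a miss.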

Default behavior

On whenever the cache is enabled for the workspace. No further configuration is needed: the cache is checked before the guard checks and any model call, so hits are served as cheaply as possible.

If your workload has a very low hit rate and you’d rather not pay the lookup cost on misses, you can move the lookup to run only after guard checks:

DEEPINTSHIELD_SEMANTIC_LOOKUP_AFTER_GUARDS=true  # run the cache lookup after the guard checks
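The effect of this setting on request ordering can be sketched as follows. Only the environment variable name comes from the text above; `handle_request`, `lookup_after_guards_enabled`, the plain-dict cache, and the way the value is parsed are assumptions made for illustration, and checks on the provider's response are omitted.

```python
import os
from typing import Callable, Dict, List, Mapping, Tuple


def lookup_after_guards_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Read the flag; treating only 'true' as enabled is an assumption."""
    raw = env.get("DEEPINTSHIELD_SEMANTIC_LOOKUP_AFTER_GUARDS", "false")
    return raw.strip().lower() == "true"


def handle_request(
    prompt: str,
    cache: Dict[str, str],
    run_guards: Callable[[str], None],
    call_provider: Callable[[str], str],
    lookup_after_guards: bool = False,
) -> Tuple[str, List[str]]:
    """Return the response plus the ordered list of steps that actually ran."""
    steps: List[str] = []

    if not lookup_after_guards:
        # Default order: a hit short-circuits both guards and provider.
        steps.append("cache_lookup")
        if prompt in cache:
            return cache[prompt], steps

    steps.append("guards")
    run_guards(prompt)  # expected to raise if the request is blocked

    if lookup_after_guards:
        # Deferred order: the lookup runs only for requests that passed the guards.
        steps.append("cache_lookup")
        if prompt in cache:
            return cache[prompt], steps

    steps.append("provider")
    response = call_provider(prompt)
    cache[prompt] = response
    return response, steps
```

In the default order a hit records only the `cache_lookup` step; with the flag set, the same hit records `guards` followed by `cache_lookup`, and the provider is skipped in both cases.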

Why a cache hit is safe

A cached response was already checked against your policies when it was first stored, and the cache key includes the policy version, so changing a policy automatically invalidates the affected entries. A cache hit therefore never serves a response that was validated against an outdated version of your guardrails.
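The invalidation argument can be made concrete with a small sketch. The key layout below is an assumption, not DeepintShield's actual format: `cache_key`, the whitespace and case normalization, and the `policy-v1` / `policy-v2` labels are hypothetical. The point is only that folding the policy version into the key makes entries stored under an old policy unreachable.

```python
import hashlib
import json


def cache_key(prompt: str, policy_version: str) -> str:
    """Derive a key from the policy version and a normalized prompt."""
    normalized = " ".join(prompt.lower().split())
    # JSON-encode the pair so the two fields cannot bleed into each other.
    payload = json.dumps([policy_version, normalized]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Usage: an entry stored under one policy version is not found under another.
store = {}
store[cache_key("What is your refund policy?", "policy-v1")] = "Refunds within 30 days."

hit = store.get(cache_key("what is  your refund policy?", "policy-v1"))
miss = store.get(cache_key("What is your refund policy?", "policy-v2"))
```

Here `hit` holds the stored answer and `miss` is `None`: nothing has to be deleted explicitly when the policy changes, because lookups under the new version simply compute different keys.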

Realistic cost reduction

Workload	Typical hit rate	Cost saved
FAQ bot, customer support templates	40–60%	~50%
Internal copilot, repeated dev questions	25–40%	~30%
Long-form RAG, ad-hoc creative prompts	5–15%	~10%
Streaming code completion	<5%	Minimal
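As a rough way to apply the table to your own numbers, the sketch below estimates monthly savings from a hit rate. It assumes every hit avoids the full cost of an average request, which is why the saved fraction tracks the hit rate in the table; `estimate_monthly_savings` and the optional `lookup_overhead` term are hypothetical and not part of DeepintShield.

```python
def estimate_monthly_savings(
    monthly_spend: float,
    hit_rate: float,
    lookup_overhead: float = 0.0,
) -> float:
    """Estimate provider spend avoided per month.

    monthly_spend: current LLM spend without caching.
    hit_rate: fraction of requests served from the cache (0.0 to 1.0).
    lookup_overhead: assumed monthly cost of running cache lookups.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0.0 and 1.0")
    gross_savings = monthly_spend * hit_rate
    return gross_savings - lookup_overhead


# A support bot spending 10,000 per month at a 50% hit rate.
faq_bot_savings = estimate_monthly_savings(10_000.0, 0.5)  # 5000.0
```

For low-hit-rate workloads the overhead term matters more: at a hit rate of a few percent, the net figure can approach zero, which matches the "Minimal" entry in the table.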

The cache is opt-in per workspace via the Cost Optimization settings. You can also disable it for VKs that must hit the model every time (e.g. fresh research queries).