
Agentic Cache

Agentic Cache reuses expensive work across your agent and LLM traffic - model responses, tool results, embeddings, and MCP tool discovery - but only serves a cached entry when the same caller is still authorized for it. Every cache entry is keyed to the authorization boundary (tenant + workspace + virtual key + principal class), so a result cached for one caller can never reach another, and an entry that was masked for privacy is never re-served unmasked.

That makes it safe to cache traffic that a plain LLM proxy would have to treat as un-cacheable. You get the cost and latency wins of caching without giving up tenant isolation, masking obligations, or per-caller scoping.
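
To make the boundary idea concrete, here is a minimal Python sketch of a cache whose keys include the authorization boundary and whose reads are gated by an ALLOW check. The `Boundary` fields, the `cache_key` hashing scheme, and the `is_allowed` callback are illustrative assumptions, not the gateway's actual implementation.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    """Authorization boundary (hypothetical field names)."""
    tenant: str
    workspace: str
    virtual_key: str
    principal_class: str


def cache_key(boundary: Boundary, request: dict) -> str:
    """Hash the boundary together with a canonical form of the request."""
    material = json.dumps(
        {
            "boundary": [
                boundary.tenant,
                boundary.workspace,
                boundary.virtual_key,
                boundary.principal_class,
            ],
            "request": request,
        },
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class BoundaryCache:
    """In-memory stand-in for a boundary-scoped cache."""

    def __init__(self):
        self._entries = {}

    def put(self, boundary, request, response):
        self._entries[cache_key(boundary, request)] = response

    def get(self, boundary, request, is_allowed):
        # A hit is served only while the caller still holds an ALLOW verdict.
        if not is_allowed(boundary):
            return None
        return self._entries.get(cache_key(boundary, request))
```

Because the boundary is part of the hashed material, an identical request from a different tenant, workspace, virtual key, or principal class resolves to a different key and cannot hit the first caller's entry.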

Key benefits:

  • Lower cost - repeat work short-circuits the upstream provider or tool call and is credited the exact tokens, dollars, and latency the real call would have spent.
  • Lower latency - boundary-scoped hits return in microseconds instead of a multi-second round trip.
  • Safe by construction - a hit still requires a valid (cached) ALLOW for the caller, respects masking obligations, and is bounded by your revocation SLA. The console’s cross-boundary serves counter should always read 0, which gives you a running check on that guarantee.
  • Five cache kinds, one switch - exact response, semantic, tool result, embedding/RAG, and MCP discovery, each independently toggled with its own TTL.
  • Live, no-restart tuning - flip a cache, change the similarity threshold, or adjust a TTL from the workspace console and it applies immediately.

DeepintShield ships three caching capabilities that work together. Agentic Cache is the verdict-gated, boundary-scoped layer that ties them to authorization.

Capability | What it caches | Scope
Agentic Cache (this page) | Responses, tool results, embeddings, MCP discovery | Authorization boundary; gated by the caller’s verdict
Semantic caching | LLM responses by exact + vector similarity | Per model/provider, with virtual-key scoping
Provider prompt caching | The static prefix of a prompt, at the upstream provider | Per provider

Agentic Cache attributes the savings from your semantic cache to its response (exact) and semantic kinds, so you see one reconciled $/token saved figure across the console rather than two competing numbers.

Agentic Cache is a good fit when:

  • You run multi-tenant or multi-user agent traffic and need cache reuse without any risk of one caller seeing another’s response.
  • Your agents call idempotent read tools (lookups, search, retrieval) whose results are safe to reuse for a short window.
  • You repeat embedding or RAG retrieval for the same content and want to skip the embedding API round trip and its token cost.
  • Your agents repeatedly list MCP tools (tools/list discovery) and you want that handshake served from cache.
  • You need cached responses to stay fully governed - masking obligations and per-caller scoping preserved on every hit.

All per-workspace controls live on the Agentic Cache → Settings page; the per-cache enable toggles live next to each cache’s live hit rate on the Agentic Caches page.

  1. Open Workspace → Agentic Cache → Settings.

  2. Under Master, turn on Agentic cache enabled to enable caching for the workspace. (Turning this off disables every agentic cache in the workspace.)

  3. Under Semantic & safety, set:

    • Similarity threshold (0–1) - how close a request must be to reuse a semantic hit. Higher is more conservative (fewer false matches).
    • Semantic read-only - never serve a semantic hit for a write or high-recovery action.
    • Never cache high-risk / write tools - restrict tool-result caching to idempotent reads only.
    • Encrypt at rest - encrypt cached payloads in the shared tier.
    • Honor obligations on hit - a masked entry is never re-served unmasked.
  4. Under TTLs (seconds), set the time-to-live for the Exact response, Semantic, and Tool-result caches.

  5. Click Save.

  6. Go to Agentic Cache → Agentic Caches and use each row’s toggle to enable or disable an individual cache kind (exact response, semantic, tool-result, embedding, MCP discovery). Each row shows its live hit rate and entry count so you can see the before/after as you flip a cache.
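
The interaction between the similarity threshold and the Semantic read-only setting can be sketched as follows. This is an illustration only: it assumes cosine similarity as the similarity measure, and the function names and the `is_write_action` flag are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def may_serve_semantic_hit(query_vec, cached_vec, *, threshold=0.92,
                           semantic_read_only=True, is_write_action=False):
    """Decide whether a cached entry is close enough to reuse."""
    # With read-only on, write or high-recovery actions never get a semantic hit.
    if semantic_read_only and is_write_action:
        return False
    # A higher threshold is more conservative: fewer requests qualify.
    return cosine_similarity(query_vec, cached_vec) >= threshold
```

Raising the threshold toward 1 trades hit rate for fewer false matches; lowering it does the opposite.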

The Security Caches page lists the read-mostly caches that keep the authorization path at near-zero latency (decision/verdict, policy, key config). Those are invalidated structurally and on revocation push - see Agentic Security for the decision cache.

Field | Type | Default | Description
enabled | boolean | true | Master switch for all agentic caches in the workspace.
response_enabled | boolean | true | Exact-response cache (byte-identical repeat requests).
semantic_enabled | boolean | true | Semantic (vector-similarity) cache.
tool_result_enabled | boolean | true | Tool-result cache (idempotent reads).
embedding_enabled | boolean | true | Embedding / RAG retrieval cache.
mcp_discovery_enabled | boolean | true | MCP tools/list discovery cache.
semantic_threshold | number (0–1) | 0.92 | Similarity required for a semantic hit; higher is more conservative.
semantic_read_only | boolean | true | Never serve a semantic hit for a write / high-recovery action.
never_cache_high_risk | boolean | true | Restrict tool-result caching to idempotent reads only.
encrypt_at_rest | boolean | true | Encrypt cached payloads in the shared tier.
honor_obligations | boolean | true | A masked entry is never re-served unmasked.
response_ttl_seconds | integer | 3600 | TTL for the exact-response cache.
semantic_ttl_seconds | integer | 1800 | TTL for the semantic cache.
tool_result_ttl_seconds | integer | 600 | TTL for the tool-result cache.

Kind | What it reuses | Notes
Exact response | Byte-identical repeat LLM requests | Fastest path; configurable TTL.
Semantic | LLM responses for similar (not identical) requests | Read-only by default; conservative similarity threshold.
Tool result | Results of idempotent read tools | Write / high-risk tools are excluded by default.
Embedding | Embedding / RAG retrieval | Skips the embedding API round trip and its token cost.
MCP discovery | MCP tools/list discovery responses | Speeds up repeated tool-listing handshakes.
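
For client-side tooling, the workspace settings and their defaults can be mirrored in a small Python dataclass. The field names and defaults come from the settings reference above; the class itself, the `validate` rules (including rejecting negative TTLs), and the `kind_enabled` helper are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class AgenticCacheSettings:
    """Workspace settings with the documented defaults."""
    enabled: bool = True
    response_enabled: bool = True
    semantic_enabled: bool = True
    tool_result_enabled: bool = True
    embedding_enabled: bool = True
    mcp_discovery_enabled: bool = True
    semantic_threshold: float = 0.92
    semantic_read_only: bool = True
    never_cache_high_risk: bool = True
    encrypt_at_rest: bool = True
    honor_obligations: bool = True
    response_ttl_seconds: int = 3600
    semantic_ttl_seconds: int = 1800
    tool_result_ttl_seconds: int = 600

    def validate(self):
        """Reject values outside the documented 0-1 threshold range."""
        if not 0.0 <= self.semantic_threshold <= 1.0:
            raise ValueError("semantic_threshold must be between 0 and 1")
        # Assumption: a negative TTL is never meaningful.
        for name in ("response_ttl_seconds", "semantic_ttl_seconds",
                     "tool_result_ttl_seconds"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must not be negative")

    def kind_enabled(self, kind):
        """kind: response, semantic, tool_result, embedding or mcp_discovery."""
        # The master switch overrides every per-kind toggle.
        return self.enabled and getattr(self, f"{kind}_enabled")
```

For example, `AgenticCacheSettings(enabled=False).kind_enabled("semantic")` is `False` even though `semantic_enabled` keeps its default of `true`.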

Cache reuse is bounded by the scope that distinguishes one caller from another inside a shared virtual key. By default the gateway scopes automatically; you can pin a scope mode on each virtual key for tighter control.

The available scope modes are:

Mode | A cached entry is shared across…
virtual_key | All callers using the same virtual key.
user | The same end user (resolved from request user / governance identity).
use_case | The same use_case metadata value.
session | The same session.
custom_metadata | The same value of the metadata keys you nominate.
none | No reuse - caching is effectively off for the key.
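
One way to picture the scope modes is as a function that turns a request into the string that partitions reuse. The sketch below is hypothetical: the request shape (`virtual_key`, `user`, `session_id`, `metadata`) is invented for illustration, and returning `None` when a scoping value is missing is an assumption rather than documented gateway behavior.

```python
def scope_key(mode, request, metadata_keys=()):
    """Return the reuse partition for a request, or None when nothing is shared."""
    vk = request["virtual_key"]
    metadata = request.get("metadata", {})
    if mode == "none":
        return None                      # caching is effectively off
    if mode == "virtual_key":
        return f"vk={vk}"                # shared by every caller on the key
    if mode == "user":
        parts = [("user", request.get("user"))]
    elif mode == "use_case":
        parts = [("use_case", metadata.get("use_case"))]
    elif mode == "session":
        parts = [("session", request.get("session_id"))]
    elif mode == "custom_metadata":
        parts = [(key, metadata.get(key)) for key in metadata_keys]
    else:
        raise ValueError(f"unknown scope mode: {mode}")
    # Assumption: a missing scoping value means the request cannot share entries.
    if not parts or any(value is None for _, value in parts):
        return None
    return "|".join([f"vk={vk}"] + [f"{key}={value}" for key, value in parts])
```

Two requests can share a cached entry only when they produce the same non-`None` scope key.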

On a virtual key (Workspace → Virtual Keys → edit), the cache controls are:

  • Automatic Cache - enable or disable automatic cache scoping for the key.
  • Semantic Cache - enable or disable semantic matching for the key.
  • Scope mode - pick the scope (virtual_key, user, use_case, session, custom_metadata, or none) and, for a custom scope, the metadata keys that define it (for example use_case, or session_id for a session-scoped key).
  • Allow semantic reuse on unscoped requests - leave off if several end users share one key and you don’t want one user’s response served to another. When off, semantic lookups on a key with no per-caller scope are suppressed.
  • Cache Key - an optional fixed cache key for the virtual key. Leave it empty to use automatic scoping; a request-level x-bf-cache-key header still overrides it.
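
The precedence among these controls can be summarized in a short sketch. It assumes header names arrive lower-cased in a plain dict, and the function and parameter names are hypothetical; only the ordering (request header, then fixed Cache Key, then automatic scoping) and the unscoped-semantic rule come from the descriptions above.

```python
def resolve_cache_key(headers, fixed_cache_key, automatic_scope_key):
    """Pick the cache key: request header, then fixed key, then automatic scope."""
    header_key = headers.get("x-bf-cache-key")
    if header_key:
        return header_key            # a request-level header always wins
    if fixed_cache_key:
        return fixed_cache_key       # the Cache Key set on the virtual key
    return automatic_scope_key       # empty Cache Key: automatic scoping


def semantic_lookup_allowed(semantic_cache_enabled, has_per_caller_scope,
                            allow_semantic_when_unscoped):
    """Semantic lookups on an unscoped key are suppressed unless explicitly allowed."""
    if not semantic_cache_enabled:
        return False
    return has_per_caller_scope or allow_semantic_when_unscoped
```

With several end users on one key and no per-caller scope, leaving the unscoped option off makes `semantic_lookup_allowed` return `False`, so one user's response is not offered to another.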

For MCP tool-result caching you can override the TTL per tool - including 0s to disable caching for a specific tool - alongside the workspace MCP cache settings on the Agentic Cache → Settings page. Each entry maps a "<server>-<tool>" name to a duration string:

{
  "search-web_search": "5m",
  "db-run_query": "0s"
}

Only tools explicitly marked cacheable are eligible; write or high-risk tools stay excluded. See MCP tool execution for marking tools cacheable.
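
As a rough sketch of how such overrides could be resolved, the Python below parses a duration string into seconds and falls back to a default TTL when a tool has no override. The accepted grammar (integer counts with s, m, or h units, optionally chained as in "1h30m") is an assumption; the source only shows "5m" and "0s". The function names are hypothetical.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_DURATION_PART = re.compile(r"(\d+)([smh])")


def parse_duration(text):
    """Parse strings such as "5m", "0s" or "1h30m" into whole seconds."""
    if not text:
        raise ValueError("empty duration")
    position = 0
    total = 0
    while position < len(text):
        match = _DURATION_PART.match(text, position)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    return total


def tool_ttl_seconds(overrides, server, tool, default_ttl):
    """Return the TTL for a tool; 0 means caching is disabled for that tool."""
    name = f"{server}-{tool}"        # entries are keyed as "<server>-<tool>"
    if name in overrides:
        return parse_duration(overrides[name])
    return default_ttl
```

With the overrides shown above, `search-web_search` resolves to 300 seconds and `db-run_query` resolves to 0, which disables caching for that tool.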

The Agentic Cache → Overview page shows the live picture:

  • Decision-cache hit and Agentic-cache hit rates, plus calls skipped.
  • Semantic-cache hit rate (LLM calls skipped).
  • Saved (24h) - dollars, tokens, and latency saved.
  • Cross-boundary serves - expected to read 0 at all times, which demonstrates the isolation guarantee. If the count is ever non-zero, the tile turns red.

On a cache miss the gateway records what the real upstream call cost; each later boundary-scoped hit short-circuits that call and is credited the same amount. These savings flow, additively, from a single shared source into the Overview, Cost, AI Logs, MCP & Agents Logs, and Agent Insights views, so every report reconciles. Open Agent Insights → Caching for the per-kind time series.
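
The accounting described here can be sketched as a small ledger: the cost of the real call is stored on a miss, and every later hit on the same key is credited with that cost. The class and field names are hypothetical, and the in-memory dictionary stands in for whatever shared source the gateway actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallCost:
    """What one real upstream call cost."""
    tokens: int
    dollars: float
    latency_ms: float


class SavingsLedger:
    """Single source of savings totals that every report could read from."""

    def __init__(self):
        self._miss_cost = {}
        self.hits = 0
        self.saved_tokens = 0
        self.saved_dollars = 0.0
        self.saved_latency_ms = 0.0

    def record_miss(self, key, cost):
        # Remember what the real call cost so later hits can be credited.
        self._miss_cost[key] = cost

    def record_hit(self, key):
        # Each hit skips one real call and is credited with its recorded cost.
        cost = self._miss_cost[key]
        self.hits += 1
        self.saved_tokens += cost.tokens
        self.saved_dollars += cost.dollars
        self.saved_latency_ms += cost.latency_ms
```

Because every view would read the same totals, savings are counted once and the reports reconcile.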

Conservative multi-tenant default - keep everything safe-by-default but cut cost on repeat reads:

{
  "enabled": true,
  "response_enabled": true,
  "semantic_enabled": true,
  "tool_result_enabled": true,
  "semantic_read_only": true,
  "never_cache_high_risk": true,
  "honor_obligations": true,
  "semantic_threshold": 0.92,
  "response_ttl_seconds": 3600,
  "semantic_ttl_seconds": 1800,
  "tool_result_ttl_seconds": 600
}

High-throughput chatbot with a shared key, scoped per user - semantic reuse within a single end user only:

{
  "cache_enabled": true,
  "semantic_cache_enabled": true,
  "cache_scope_mode": "user",
  "cache_allow_semantic_when_unscoped": false
}

Related pages:

  • Semantic caching - the response-similarity cache whose savings feed the agentic cache’s response and semantic kinds.
  • Provider prompt caching - cache the static prompt prefix at the upstream provider.
  • Virtual keys - the boundary that scopes every cached entry, plus the per-key cache controls.
  • MCP tool execution - mark tools cacheable and govern tool calls.