Agentic Cache

Overview

Agentic Cache reuses expensive work across your agent and LLM traffic - model responses, tool results, embeddings, and MCP tool discovery - but only serves a cached entry when the same caller is still authorized for it. Every cache is keyed to the authorization boundary (tenant + workspace + virtual key + principal class), so a result cached for one caller can never reach another, and an entry that was masked for privacy is never re-served unmasked.

That makes it safe to cache traffic that a plain LLM proxy would have to treat as un-cacheable. You get the cost and latency wins of caching without giving up tenant isolation, masking obligations, or per-caller scoping.

Key benefits:

Lower cost - repeat work short-circuits the upstream provider or tool call and is credited the exact tokens, dollars, and latency the real call would have spent.
Lower latency - boundary-scoped hits return in microseconds instead of a multi-second round trip.
Safe by construction - a hit still requires a valid (cached) ALLOW for the caller, respects masking obligations, and is bounded by your revocation SLA. A cross-boundary serves counter on the console is always 0 to prove it.
Five cache kinds, one switch - exact response, semantic, tool result, embedding/RAG, and MCP discovery, each independently toggled with its own TTL.
Live, no-restart tuning - flip a cache, change the similarity threshold, or adjust a TTL from the workspace console and it applies immediately.

How it differs from the other caches

DeepintShield ships three caching capabilities that work together. Agentic Cache is the verdict-gated, boundary-scoped layer that ties them to authorization.

Capability	What it caches	Scope
Agentic Cache (this page)	Responses, tool results, embeddings, MCP discovery	Authorization boundary; gated by the caller’s verdict
Semantic caching	LLM responses by exact + vector similarity	Per model/provider, with virtual-key scoping
Provider prompt caching	The static prefix of a prompt, at the upstream provider	Per provider

Agentic Cache attributes the savings from your semantic cache into its response (exact) and semantic kinds, so you see one reconciled $/token saved figure across the console rather than two competing numbers.

When to use it

You run multi-tenant or multi-user agent traffic and need cache reuse without any risk of one caller seeing another’s response.
Your agents call idempotent read tools (lookups, search, retrieval) whose results are safe to reuse for a short window.
You repeat embedding or RAG retrieval for the same content and want to skip the embedding API round trip and its token cost.
Your agents repeatedly list MCP tools (tools/list discovery) and you want that handshake served from cache.
You need cached responses to stay fully governed - masking obligations and per-caller scoping preserved on every hit.

Configuration

All per-workspace controls live on the Agentic Cache → Settings page; the per-cache enable toggles live next to each cache’s live hit rate on the Agentic Caches page.

Open Workspace → Agentic Cache → Settings.
Under Master, turn on Agentic cache enabled to enable caching for the workspace. (Turning this off disables every agentic cache here.)
Under Semantic & safety, set:
- Similarity threshold (0–1) - how close a request must be to reuse a semantic hit. Higher is more conservative (fewer false matches).
- Semantic read-only - never serve a semantic hit for a write or high-recovery action.
- Never cache high-risk / write tools - restrict tool-result caching to idempotent reads only.
- Encrypt at rest - encrypt cached payloads in the shared tier.
- Honor obligations on hit - a masked entry is never re-served unmasked.
Under TTLs (seconds), set the time-to-live for the Exact response, Semantic, and Tool-result caches.
Click Save.
Go to Agentic Cache → Agentic Caches and use each row’s toggle to enable or disable an individual cache kind (exact response, semantic, tool-result, embedding, MCP discovery). Each row shows its live hit rate and entry count so you can see the before/after as you flip a cache.

The Security Caches page lists the read-mostly caches that keep the authorization path near-zero latency (decision/verdict, policy, key config). Those are invalidated structurally and on revocation push - see Agentic Security for the decision cache.

Settings reference

Field	Type	Default	Description
`enabled`	boolean	`true`	Master switch for all agentic caches in the workspace.
`response_enabled`	boolean	`true`	Exact-response cache (byte-identical repeat requests).
`semantic_enabled`	boolean	`true`	Semantic (vector-similarity) cache.
`tool_result_enabled`	boolean	`true`	Tool-result cache (idempotent reads).
`embedding_enabled`	boolean	`true`	Embedding / RAG retrieval cache.
`mcp_discovery_enabled`	boolean	`true`	MCP `tools/list` discovery cache.
`semantic_threshold`	number (0–1)	`0.92`	Similarity required for a semantic hit; higher is more conservative.
`semantic_read_only`	boolean	`true`	Never serve a semantic hit for a write / high-recovery action.
`never_cache_high_risk`	boolean	`true`	Restrict tool-result caching to idempotent reads only.
`encrypt_at_rest`	boolean	`true`	Encrypt cached payloads in the shared tier.
`honor_obligations`	boolean	`true`	A masked entry is never re-served unmasked.
`response_ttl_seconds`	integer	`3600`	TTL for the exact-response cache.
`semantic_ttl_seconds`	integer	`1800`	TTL for the semantic cache.
`tool_result_ttl_seconds`	integer	`600`	TTL for the tool-result cache.

Cache kinds

Kind	What it reuses	Notes
Exact response	Byte-identical repeat LLM requests	Fastest path; configurable TTL.
Semantic	LLM responses for similar (not identical) requests	Read-only by default; conservative similarity threshold.
Tool result	Results of idempotent read tools	Write / high-risk tools are excluded by default.
Embedding	Embedding / RAG retrieval	Skips the embedding API round trip and its token cost.
MCP discovery	MCP `tools/list` discovery responses	Speeds up repeated tool-listing handshakes.

Per-virtual-key cache scope

Caching reuse is bounded by the scope that distinguishes one caller from another inside a shared virtual key. By default the gateway scopes automatically; you can pin a scope mode on each virtual key for tighter control.

The available scope modes are:

Mode	A cached entry is shared across…
`virtual_key`	All callers using the same virtual key.
`user`	The same end user (resolved from request user / governance identity).
`use_case`	The same `use_case` metadata value.
`session`	The same session.
`custom_metadata`	The same value of the metadata keys you nominate.
`none`	No reuse - caching is effectively off for the key.

On a virtual key (Workspace → Virtual Keys → edit), the cache controls are:

Automatic Cache - enable or disable automatic cache scoping for the key.
Semantic Cache - enable or disable semantic matching for the key.
Scope mode - pick the scope (virtual_key, user, use_case, session, custom_metadata, or none) and, for a custom scope, the metadata keys that define it (for example use_case, or session_id for a session-scoped key).
Allow semantic reuse on unscoped requests - leave off if several end users share one key and you don’t want one user’s response served to another. When off, semantic lookups on a key with no per-caller scope are suppressed.
Cache Key - an optional fixed cache key for the key. Leave it empty to use automatic scoping; a request-level x-bf-cache-key header still overrides it.

Per-tool MCP cache TTL

For MCP tool-result caching you can override the TTL per tool - including 0s to disable caching for a specific tool - alongside the workspace MCP cache settings on the Agentic Cache → Settings page. Each entry maps a "<server>-<tool>" name to a duration string:

{
  "search-web_search": "5m",
  "db-run_query": "0s"
}

Only tools explicitly marked cacheable are eligible; write or high-risk tools stay excluded. See MCP tool execution for marking tools cacheable.

Monitoring savings

The Agentic Cache → Overview page shows the live picture:

Decision-cache hit and Agentic-cache hit rates, plus calls skipped.
Semantic-cache hit rate (LLM calls skipped).
Saved (24h) - dollars, tokens, and latency saved.
Cross-boundary serves - always 0, proving the isolation guarantee. If this is ever non-zero, the tile turns red.

On a cache miss the gateway records what the real upstream call cost; each later boundary-scoped hit short-circuits that call and is credited the same amount. These savings flow, additively, from a single shared source into the Overview, Cost, AI Logs, MCP & Agents Logs, and Agent Insights views, so every report reconciles. Open Agent Insights → Caching for the per-kind time series.

Examples

Conservative multi-tenant default - keep everything safe-by-default but cut cost on repeat reads:

{
  "enabled": true,
  "response_enabled": true,
  "semantic_enabled": true,
  "tool_result_enabled": true,
  "semantic_read_only": true,
  "never_cache_high_risk": true,
  "honor_obligations": true,
  "semantic_threshold": 0.92,
  "response_ttl_seconds": 3600,
  "semantic_ttl_seconds": 1800,
  "tool_result_ttl_seconds": 600
}

High-throughput chatbot with shared key, scoped per user - semantic reuse within a single end user only:

{
  "cache_enabled": true,
  "semantic_cache_enabled": true,
  "cache_scope_mode": "user",
  "cache_allow_semantic_when_unscoped": false
}

Next steps

Semantic caching - the response-similarity cache whose savings feed the agentic cache’s response and semantic kinds.
Provider prompt caching - cache the static prompt prefix at the upstream provider.
Virtual keys - the boundary that scopes every cached entry, plus the per-key cache controls.
MCP tool execution - mark tools cacheable and govern tool calls.