Agentic Cache
Overview
Agentic Cache reuses expensive work across your agent and LLM traffic - model
responses, tool results, embeddings, and MCP tool discovery - but only serves a
cached entry when the same caller is still authorized for it. Every cache is
keyed to the authorization boundary (tenant + workspace + virtual key + principal class), so a result cached for one caller can never reach another,
and an entry that was masked for privacy is never re-served unmasked.
That makes it safe to cache traffic that a plain LLM proxy would have to treat as un-cacheable. You get the cost and latency wins of caching without giving up tenant isolation, masking obligations, or per-caller scoping.
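To make the boundary-keyed idea concrete, here is a minimal Python sketch of how a cache key could fold the authorization boundary into the lookup key. It is an illustration under stated assumptions, not the gateway's actual key format: the `Boundary` class, its field names, and the SHA-256 construction are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    """Hypothetical authorization boundary; field names are assumptions."""
    tenant: str
    workspace: str
    virtual_key: str
    principal_class: str


def cache_key(boundary: Boundary, request_body: bytes) -> str:
    """Derive a key that is unique per boundary and per request body."""
    h = hashlib.sha256()
    for part in (boundary.tenant, boundary.workspace,
                 boundary.virtual_key, boundary.principal_class):
        encoded = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") never collide.
        h.update(len(encoded).to_bytes(4, "big"))
        h.update(encoded)
    h.update(request_body)
    return h.hexdigest()


alice = Boundary("tenant-a", "ws-1", "vk-123", "end_user")
bob = Boundary("tenant-b", "ws-1", "vk-123", "end_user")
body = b'{"model": "example-model", "prompt": "hello"}'

# Same request body, different boundary: the keys differ, so an entry cached
# for one caller cannot be found by the other.
print(cache_key(alice, body) != cache_key(bob, body))
```

Because the boundary is part of the key itself, isolation does not depend on a separate filter applied after the lookup.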
Key benefits:
- Lower cost - repeat work short-circuits the upstream provider or tool call and is credited the exact tokens, dollars, and latency the real call would have spent.
- Lower latency - boundary-scoped hits return in microseconds instead of a multi-second round trip.
- Safe by construction - a hit still requires a valid (cached) ALLOW for the caller, respects masking obligations, and is bounded by your revocation SLA. The Cross-boundary serves counter on the console should always read 0, which makes the guarantee visible.
- Five cache kinds, one switch - exact response, semantic, tool result, embedding/RAG, and MCP discovery, each independently toggled with its own TTL.
- Live, no-restart tuning - flip a cache, change the similarity threshold, or adjust a TTL from the workspace console and it applies immediately.
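The "Safe by construction" rule in the list above can be read as a gate that sits in front of every hit. The following Python sketch shows one possible shape for that gate. The `Entry` type, the `verdict` string, and the `caller_requires_masking` flag are assumptions made for illustration; the real decision path is not described on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    payload: str
    masked: bool  # True if the payload was stored after masking was applied


def serve_from_cache(entry: Optional[Entry], verdict: str,
                     caller_requires_masking: bool) -> Optional[str]:
    """Return the cached payload only when every safety condition holds."""
    if entry is None:
        return None  # plain miss
    if verdict != "ALLOW":
        return None  # no valid ALLOW for this caller: fall through to a miss
    if caller_requires_masking and not entry.masked:
        return None  # assumed rule: never hand unmasked data to a masked caller
    return entry.payload


print(serve_from_cache(Entry("m***@example.com", True), "ALLOW", True))
print(serve_from_cache(Entry("m***@example.com", True), "DENY", True))
```

A refused hit is treated as a miss here, so the request would continue to the normal upstream path rather than fail.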
How it differs from the other caches
DeepintShield ships three caching capabilities that work together. Agentic Cache is the verdict-gated, boundary-scoped layer that ties them to authorization.
| Capability | What it caches | Scope |
|---|---|---|
| Agentic Cache (this page) | Responses, tool results, embeddings, MCP discovery | Authorization boundary; gated by the caller’s verdict |
| Semantic caching | LLM responses by exact + vector similarity | Per model/provider, with virtual-key scoping |
| Provider prompt caching | The static prefix of a prompt, at the upstream provider | Per provider |
Agentic Cache attributes the savings from your semantic cache into its
response (exact) and semantic kinds, so you see one reconciled
$/token saved figure across the console rather than two competing numbers.
When to use it
- You run multi-tenant or multi-user agent traffic and need cache reuse without any risk of one caller seeing another’s response.
- Your agents call idempotent read tools (lookups, search, retrieval) whose results are safe to reuse for a short window.
- You repeat embedding or RAG retrieval for the same content and want to skip the embedding API round trip and its token cost.
- Your agents repeatedly list MCP tools (`tools/list` discovery) and you want that handshake served from cache.
- You need cached responses to stay fully governed - masking obligations and per-caller scoping preserved on every hit.
Configuration
All per-workspace controls live on the Agentic Cache → Settings page; the per-cache enable toggles live next to each cache’s live hit rate on the Agentic Caches page.
1. Open Workspace → Agentic Cache → Settings.
2. Under Master, turn on Agentic cache enabled to enable caching for the workspace. (Turning this off disables every agentic cache here.)
3. Under Semantic & safety, set:
   - Similarity threshold (0–1) - how close a request must be to reuse a semantic hit. Higher is more conservative (fewer false matches).
   - Semantic read-only - never serve a semantic hit for a write or high-recovery action.
   - Never cache high-risk / write tools - restrict tool-result caching to idempotent reads only.
   - Encrypt at rest - encrypt cached payloads in the shared tier.
   - Honor obligations on hit - a masked entry is never re-served unmasked.
4. Under TTLs (seconds), set the time-to-live for the Exact response, Semantic, and Tool-result caches.
5. Click Save.
6. Go to Agentic Cache → Agentic Caches and use each row’s toggle to enable or disable an individual cache kind (exact response, semantic, tool-result, embedding, MCP discovery). Each row shows its live hit rate and entry count so you can see the before/after as you flip a cache.
The Security Caches page lists the read-mostly caches that keep the authorization path at near-zero latency (decision/verdict, policy, key config). Those are invalidated structurally and on revocation push - see Agentic Security for the decision cache.
Settings reference
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | true | Master switch for all agentic caches in the workspace. |
| `response_enabled` | boolean | true | Exact-response cache (byte-identical repeat requests). |
| `semantic_enabled` | boolean | true | Semantic (vector-similarity) cache. |
| `tool_result_enabled` | boolean | true | Tool-result cache (idempotent reads). |
| `embedding_enabled` | boolean | true | Embedding / RAG retrieval cache. |
| `mcp_discovery_enabled` | boolean | true | MCP `tools/list` discovery cache. |
| `semantic_threshold` | number (0–1) | 0.92 | Similarity required for a semantic hit; higher is more conservative. |
| `semantic_read_only` | boolean | true | Never serve a semantic hit for a write / high-recovery action. |
| `never_cache_high_risk` | boolean | true | Restrict tool-result caching to idempotent reads only. |
| `encrypt_at_rest` | boolean | true | Encrypt cached payloads in the shared tier. |
| `honor_obligations` | boolean | true | A masked entry is never re-served unmasked. |
| `response_ttl_seconds` | integer | 3600 | TTL for the exact-response cache. |
| `semantic_ttl_seconds` | integer | 1800 | TTL for the semantic cache. |
| `tool_result_ttl_seconds` | integer | 600 | TTL for the tool-result cache. |
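If you manage these settings from a script, it can help to mirror the table in a typed structure and check values before submitting them. The Python sketch below copies the field names and defaults from the settings reference. The `AgenticCacheSettings` class and its `validate` method are hypothetical, and the rule that TTLs must not be negative is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgenticCacheSettings:
    # Field names and defaults mirror the settings reference table.
    enabled: bool = True
    response_enabled: bool = True
    semantic_enabled: bool = True
    tool_result_enabled: bool = True
    embedding_enabled: bool = True
    mcp_discovery_enabled: bool = True
    semantic_threshold: float = 0.92
    semantic_read_only: bool = True
    never_cache_high_risk: bool = True
    encrypt_at_rest: bool = True
    honor_obligations: bool = True
    response_ttl_seconds: int = 3600
    semantic_ttl_seconds: int = 1800
    tool_result_ttl_seconds: int = 600

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the values look sane."""
        errors = []
        if not 0.0 <= self.semantic_threshold <= 1.0:
            errors.append("semantic_threshold must be between 0 and 1")
        for name in ("response_ttl_seconds", "semantic_ttl_seconds",
                     "tool_result_ttl_seconds"):
            if getattr(self, name) < 0:  # assumed rule, not from the docs
                errors.append(f"{name} must not be negative")
        return errors


print(AgenticCacheSettings().validate())
print(AgenticCacheSettings(semantic_threshold=1.5).validate())
```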
Cache kinds
| Kind | What it reuses | Notes |
|---|---|---|
| Exact response | Byte-identical repeat LLM requests | Fastest path; configurable TTL. |
| Semantic | LLM responses for similar (not identical) requests | Read-only by default; conservative similarity threshold. |
| Tool result | Results of idempotent read tools | Write / high-risk tools are excluded by default. |
| Embedding | Embedding / RAG retrieval | Skips the embedding API round trip and its token cost. |
| MCP discovery | MCP tools/list discovery responses | Speeds up repeated tool-listing handshakes. |
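To illustrate how a similarity threshold separates the semantic kind from the exact kind, here is a small Python sketch that compares two embedding vectors against the default threshold of 0.92. Cosine similarity is an assumption made for this example; the page does not state which similarity metric the gateway uses, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_semantic_hit(query_vec, cached_vec, threshold=0.92):
    """A candidate counts as a hit only at or above the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


# Nearly parallel vectors clear the threshold; a 45-degree pair does not.
print(is_semantic_hit([1.0, 0.1], [1.0, 0.0]))
print(is_semantic_hit([1.0, 1.0], [1.0, 0.0]))
```

Raising the threshold toward 1 shrinks the set of requests that count as "similar enough", which is why higher values are described as more conservative.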
Per-virtual-key cache scope
Cache reuse is bounded by the scope that distinguishes one caller from another inside a shared virtual key. By default the gateway scopes automatically; you can pin a scope mode on each virtual key for tighter control.
The available scope modes are:
| Mode | A cached entry is shared across… |
|---|---|
| `virtual_key` | All callers using the same virtual key. |
| `user` | The same end user (resolved from request user / governance identity). |
| `use_case` | The same use_case metadata value. |
| `session` | The same session. |
| `custom_metadata` | The same value of the metadata keys you nominate. |
| `none` | No reuse - caching is effectively off for the key. |
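One way to picture the scope modes in the table is as a function that picks which request attributes join the cache key. The Python sketch below is illustrative only: the request dictionary shape, the `scope_value` function, and the string format it returns are assumptions, not the gateway's internal representation.

```python
def scope_value(mode, request, metadata_keys=()):
    """Return the scope component for a cache key, or None when reuse is off."""
    vk = request["virtual_key"]
    meta = request.get("metadata", {})
    if mode == "none":
        return None  # no reuse at all for this key
    if mode == "virtual_key":
        return vk
    if mode == "user":
        return f"{vk}|user={request['user']}"
    if mode == "use_case":
        return f"{vk}|use_case={meta['use_case']}"
    if mode == "session":
        return f"{vk}|session={request['session_id']}"
    if mode == "custom_metadata":
        # Sort the nominated keys so their order never changes the scope.
        parts = [f"{key}={meta[key]}" for key in sorted(metadata_keys)]
        return f"{vk}|" + "|".join(parts)
    raise ValueError(f"unknown scope mode: {mode}")


request = {
    "virtual_key": "vk-1",
    "user": "alice",
    "session_id": "s-9",
    "metadata": {"use_case": "support", "team": "billing"},
}
print(scope_value("user", request))
print(scope_value("custom_metadata", request, ("team", "use_case")))
```

Two requests can share a cached entry only when they produce the same scope value, so narrower modes such as `user` or `session` trade hit rate for tighter isolation.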
On a virtual key (Workspace → Virtual Keys → edit), the cache controls are:
- Automatic Cache - enable or disable automatic cache scoping for the key.
- Semantic Cache - enable or disable semantic matching for the key.
- Scope mode - pick the scope (`virtual_key`, `user`, `use_case`, `session`, `custom_metadata`, or `none`) and, for a custom scope, the metadata keys that define it (for example `use_case`, or `session_id` for a session-scoped key).
- Allow semantic reuse on unscoped requests - leave off if several end users share one key and you don’t want one user’s response served to another. When off, semantic lookups on a key with no per-caller scope are suppressed.
- Cache Key - an optional fixed cache key for the key. Leave it empty to use automatic scoping; a request-level `x-bf-cache-key` header still overrides it.
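The precedence described for the Cache Key control (request header first, then the fixed key on the virtual key, then automatic scoping) can be sketched as a short resolver. The `resolve_cache_key` function and its arguments are hypothetical; only the `x-bf-cache-key` header name comes from this page.

```python
def resolve_cache_key(headers, fixed_key, automatic_key):
    """Pick the effective cache key: header, then fixed key, then automatic."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    header_key = normalized.get("x-bf-cache-key")
    if header_key:
        return header_key      # a request-level header always wins
    if fixed_key:
        return fixed_key       # fixed Cache Key configured on the virtual key
    return automatic_key       # empty fixed key: fall back to automatic scoping


print(resolve_cache_key({"X-BF-Cache-Key": "report-42"}, "team-key", "auto-abc"))
print(resolve_cache_key({}, "", "auto-abc"))
```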
Per-tool MCP cache TTL
For MCP tool-result caching you can override the TTL per tool - including 0s to disable caching for a specific tool - alongside the workspace MCP cache settings on the Agentic Cache → Settings page. Each entry maps a "<server>-<tool>" name to a duration string:

    {
      "search-web_search": "5m",
      "db-run_query": "0s"
    }

Only tools explicitly marked cacheable are eligible; write or high-risk tools stay excluded. See MCP tool execution for marking tools cacheable.
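As a rough illustration of how such an override map could be interpreted, the Python sketch below resolves a tool's TTL from a `"<server>-<tool>"` entry and treats `0s` as "do not cache". The parser accepts only whole numbers followed by `s`, `m`, or `h`, which is an assumption; the full set of duration formats the gateway accepts is not specified here, and both function names are hypothetical.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Parse a simple duration such as '5m' or '0s' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def tool_ttl_seconds(overrides, server, tool, default_ttl):
    """Return the TTL for a tool, or None when caching is disabled for it."""
    raw = overrides.get(f"{server}-{tool}")
    ttl = default_ttl if raw is None else parse_duration(raw)
    return None if ttl == 0 else ttl


overrides = {"search-web_search": "5m", "db-run_query": "0s"}
print(tool_ttl_seconds(overrides, "search", "web_search", 600))
print(tool_ttl_seconds(overrides, "db", "run_query", 600))
print(tool_ttl_seconds(overrides, "docs", "lookup", 600))
```

A tool without an override falls back to the default TTL passed in, which stands in for the workspace tool-result TTL in this sketch.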
Monitoring savings
The Agentic Cache → Overview page shows the live picture:
- Decision-cache hit and Agentic-cache hit rates, plus calls skipped.
- Semantic-cache hit rate (LLM calls skipped).
- Saved (24h) - dollars, tokens, and latency saved.
- Cross-boundary serves - always 0, proving the isolation guarantee. If this is ever non-zero, the tile turns red.
On a cache miss the gateway records what the real upstream call cost; each later boundary-scoped hit short-circuits that call and is credited the same amount. These savings flow, additively, from a single shared source into the Overview, Cost, AI Logs, MCP & Agents Logs, and Agent Insights views, so every report reconciles. Open Agent Insights → Caching for the per-kind time series.
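The accounting rule above (record the real cost on a miss, credit that same cost on each later hit) can be sketched as a small ledger. The `Cost` and `SavingsLedger` names, and the sample figures, are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Cost:
    tokens: int = 0
    dollars: float = 0.0
    latency_ms: float = 0.0


class SavingsLedger:
    def __init__(self):
        self.recorded = {}   # cache key -> Cost of the real upstream call
        self.saved = Cost()  # running total credited by hits

    def on_miss(self, key, cost):
        """Remember what the real upstream call cost for this key."""
        self.recorded[key] = cost

    def on_hit(self, key):
        """Credit the recorded cost, since the upstream call was skipped."""
        cost = self.recorded[key]
        self.saved.tokens += cost.tokens
        self.saved.dollars += cost.dollars
        self.saved.latency_ms += cost.latency_ms


ledger = SavingsLedger()
ledger.on_miss("k1", Cost(tokens=1200, dollars=0.018, latency_ms=2400.0))
for _ in range(3):
    ledger.on_hit("k1")
print(ledger.saved.tokens, round(ledger.saved.dollars, 3), ledger.saved.latency_ms)
```

Keeping a single ledger as the source for every view is what lets the totals reconcile across reports.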
Examples
Conservative multi-tenant default - keep everything safe-by-default but cut cost on repeat reads:

    {
      "enabled": true,
      "response_enabled": true,
      "semantic_enabled": true,
      "tool_result_enabled": true,
      "semantic_read_only": true,
      "never_cache_high_risk": true,
      "honor_obligations": true,
      "semantic_threshold": 0.92,
      "response_ttl_seconds": 3600,
      "semantic_ttl_seconds": 1800,
      "tool_result_ttl_seconds": 600
    }

High-throughput chatbot with shared key, scoped per user - semantic reuse within a single end user only:

    {
      "cache_enabled": true,
      "semantic_cache_enabled": true,
      "cache_scope_mode": "user",
      "cache_allow_semantic_when_unscoped": false
    }

Next steps
- Semantic caching - the response-similarity cache whose savings feed the agentic cache’s response and semantic kinds.
- Provider prompt caching - cache the static prompt prefix at the upstream provider.
- Virtual keys - the boundary that scopes every cached entry, plus the per-key cache controls.
- MCP tool execution - mark tools cacheable and govern tool calls.