Provider Prompt Caching

Overview

Most chat and agent workloads send the same large static prefix on every request - a long system prompt, a tool catalog, a retrieved document. Provider prompt caching tells the upstream model provider to cache that prefix so it is billed and processed once, then reused on subsequent requests at a steep discount.

DeepintShield turns this on for you: it marks the cacheable portions of each request, forwards or translates the provider-specific cache hints, and lets you control the cache duration (TTL) and which providers it applies to - all from one workspace setting, with no change to your application code.

Key benefits:

Lower cost - cached input tokens are billed at a large discount (roughly 90% off on Anthropic, 50% off on OpenAI, 75% off on Gemini).
Lower latency - the provider skips re-processing the cached prefix.
Zero app changes - the gateway adds and governs the cache hints; your request payloads stay the same.
Central control - one switch and one TTL per provider, applied across every virtual key (or scoped to specific ones).

Provider coverage

Caching support - and whether you can set a duration - differs by provider. DeepintShield only sends cache hints to providers on the allow-list and that actually support caching; for everyone else the hints are stripped.

Provider	Cached?	TTL you can set	Notes
Anthropic (Claude)	Yes	`5m` or `1h`	DeepintShield enforces the TTL you choose. `1h` carries a 2× cache-write premium.
Google / Gemini	Yes	`5m`, `1h`, `6h`, `24h`	Uses Gemini context caching; the duration is applied when the cached content is created.
OpenAI	Yes	- (automatic)	OpenAI caches eligible prefixes automatically; there is no duration to set.
Amazon Bedrock	Yes	- (no TTL)	Bedrock cache points have no configurable duration.
Other providers	No	-	Cache hints are not sent.

Configuration

Provider prompt caching is part of the Semantic Cache plugin and is enabled by default. You control it at the workspace level.

Open Config → Caches.
In the Provider Prompt Caching card, confirm the master toggle is on.
Expand Advanced parameters to set:
- Anthropic Cache TTL - 5 minutes (default) or 1 hour (2× write premium).
- Gemini Cache TTL - 5m, 1h (default), 6h, or 24h.
- Min Static Prefix (tokens) - skip caching prefixes smaller than this (default 1024).
Save. The change takes effect immediately - no restart.

Field reference

Field	Type	Default	Description
`prompt_cache_enabled`	boolean	`true`	Master switch. When off, all provider cache hints are stripped.
`prompt_cache_providers`	string[]	`["anthropic", "openai", "bedrock"]`	Providers whose cache hints are honored. Add `google` to include Gemini.
`prompt_cache_anthropic_ttl`	string	`"5m"`	Anthropic cache duration: `"5m"` (no write premium) or `"1h"` (2× write premium).
`prompt_cache_google_ttl`	string	`"1h"`	Gemini context-cache duration: `"5m"`, `"1h"`, `"6h"`, or `"24h"`.
`prompt_cache_min_static_tokens`	integer	`1024`	Don’t mark prefixes smaller than this for caching.
`prompt_cache_breakpoints`	string[]	`["system", "tools"]`	Which static portions to cache: `"system"`, `"tools"`, `"large_blocks"`.
`prompt_cache_vk_scope`	string[]	`[]` (all)	Limit prompt caching to specific virtual key IDs. Empty applies to all.

Choosing a TTL

The TTL is how long the provider keeps the cached prefix warm after the last use. Pick it based on how frequently your prefix is reused:

5m (Anthropic default) - best for steady, bursty traffic where the same prefix is hit again within a few minutes. No cache-write premium.
1h and longer - best for prefixes reused over longer windows (long-lived agent sessions, periodic batch jobs). Anthropic’s 1h doubles the one-time cache-write cost, so it pays off only when reuse is frequent enough to amortize it.

DeepintShield is the single source of truth for the Anthropic duration: whatever your clients send, the gateway applies the workspace TTL, so cost stays predictable across every integration and SDK.

Cost reporting

Cached reads and writes are reflected in your usage and cost metrics automatically - the provider returns cached-token counts, and DeepintShield attributes the savings per request. See Observability for where cache savings appear.

Next steps

Semantic Caching - serve whole responses for similar requests.
Performance & Cost - all the latency and cost optimizations.
Virtual Keys - scope caching and budgets per key.