Provider Prompt Caching
Overview
Section titled “Overview”Most chat and agent workloads send the same large static prefix on every request - a long system prompt, a tool catalog, a retrieved document. Provider prompt caching tells the upstream model provider to cache that prefix so it is billed and processed once, then reused on subsequent requests at a steep discount.
DeepintShield turns this on for you: it marks the cacheable portions of each request, forwards or translates the provider-specific cache hints, and lets you control the cache duration (TTL) and which providers it applies to - all from one workspace setting, with no change to your application code.
Key benefits:
- Lower cost - cached input tokens are billed at a large discount (roughly 90% off on Anthropic, 50% off on OpenAI, 75% off on Gemini).
- Lower latency - the provider skips re-processing the cached prefix.
- Zero app changes - the gateway adds and governs the cache hints; your request payloads stay the same.
- Central control - one switch and one TTL per provider, applied across every virtual key (or scoped to specific ones).
Provider coverage
Section titled “Provider coverage”Caching support - and whether you can set a duration - differs by provider. DeepintShield only sends cache hints to providers on the allow-list and that actually support caching; for everyone else the hints are stripped.
| Provider | Cached? | TTL you can set | Notes |
|---|---|---|---|
| Anthropic (Claude) | Yes | 5m or 1h | DeepintShield enforces the TTL you choose. 1h carries a 2× cache-write premium. |
| Google / Gemini | Yes | 5m, 1h, 6h, 24h | Uses Gemini context caching; the duration is applied when the cached content is created. |
| OpenAI | Yes | - (automatic) | OpenAI caches eligible prefixes automatically; there is no duration to set. |
| Amazon Bedrock | Yes | - (no TTL) | Bedrock cache points have no configurable duration. |
| Other providers | No | - | Cache hints are not sent. |
Configuration
Section titled “Configuration”Provider prompt caching is part of the Semantic Cache plugin and is enabled by default. You control it at the workspace level.
- Open Config → Caches.
- In the Provider Prompt Caching card, confirm the master toggle is on.
- Expand Advanced parameters to set:
- Anthropic Cache TTL -
5 minutes(default) or1 hour(2× write premium). - Gemini Cache TTL -
5m,1h(default),6h, or24h. - Min Static Prefix (tokens) - skip caching prefixes smaller than this (default
1024).
- Anthropic Cache TTL -
- Save. The change takes effect immediately - no restart.
Field reference
Section titled “Field reference”| Field | Type | Default | Description |
|---|---|---|---|
prompt_cache_enabled | boolean | true | Master switch. When off, all provider cache hints are stripped. |
prompt_cache_providers | string[] | ["anthropic", "openai", "bedrock"] | Providers whose cache hints are honored. Add google to include Gemini. |
prompt_cache_anthropic_ttl | string | "5m" | Anthropic cache duration: "5m" (no write premium) or "1h" (2× write premium). |
prompt_cache_google_ttl | string | "1h" | Gemini context-cache duration: "5m", "1h", "6h", or "24h". |
prompt_cache_min_static_tokens | integer | 1024 | Don’t mark prefixes smaller than this for caching. |
prompt_cache_breakpoints | string[] | ["system", "tools"] | Which static portions to cache: "system", "tools", "large_blocks". |
prompt_cache_vk_scope | string[] | [] (all) | Limit prompt caching to specific virtual key IDs. Empty applies to all. |
Choosing a TTL
Section titled “Choosing a TTL”The TTL is how long the provider keeps the cached prefix warm after the last use. Pick it based on how frequently your prefix is reused:
5m(Anthropic default) - best for steady, bursty traffic where the same prefix is hit again within a few minutes. No cache-write premium.1hand longer - best for prefixes reused over longer windows (long-lived agent sessions, periodic batch jobs). Anthropic’s1hdoubles the one-time cache-write cost, so it pays off only when reuse is frequent enough to amortize it.
DeepintShield is the single source of truth for the Anthropic duration: whatever your clients send, the gateway applies the workspace TTL, so cost stays predictable across every integration and SDK.
Cost reporting
Section titled “Cost reporting”Cached reads and writes are reflected in your usage and cost metrics automatically - the provider returns cached-token counts, and DeepintShield attributes the savings per request. See Observability for where cache savings appear.
Next steps
Section titled “Next steps”- Semantic Caching - serve whole responses for similar requests.
- Performance & Cost - all the latency and cost optimizations.
- Virtual Keys - scope caching and budgets per key.