Semantic Caching

Overview

Semantic caching uses vector similarity search to cache AI responses and serve them for semantically similar requests, even when the exact wording differs. This reduces API costs and latency for repeated or similar queries.

Key Benefits:

Cost Reduction: Avoid expensive LLM API calls for similar requests
Improved Performance: Sub-millisecond cache retrieval vs multi-second API calls
Intelligent Matching: Semantic similarity beyond exact text matching
Streaming Support: Full streaming response caching with proper chunk ordering

Core Features

Dual-Layer Caching: Exact hash matching + semantic similarity search (customizable threshold)
Vector-Powered Intelligence: Uses embeddings to find semantically similar requests
Dynamic Configuration: Per-request TTL and threshold overrides via headers/context
Model/Provider Isolation: Separate caching per model and provider combination
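
The dual-layer lookup described above can be illustrated with a small in-memory sketch. This is not the plugin's implementation: the class name, the linear scan, and the use of cosine similarity with hand-written toy embeddings are assumptions made for illustration. In a real deployment the embeddings come from the configured embedding model and the search runs in the vector store.

```python
import hashlib
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class DualLayerCache:
    """Hypothetical toy cache: exact hash lookup first, then similarity search."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = {}  # exact hash -> (embedding, response)

    @staticmethod
    def _hash(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def store(self, text, embedding, response):
        self.entries[self._hash(text)] = (embedding, response)

    def lookup(self, text, embedding):
        # Layer 1: exact hash match.
        key = self._hash(text)
        if key in self.entries:
            return "direct", self.entries[key][1]
        # Layer 2: best semantic match, accepted only at or above the threshold.
        best_score, best_response = 0.0, None
        for stored_embedding, response in self.entries.values():
            score = cosine_similarity(embedding, stored_embedding)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return "semantic", best_response
        return None, None
```

Raising the threshold makes the second layer stricter: fewer semantic hits, but a lower risk of returning a response for a question that only looks similar.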

Vector Store Setup

Semantic caching requires a configured vector store. DeepIntShield supports the following vector databases:

Weaviate: Production-ready vector database with gRPC support
Redis: High-performance in-memory vector store using RediSearch-compatible APIs (including Valkey bundles with FT.* support)
Qdrant: Rust-based vector search engine with advanced filtering
Pinecone: Managed vector database service with serverless options

To use your own vector store, choose the store type (Weaviate, Redis/Valkey, Qdrant, or Pinecone) and provide its connection details (for example, host and scheme for Weaviate) in the Web UI.

Semantic Cache Configuration

Semantic Cache Plugin Configuration

Note: Make sure you have a vector store set up before configuring the semantic cache plugin.

Navigate to Settings
- Open the DeepIntShield UI at https://app.deepintshield.com
- Go to Settings.

Configure Semantic Cache Plugin

Toggle the plugin switch to enable it, and fill in the required fields.

Required Fields:

Provider: The provider to use for caching.
Embedding Model: The embedding model to use for caching.

Other settings you can tune include the TTL, similarity threshold, conversation history threshold, and whether to cache by model and by provider.

TTL Format Options:

Duration strings: "30s", "5m", "1h", "24h"
Numeric seconds: 300 (5 minutes), 3600 (1 hour)
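
To make the two TTL formats concrete, here is a minimal sketch of a parser that converts either form to seconds. The function name parse_ttl is hypothetical, and the sketch only handles the single-unit forms listed above (s, m, h); the plugin's own parser may accept more.

```python
import re

# Seconds per unit for the duration suffixes shown in the documentation.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_ttl(value):
    """Convert a TTL such as "30s", "5m", "1h", "3600" or 300 into seconds."""
    if isinstance(value, bool):
        raise ValueError("TTL must be a duration string or a number of seconds")
    if isinstance(value, (int, float)):
        seconds = int(value)
    else:
        match = re.fullmatch(r"(\d+)([smh]?)", str(value).strip(), flags=re.ASCII)
        if match is None:
            raise ValueError(f"unrecognized TTL: {value!r}")
        number, unit = match.groups()
        # A bare number is treated as seconds.
        seconds = int(number) * _UNIT_SECONDS.get(unit, 1)
    if seconds <= 0:
        raise ValueError("TTL must be positive")
    return seconds
```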

Note: API keys are taken from the provider configuration, so make sure the provider you specify here has its keys configured. Configuration changes may take a short time to take effect.

Direct Hash Mode (Embedding-Free)

Direct hash mode provides exact-match caching without requiring an embedding provider. Each request is hashed deterministically based on its normalized input, parameters, and stream flag. Identical requests produce cache hits; different wording is a cache miss.
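
As a rough illustration of deterministic hashing, the sketch below builds a key from the normalized input, the parameters, and the stream flag. The exact normalization rules and hash function used by the plugin are not documented here, so whitespace collapsing, canonical JSON, and SHA-256 are all assumptions, as is the function name direct_cache_key.

```python
import hashlib
import json


def direct_cache_key(messages, params, stream, model=None, provider=None):
    """Hypothetical direct-hash key: identical requests map to identical keys."""
    # Assumed normalization: collapse runs of whitespace in each message.
    normalized = [
        {"role": m["role"], "content": " ".join(m["content"].split())}
        for m in messages
    ]
    payload = {
        "messages": normalized,
        "params": params,
        "stream": bool(stream),
        "model": model,        # included when caching per model
        "provider": provider,  # included when caching per provider
    }
    # sort_keys makes the serialization independent of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the stream flag is part of the key in this sketch, a streaming and a non-streaming request with the same input are cached separately.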

When to use direct hash mode:

You only need exact-match deduplication (no fuzzy/semantic matching)
You cannot or do not want to call an external embedding API
You want the lowest possible latency with zero embedding overhead
Cost-sensitive environments where embedding API calls add up

Setup

To enable direct-only mode globally, omit the provider and keys fields from the plugin config. The plugin will automatically fall back to direct search only.

In the Web UI, enable the Semantic Cache plugin and leave the Provider and embedding key fields empty so it falls back to direct-only mode. Set the TTL and the cache by model / by provider options as needed. (For Helm-based deployments, the equivalent values are shown below.)

deepintshield:
  plugins:
    semanticCache:
      enabled: true
      config:
        dimension: 1
        ttl: "5m"
        cleanup_on_shutdown: true
        cache_by_model: true
        cache_by_provider: true

When initialized this way, all requests automatically use direct hash matching regardless of the x-bf-cache-type header. No embeddings are generated, and no embedding provider credentials are needed.

Recommended Vector Store

Redis/Valkey-compatible stores are recommended for direct hash mode. They do not require vectors for metadata-only entries, and all cache fields are indexed as TAG fields for fast exact-match lookups.

In the Web UI, set the vector store type to Redis and point it at your Redis/Valkey endpoint under Config → Caches → Vector Store. (For Helm-based deployments, the equivalent values are shown below.)

vectorStore:
  enabled: true
  type: redis
  redis:
    external:
      enabled: true
      host: "redis-or-valkey.example.com"
      port: 6379
      password: "your-redis-password"

Per-Request Cache Type Override

When the plugin is initialized without an embedding provider (direct-only mode), all requests use direct hash matching automatically. The x-bf-cache-type header has no effect.

When the plugin is initialized with an embedding provider (dual-layer mode), you can force direct-only matching on specific requests using the x-bf-cache-type: direct header. See Cache Type Control for details.

Caching is opt-in per request: only requests that send the x-bf-cache-key header are cached. With curl, set the header directly:

# This request WILL be cached
curl -H "x-bf-cache-key: session-123" ...

# This request will NOT be cached (no header)
curl ...

With the Python SDK, pass the cache key through extra_headers on the request:

from deepintshield import DeepintShield

shield = DeepintShield.from_env()
openai = shield.openai()

# This request WILL be cached
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={"x-bf-cache-key": "session-123"},
)

Per-Request Overrides

Workspace-level settings (configured under Governance Hub → Cost Optimization) are the default for every request. Any individual call can override them by sending one or more of the following headers. This is useful for batch jobs that need a longer TTL, evals that should bypass the cache, or hot paths that want a stricter similarity threshold.

Header	Effect
`x-bf-cache-ttl`	Override the semantic cache TTL for this request (e.g. `30s`, `5m`, `3600`).
`x-bf-cache-threshold`	Override the similarity threshold for this request (`0.0`–`1.0`).
`x-bf-cache-type`	Force a specific cache type for this request: `direct` or `semantic`.
`x-bf-cache-no-store`	Set to `true` to read from the cache but not write the new response back.
`x-bf-cache-key`	Provide an explicit cache key for direct hash matching (skips semantic lookup).

Override the default TTL and similarity threshold per request using the x-bf-cache-ttl and x-bf-cache-threshold headers:

# Custom TTL and threshold
curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-ttl: 30s" \
     -H "x-bf-cache-threshold: 0.9" ...
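
When many call sites need overrides, a small helper can assemble the headers consistently. The sketch below is only a convenience wrapper around the header names from the table above; the function name build_cache_headers and its validation rules are hypothetical and not part of any SDK.

```python
def build_cache_headers(cache_key, ttl=None, threshold=None, cache_type=None, no_store=False):
    """Build a dict of x-bf-cache-* headers for a single request."""
    headers = {"x-bf-cache-key": cache_key}
    if ttl is not None:
        # Accepts duration strings ("30s") or numeric seconds (3600).
        headers["x-bf-cache-ttl"] = str(ttl)
    if threshold is not None:
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        headers["x-bf-cache-threshold"] = str(threshold)
    if cache_type is not None:
        if cache_type not in ("direct", "semantic"):
            raise ValueError("cache_type must be 'direct' or 'semantic'")
        headers["x-bf-cache-type"] = cache_type
    if no_store:
        headers["x-bf-cache-no-store"] = "true"
    return headers
```

The resulting dict can be passed as extra_headers in the Python SDK call, or translated into -H flags for curl.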

Advanced Cache Control

Cache Type Control

Control which caching mechanism to use per request with the x-bf-cache-type header:

# Direct hash matching only
curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-type: direct" ...

# Semantic similarity search only
curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-type: semantic" ...

# Default: Both (if header not specified)
curl -H "x-bf-cache-key: session-123" ...

No-Store Control

Disable response caching while still allowing cache reads with the x-bf-cache-no-store header:

# Read from cache but don't store response
curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-no-store: true" ...

Conversation Configuration

History Threshold Logic

The ConversationHistoryThreshold setting (conversation_history_threshold in the plugin config) skips caching for conversations whose message count exceeds the threshold, which helps prevent false positives.

Why this matters:

Semantic False Positives: Long conversation histories have a high probability of semantic matches with unrelated conversations due to topic overlap
Direct Cache Inefficiency: Long conversations rarely have exact hash matches, making direct caching less effective
Performance: Reduces vector store load by filtering out low-value caching scenarios

For example, the following setting skips caching when a conversation contains more than 3 messages:

{
  "conversation_history_threshold": 3
}

Recommended Values:

1-2: Very conservative (may miss valuable caching opportunities)
3-5: Balanced approach (default: 3)
10+: Cache longer conversations (higher false positive risk)
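
The threshold check itself is simple, as the sketch below shows. The function name is hypothetical, and whether the plugin counts system messages toward the total is not stated here; this sketch counts every message.

```python
def should_skip_cache(messages, conversation_history_threshold=3):
    """Return True when the conversation is too long to be cached."""
    # Caching is skipped only when the count is strictly greater than the threshold.
    return len(messages) > conversation_history_threshold
```

With the default of 3, a conversation of three messages is still cached, while a fourth message disables caching for that request.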

System Prompt Handling

The exclude_system_prompt setting controls whether system messages are included in cache key generation. The default, false, includes them:

{
  "exclude_system_prompt": false
}

When to exclude (true):

System prompts change frequently but content is similar
Multiple system prompt variations for same use case
Focus caching on user content similarity

When to include (false):

System prompts significantly change response behavior
Each system prompt requires distinct cached responses
Strict response consistency requirements
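
The effect of this setting can be pictured as a filter applied to the messages before the cache key is computed. The sketch below assumes OpenAI-style message dicts with a role field; the function name is hypothetical and the plugin's internal handling may differ.

```python
def messages_for_cache_key(messages, exclude_system_prompt=False):
    """Return the messages that would contribute to the cache key."""
    if not exclude_system_prompt:
        return list(messages)
    # Drop system messages so only the remaining content affects the key.
    return [m for m in messages if m.get("role") != "system"]
```

With exclude_system_prompt set to true, two requests that differ only in their system prompt yield the same filtered messages and can therefore share a cache entry.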

Cache Management

Cache Metadata Location

When the semantic cache plugin handles a request, cache metadata is automatically added to the response.

Location: extra_fields.cache_debug (a JSON object)

Fields:

cache_hit (boolean): true if the response was served from the cache, false on a cache miss.
hit_type (string): "semantic" for similarity match, "direct" for hash match
cache_id (string): Unique cache entry ID for management operations (present only for cache hits)

Semantic Cache Only:

provider_used (string): Provider used for calculating the semantic match embedding. (present for both cache hits and misses)
model_used (string): Model used for calculating the semantic match embedding. (present for both cache hits and misses)
input_tokens (number): Number of tokens extracted from the request for the semantic match embedding calculation. (present for both cache hits and misses)
threshold (number): Similarity threshold used for the match. (present only for cache hits)
similarity (number): Similarity score for the match. (present only for cache hits)

Example HTTP responses for a direct hit, a semantic hit, and a cache miss, in that order:

{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "hit_type": "direct",
      "cache_id": "550e8500-e29b-41d4-a725-446655440001"
    }
  }
}

{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "hit_type": "semantic",
      "cache_id": "550e8500-e29b-41d4-a725-446655440001",
      "threshold": 0.8,
      "similarity": 0.95,
      "provider_used": "openai",
      "model_used": "gpt-4o-mini",
      "input_tokens": 100
    }
  }
}

{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": false,
      "provider_used": "openai",
      "model_used": "gpt-4o-mini",
      "input_tokens": 20
    }
  }
}

These fields allow you to detect cached responses and get the cache entry ID needed for clearing specific entries.
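
A client can read this metadata from the parsed response body. The sketch below assumes the response has already been decoded into a Python dict with the structure shown in the examples above; the function name cache_status is hypothetical.

```python
def cache_status(response):
    """Return (cache_hit, hit_type, cache_id) from a parsed response dict."""
    debug = (response.get("extra_fields") or {}).get("cache_debug") or {}
    if debug.get("cache_hit") is True:
        return True, debug.get("hit_type"), debug.get("cache_id")
    # Missing metadata is treated the same as a cache miss.
    return False, None, None
```

The returned cache_id is the value to use when clearing a specific entry.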

Clear Specific Cache Entry

You can clear cached entries from the Web UI. Each cached response exposes its cache_id under extra_fields.cache_debug (see Cache Metadata Location above) and was stored under a cache key. From Config → Caches in the Web UI you can clear a specific cached entry by its ID, or clear all entries for a given cache key.

Cache Lifecycle & Cleanup

The semantic cache automatically handles cleanup to prevent storage bloat:

Automatic Cleanup:

TTL Expiration: Entries are automatically removed when TTL expires
Shutdown Cleanup: When the DeepIntShield client shuts down, all cache entries are cleared from the vector store namespace and the namespace itself is removed
Namespace Isolation: Each DeepIntShield instance uses isolated vector store namespaces to prevent conflicts

Manual Cleanup Options:

Clear specific entries by cache ID from the Web UI (see Clear Specific Cache Entry above)
Clear all entries for a cache key from the Web UI
Restart DeepIntShield to clear all cache data