Start Conservative
Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.
DeepIntShield provides three key performance configuration parameters that control throughput, memory usage, and request handling behavior:
| Parameter | Scope | Default | Description |
|---|---|---|---|
| Concurrency | Per Provider | 1000 | Number of requests processed simultaneously for a provider |
| Buffer Size | Per Provider | 5000 | Maximum requests that can be queued before blocking/dropping |
| Initial Pool Size | Global | 5000 | Objects pre-allocated at startup to keep latency consistent under load |
Concurrency
What it does: Controls how many requests DeepIntShield processes in parallel for each provider. Higher concurrency increases throughput; lower concurrency reduces resource usage and respects provider rate limits.
Impact:
Default: 1000 concurrent requests per provider
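The effect of a concurrency limit can be illustrated with a small sketch. This is not DeepIntShield's implementation, which this page does not show; it is a generic Python asyncio example in which a semaphore caps how many simulated requests are in flight at once. The function name `run_with_limit` and the short sleep standing in for a provider call are assumptions made for illustration.

```python
import asyncio


async def run_with_limit(num_requests: int, concurrency: int) -> int:
    """Run simulated requests under a concurrency cap; return the peak in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def handle_request() -> None:
        nonlocal in_flight, peak
        # Requests beyond the cap wait here until a slot frees up.
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # hypothetical stand-in for a provider call
            in_flight -= 1

    await asyncio.gather(*(handle_request() for _ in range(num_requests)))
    return peak
```

Calling `asyncio.run(run_with_limit(num_requests=10, concurrency=3))` returns the highest number of requests that were in flight together, which never exceeds the cap.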

    {
      "providers": {
        "openai": {
          "keys": [],
          "concurrency_and_buffer_size": {
            "concurrency": 100,
            "buffer_size": 500
          }
        }
      }
    }

Buffer Size
What it does: Sets the capacity of the request queue for each provider. Incoming requests are held here until DeepIntShield has capacity to process them.
Impact:
Default: 5000 requests per provider queue
Queue Full Behavior: Controlled by drop_excess_requests:
false (default): New requests block until queue space is available.
true: New requests are immediately dropped with an error when the queue is full.
Initial Pool Size
What it does: Controls how many objects DeepIntShield pre-allocates at startup to keep latency consistent under high traffic.
Impact:
Default: 5000 objects

    {
      "client": {
        "initial_pool_size": 10000,
        "drop_excess_requests": false
      }
    }

Configure these settings per provider based on the expected RPS for that specific provider:
| Provider RPS | Concurrency | Buffer Size |
|---|---|---|
| 100 | 100 | 150 |
| 500 | 500 | 750 |
| 1000 | 1000 | 1500 |
| 2500 | 2500 | 3750 |
| 5000 | 5000 | 7500 |
| 10000 | 10000 | 15000 |
Formula:

    concurrency = expected_rps
    buffer_size = 1.5 × expected_rps

This ratio gives each provider queue headroom to absorb traffic spikes without dropping requests.
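As a quick check on the per-provider formula, the following minimal sketch computes both values from an expected RPS. The function name `provider_settings` is hypothetical, and rounding the buffer size up for odd RPS values is an assumption; the source only lists round numbers.

```python
def provider_settings(expected_rps: int) -> dict:
    """Apply concurrency = rps and buffer_size = 1.5 × rps for one provider."""
    if expected_rps <= 0:
        raise ValueError("expected_rps must be positive")
    return {
        "concurrency": expected_rps,
        # Integer form of ceil(1.5 × rps); rounding up is an assumption.
        "buffer_size": (3 * expected_rps + 1) // 2,
    }
```

For an expected 500 RPS, `provider_settings(500)` yields a concurrency of 500 and a buffer size of 750, matching the table row.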
Configure this setting based on total RPS across all providers combined:
| Total RPS (All Providers) | Initial Pool Size | Memory Estimate |
|---|---|---|
| 100 | 150 | ~50 MB |
| 500 | 750 | ~100 MB |
| 1000 | 1500 | ~200 MB |
| 2500 | 3750 | ~400 MB |
| 5000 | 7500 | ~800 MB |
| 10000 | 15000 | ~1.5 GB |
Formula:

    initial_pool_size = 1.5 × total_expected_rps

Additionally, ensure:

    initial_pool_size >= max(buffer_size across all providers)

This keeps pools pre-warmed to handle peak queue depths without runtime allocations.
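The two pool-size rules can be combined by taking the larger of the two candidates. The sketch below assumes per-provider RPS and buffer sizes are supplied as dictionaries; the function name `initial_pool_size` and its parameters are hypothetical, and rounding up is an assumption.

```python
def initial_pool_size(provider_rps: dict[str, int],
                      buffer_sizes: dict[str, int]) -> int:
    """Return max(1.5 × total RPS, largest provider buffer_size)."""
    total_rps = sum(provider_rps.values())
    # Integer form of ceil(1.5 × total_rps); rounding up is an assumption.
    by_formula = (3 * total_rps + 1) // 2
    largest_buffer = max(buffer_sizes.values(), default=0)
    return max(by_formula, largest_buffer)
```

When one provider has an unusually large buffer relative to total traffic, the second rule takes over and the pool is sized to that buffer instead.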
When running multiple DeepIntShield instances behind a load balancer, size the settings for your total expected RPS, then divide each value by the number of nodes.

    Per-Node Concurrency = Total Concurrency / Number of Nodes
    Per-Node Buffer Size = Total Buffer Size / Number of Nodes
    Per-Node Initial Pool Size = Total Initial Pool Size / Number of Nodes

Example: 10,000 RPS total across 4 nodes. The Total column is the aggregate capacity across all 4 nodes, which is also what a single node would need to handle 10,000 RPS on its own:
| Parameter | Total (Aggregate) | Per Node (4 nodes) |
|---|---|---|
| Concurrency | 10000 | 2500 |
| Buffer Size | 15000 | 3750 |
| Initial Pool Size | 15000 | 3750 |
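The per-node division in the table can be reproduced with a short helper. The function name `split_across_nodes` is hypothetical, and rounding up when the total does not divide evenly is an assumption, chosen so the aggregate capacity never falls below the total.

```python
def split_across_nodes(totals: dict[str, int], nodes: int) -> dict[str, int]:
    """Divide each aggregate setting by the node count, rounding up."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # -(-a // b) is ceiling division for positive integers.
    return {name: -(-value // nodes) for name, value in totals.items()}


totals = {"concurrency": 10000, "buffer_size": 15000, "initial_pool_size": 15000}
per_node = split_across_nodes(totals, nodes=4)
```

Here `per_node` holds the same values as the Per Node column above.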
Per-node configuration (4 nodes, 10,000 RPS total):

    {
      "client": {
        "initial_pool_size": 3750,
        "drop_excess_requests": false
      },
      "providers": {
        "openai": {
          "keys": [],
          "concurrency_and_buffer_size": { "concurrency": 2500, "buffer_size": 3750 }
        },
        "anthropic": {
          "keys": [],
          "concurrency_and_buffer_size": { "concurrency": 2500, "buffer_size": 3750 }
        }
      }
    }

Different providers have different rate limits and latency characteristics. Tune each provider independently:
| Provider | Typical Rate Limits | Recommended Concurrency | Notes |
|---|---|---|---|
| OpenAI | 500-10000 RPM (varies by tier) | 100-500 | Higher tiers support more concurrency |
| Anthropic | 1000-4000 RPM (varies by tier) | 50-200 | More conservative rate limits |
| Bedrock | Per-model limits | 100-300 | Check AWS quotas for your account |
| Azure OpenAI | Deployment-specific | 100-500 | Configure per-deployment |
| Vertex AI | Per-model quotas | 100-300 | Check GCP quotas |
| Groq | Very high throughput | 500-1000 | Designed for high concurrency |
| Ollama | Local resource bound | 10-50 | Limited by local GPU/CPU |

    {
      "providers": {
        "openai": {
          "keys": [],
          "concurrency_and_buffer_size": { "concurrency": 200, "buffer_size": 1000 }
        },
        "anthropic": {
          "keys": [],
          "concurrency_and_buffer_size": { "concurrency": 100, "buffer_size": 500 }
        },
        "groq": {
          "keys": [],
          "concurrency_and_buffer_size": { "concurrency": 500, "buffer_size": 2500 }
        },
        "ollama": {
          "keys": [],
          "concurrency_and_buffer_size": { "concurrency": 20, "buffer_size": 100 }
        }
      }
    }

When the provider queue reaches capacity, DeepIntShield’s behavior is controlled by drop_excess_requests:
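With false (the default), new requests block until queue space is available:

    { "client": { "drop_excess_requests": false } }

With true, new requests are dropped immediately when the queue is full:

    { "client": { "drop_excess_requests": true } }

Dropped requests fail with the error:

    "request dropped: queue is full"

| Metric | Healthy Range | Action if Exceeded |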
|---|---|---|
| Queue depth | < 50% of buffer_size | Increase buffer or concurrency |
| Request latency (p99) | < 2x average | Check provider rate limits |
| Dropped requests | 0 | Increase buffer_size |
| Memory usage | Stable | Reduce pool/buffer sizes |
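The thresholds in this table can be turned into a simple check. The sketch below is illustrative only: the `Snapshot` fields are hypothetical names, not the metric names exposed by the gateway, and memory stability is left out because it needs a trend over time rather than a single reading.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One reading of the monitored values (field names are hypothetical)."""
    queue_depth: int
    buffer_size: int
    p99_latency_ms: float
    avg_latency_ms: float
    dropped_requests: int


def recommended_actions(s: Snapshot) -> list[str]:
    """Return the table's suggested action for each threshold that is exceeded."""
    actions = []
    if 2 * s.queue_depth >= s.buffer_size:  # healthy: < 50% of buffer_size
        actions.append("Increase buffer or concurrency")
    if s.p99_latency_ms >= 2 * s.avg_latency_ms:  # healthy: < 2x average
        actions.append("Check provider rate limits")
    if s.dropped_requests > 0:  # healthy: 0
        actions.append("Increase buffer_size")
    return actions
```

An empty list means every checked metric is inside its healthy range.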
The Gateway exposes health and metrics endpoints:

    # Health check
    curl https://app.deepintshield.com/health

    # Prometheus metrics
    curl https://app.deepintshield.com/metrics

Start Conservative
Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.
Monitor Continuously
Track queue depths, latencies, and error rates. Adjust settings based on real traffic patterns.
Match Provider Limits
Don’t set concurrency higher than provider rate limits allow. You’ll just get rate-limited.
Plan for Bursts
Set buffer_size to 1.5x concurrency to handle traffic spikes without dropping requests.

    // Formula
    concurrency = expected_rps
    buffer_size = 1.5 × expected_rps
    initial_pool_size = 1.5 × total_rps (across all providers)

    // Example: 500 RPS per provider, 2 providers (1000 total RPS)
    concurrency: 500, buffer_size: 750, initial_pool_size: 1500

    // Example: 2000 RPS per provider, 3 providers (6000 total RPS)
    concurrency: 2000, buffer_size: 3000, initial_pool_size: 9000

    // Multi-node formula
    per_node_value = total_value / number_of_nodes
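Putting the quick-reference formulas together, the sketch below builds a configuration in the JSON layout used throughout this page. The function name `build_config` is hypothetical, `keys` is left empty as in the examples above, and rounding up on uneven division is an assumption.

```python
import json


def ceil_div(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


def build_config(provider_rps: dict[str, int], nodes: int = 1) -> dict:
    """Build per-node settings from expected RPS per provider."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    providers = {}
    largest_buffer = 0
    for name, rps in provider_rps.items():
        buffer_size = ceil_div(3 * rps, 2 * nodes)  # 1.5 × rps / nodes
        largest_buffer = max(largest_buffer, buffer_size)
        providers[name] = {
            "keys": [],
            "concurrency_and_buffer_size": {
                "concurrency": ceil_div(rps, nodes),
                "buffer_size": buffer_size,
            },
        }
    total_rps = sum(provider_rps.values())
    # 1.5 × total RPS / nodes, but never below the largest provider buffer.
    pool = max(ceil_div(3 * total_rps, 2 * nodes), largest_buffer)
    return {
        "client": {"initial_pool_size": pool, "drop_excess_requests": False},
        "providers": providers,
    }


config = build_config({"openai": 500, "anthropic": 500})
config_json = json.dumps(config, indent=2)
```

The resulting `config_json` string can be compared against the first quick-reference example before adapting the inputs to real traffic figures.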