Performance Tuning

Overview

DeepIntShield provides three key performance configuration parameters that control throughput, memory usage, and request handling behavior:

Parameter	Scope	Default	Description
Concurrency	Per Provider	1000	Number of requests processed simultaneously for a provider
Buffer Size	Per Provider	5000	Maximum number of requests queued before new requests block or are dropped
Initial Pool Size	Global	5000	Objects pre-allocated at startup to keep latency consistent under load

Understanding the Parameters

Concurrency (Per Provider)

What it does: Controls how many requests DeepIntShield processes in parallel for each provider.

Impact:

Higher concurrency = More parallel requests to the provider, higher throughput
Lower concurrency = Fewer parallel requests, lower resource usage, easier to stay within provider rate limits

Default: 1000 concurrent requests per provider

{
    "providers": {
        "openai": {
            "keys": [],
            "concurrency_and_buffer_size": {
                "concurrency": 100,
                "buffer_size": 500
            }
        }
    }
}
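The effect of the concurrency setting can be modeled with a semaphore that caps how many requests are in flight at once. The following is a minimal sketch of that idea, not DeepIntShield's actual implementation; `run_with_concurrency` and the simulated provider call are hypothetical names introduced for illustration.

```python
import asyncio


async def run_with_concurrency(num_requests: int, concurrency: int) -> int:
    """Run fake requests with at most `concurrency` in flight; return the peak seen."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def handle_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:            # waits here once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)   # stand-in for the provider call (assumption)
            in_flight -= 1

    await asyncio.gather(*(handle_request() for _ in range(num_requests)))
    return peak


if __name__ == "__main__":
    # The number of simultaneous requests never exceeds the configured cap.
    print(asyncio.run(run_with_concurrency(num_requests=50, concurrency=5)))
```

Requests beyond the cap wait for a free slot, which is the role the per-provider queue plays in the sections that follow.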

Buffer Size (Per Provider)

What it does: Sets the capacity of the request queue for each provider. Incoming requests are held here until DeepIntShield has capacity to process them.

Impact:

Larger buffer = More requests can be queued during traffic spikes, handles burst traffic better
Smaller buffer = Lower memory footprint, faster backpressure signals to clients

Default: 5000 requests per provider queue

Queue-full behavior: Controlled by the global drop_excess_requests setting:

false (default): New requests block until queue space is available
true: New requests are immediately dropped with an error when the queue is full

Initial Pool Size (Global)

What it does: Controls how many objects DeepIntShield pre-allocates at startup to keep latency consistent under high traffic.

Impact:

Higher initial pool = More consistent latency during high traffic, higher initial memory usage
Lower initial pool = Lower initial memory footprint, but runtime allocations may add latency under sudden load

Default: 5000 objects

{
    "client": {
        "initial_pool_size": 10000,
        "drop_excess_requests": false
    }
}

Sizing Guidelines

Concurrency & Buffer Size (Per Provider)

Configure these settings per provider based on the expected RPS for that specific provider:

Provider RPS	Concurrency	Buffer Size
100	100	150
500	500	750
1000	1000	1500
2500	2500	3750
5000	5000	7500
10000	10000	15000

Formula:

concurrency = expected_rps
buffer_size = 1.5 × expected_rps

This ratio ensures:

Enough queue capacity to absorb traffic bursts
Processing capacity is never starved for work
Backpressure is applied before memory exhaustion
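The formula above can be applied with a small helper. This is a sketch only: `size_provider` and its `burst_factor` parameter are hypothetical names, and rounding the buffer size up to a whole number is an assumption for RPS values where 1.5 × RPS is not an integer.

```python
import math


def size_provider(expected_rps: int, burst_factor: float = 1.5) -> dict:
    """Return per-provider settings derived from the expected requests per second."""
    return {
        "concurrency": expected_rps,
        # Round up so the queue is never smaller than the formula asks for (assumption).
        "buffer_size": math.ceil(burst_factor * expected_rps),
    }


if __name__ == "__main__":
    for rps in (100, 500, 2500):
        print(rps, size_provider(rps))
```

The returned keys match the `concurrency` and `buffer_size` fields of the `concurrency_and_buffer_size` block shown earlier.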

Initial Pool Size (Global)

Configure this setting based on total RPS across all providers combined:

Total RPS (All Providers)	Initial Pool Size	Memory Estimate
100	150	~50 MB
500	750	~100 MB
1000	1500	~200 MB
2500	3750	~400 MB
5000	7500	~800 MB
10000	15000	~1.5 GB

Formula:

initial_pool_size = 1.5 × total_expected_rps

Additionally, ensure:

initial_pool_size >= max(buffer_size across all providers)

This ensures pools are pre-warmed to handle peak queue depths without runtime allocations.
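Both conditions can be combined by taking the larger of the two values. A minimal sketch, assuming per-provider RPS and buffer sizes are available as dictionaries; `size_initial_pool` is a hypothetical helper, not part of DeepIntShield.

```python
import math


def size_initial_pool(provider_rps: dict[str, int],
                      provider_buffer_sizes: dict[str, int]) -> int:
    """Return an initial_pool_size that satisfies both sizing rules."""
    total_rps = sum(provider_rps.values())
    by_total_rps = math.ceil(1.5 * total_rps)                  # 1.5 x total expected RPS
    largest_buffer = max(provider_buffer_sizes.values(), default=0)
    return max(by_total_rps, largest_buffer)                   # never below the largest queue


if __name__ == "__main__":
    print(size_initial_pool({"openai": 500, "anthropic": 500},
                            {"openai": 750, "anthropic": 750}))
```

The second rule only changes the result when a provider's buffer is sized well above the 1.5 × RPS guideline.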

Multi-Node Deployments

When running multiple DeepIntShield instances behind a load balancer, size the settings for your total expected RPS, then divide each value by the number of nodes. This assumes the load balancer spreads traffic evenly across nodes.

Formula

Per-Node Concurrency = Total Concurrency / Number of Nodes
Per-Node Buffer Size = Total Buffer Size / Number of Nodes
Per-Node Initial Pool Size = Total Initial Pool Size / Number of Nodes
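As a sketch, the division can be written as a single helper. `per_node` is a hypothetical name, and rounding up when the total does not divide evenly is an assumption the formula itself does not state.

```python
def per_node(total_value: int, number_of_nodes: int) -> int:
    """Divide an aggregate setting across nodes, rounding up (assumption)."""
    if number_of_nodes < 1:
        raise ValueError("number_of_nodes must be at least 1")
    return -(-total_value // number_of_nodes)  # integer ceiling division


if __name__ == "__main__":
    totals = {"concurrency": 10000, "buffer_size": 15000, "initial_pool_size": 15000}
    print({name: per_node(value, 4) for name, value in totals.items()})
```

Rounding up keeps the combined capacity of all nodes at or above the aggregate target.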

Example: 10,000 RPS Across 4 Nodes

Total load (aggregate across all 4 nodes):

Total RPS: 10,000
Per-node RPS: ~2,500

Single-node settings for 10,000 RPS (if everything ran on one node):

Concurrency: 10000
Buffer Size: 15000
Initial Pool Size: 15000

Per-node settings (4 nodes, 10,000 RPS total):

Parameter	Total (Aggregate)	Per Node (4 nodes)
Concurrency	10000	2500
Buffer Size	15000	3750
Initial Pool Size	15000	3750

{
    "client": {
        "initial_pool_size": 3750,
        "drop_excess_requests": false
    },
    "providers": {
        "openai": {
            "keys": [],
            "concurrency_and_buffer_size": {
                "concurrency": 2500,
                "buffer_size": 3750
            }
        },
        "anthropic": {
            "keys": [],
            "concurrency_and_buffer_size": {
                "concurrency": 2500,
                "buffer_size": 3750
            }
        }
    }
}

Provider-Specific Tuning

Different providers have different rate limits and latency characteristics. Tune each provider independently:

Provider Rate Limit Considerations

Provider	Typical Rate Limits	Recommended Concurrency	Notes
OpenAI	500-10000 RPM (varies by tier)	100-500	Higher tiers support more concurrency
Anthropic	1000-4000 RPM (varies by tier)	50-200	More conservative rate limits
Bedrock	Per-model limits	100-300	Check AWS quotas for your account
Azure OpenAI	Deployment-specific	100-500	Configure per-deployment
Vertex AI	Per-model quotas	100-300	Check GCP quotas
Groq	Very high throughput	500-1000	Designed for high concurrency
Ollama	Local resource bound	10-50	Limited by local GPU/CPU
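Rate limits are usually quoted in requests per minute (RPM), while concurrency counts requests in flight. One way to relate the two is Little's law: in-flight requests ≈ arrival rate × average latency. The sketch below applies it; `max_useful_concurrency` is a hypothetical helper, and the average latency is an input you would need to measure for your own workload. It is not a value taken from the table above.

```python
import math


def max_useful_concurrency(rate_limit_rpm: float, avg_latency_seconds: float) -> int:
    """Estimate the concurrency that saturates a rate limit (Little's law)."""
    requests_per_second = rate_limit_rpm / 60
    return math.ceil(requests_per_second * avg_latency_seconds)


if __name__ == "__main__":
    # Assumed workload: 3000 RPM limit, 2 s average response time.
    print(max_useful_concurrency(3000, 2.0))
```

Concurrency above this estimate tends to produce rate-limit errors rather than extra throughput, so treat the result as an upper bound and cross-check it against the recommended ranges in the table.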

Example: Mixed Provider Configuration

{
    "providers": {
        "openai": {
            "keys": [],
            "concurrency_and_buffer_size": {
                "concurrency": 200,
                "buffer_size": 1000
            }
        },
        "anthropic": {
            "keys": [],
            "concurrency_and_buffer_size": {
                "concurrency": 100,
                "buffer_size": 500
            }
        },
        "groq": {
            "keys": [],
            "concurrency_and_buffer_size": {
                "concurrency": 500,
                "buffer_size": 2500
            }
        },
        "ollama": {
            "keys": [],
            "concurrency_and_buffer_size": {
                "concurrency": 20,
                "buffer_size": 100
            }
        }
    }
}
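When many providers are tuned independently, generating the configuration from a table of limits can reduce copy-paste errors. A minimal sketch that emits the same JSON shape as the example above; `build_provider_config` is a hypothetical helper, and `keys` is left empty as in the example.

```python
import json


def build_provider_config(limits: dict[str, tuple[int, int]]) -> dict:
    """Build the providers section from {name: (concurrency, buffer_size)}."""
    return {
        "providers": {
            name: {
                "keys": [],
                "concurrency_and_buffer_size": {
                    "concurrency": concurrency,
                    "buffer_size": buffer_size,
                },
            }
            for name, (concurrency, buffer_size) in limits.items()
        }
    }


if __name__ == "__main__":
    config = build_provider_config({"openai": (200, 1000), "ollama": (20, 100)})
    print(json.dumps(config, indent=4))
```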

Queue Overflow Handling

When the provider queue reaches capacity, DeepIntShield’s behavior is controlled by drop_excess_requests:

Blocking Mode (Default)

{
    "client": {
        "drop_excess_requests": false
    }
}

New requests wait until queue space is available
Ensures no requests are lost
May increase latency during high load
Suitable for critical workloads where every request matters

Drop Mode

{
    "client": {
        "drop_excess_requests": true
    }
}

New requests are immediately rejected when the queue is full
Returns error: "request dropped: queue is full"
Maintains consistent latency for accepted requests
Suitable for real-time applications where stale requests are useless

Monitoring and Diagnostics

Key Metrics to Monitor

Metric	Healthy Range	Action if Outside Range
Queue depth	< 50% of buffer_size	Increase buffer or concurrency
Request latency (p99)	< 2x average	Check provider rate limits
Dropped requests	0	Increase buffer_size
Memory usage	Stable	Reduce pool/buffer sizes
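The thresholds in the table can be turned into a simple check. A minimal sketch, assuming you already collect these values from your own monitoring; `check_health` and its parameter names are hypothetical and do not correspond to metric names exposed by DeepIntShield. Memory stability is omitted because it needs a time series rather than a single reading.

```python
def check_health(queue_depth: int, buffer_size: int,
                 p99_latency_ms: float, avg_latency_ms: float,
                 dropped_requests: int) -> list[str]:
    """Return the recommended actions for any metric outside its healthy range."""
    actions = []
    if queue_depth >= 0.5 * buffer_size:         # healthy: < 50% of buffer_size
        actions.append("Increase buffer or concurrency")
    if p99_latency_ms >= 2 * avg_latency_ms:     # healthy: < 2x average
        actions.append("Check provider rate limits")
    if dropped_requests > 0:                     # healthy: 0
        actions.append("Increase buffer_size")
    return actions


if __name__ == "__main__":
    print(check_health(queue_depth=600, buffer_size=1000,
                       p99_latency_ms=500, avg_latency_ms=200,
                       dropped_requests=3))
```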

Health Check Endpoint

The Gateway exposes health and metrics endpoints:

# Health check
curl https://app.deepintshield.com/health

# Prometheus metrics
curl https://app.deepintshield.com/metrics

Best Practices Summary

Start Conservative

Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.

Monitor Continuously

Track queue depths, latencies, and error rates. Adjust settings based on real traffic patterns.

Match Provider Limits

Don’t set concurrency higher than the provider’s rate limits allow; the extra requests will just be rate-limited.

Plan for Bursts

Set buffer_size to 1.5x concurrency to handle traffic spikes without dropping requests.

Quick Reference

// Formula
concurrency       = expected_rps
buffer_size       = 1.5 × expected_rps
initial_pool_size = 1.5 × total_rps (across all providers)

// Example: 500 RPS per provider, 2 providers (1000 total RPS)
concurrency: 500, buffer_size: 750, initial_pool_size: 1500

// Example: 2000 RPS per provider, 3 providers (6000 total RPS)
concurrency: 2000, buffer_size: 3000, initial_pool_size: 9000

// Multi-node formula
per_node_value = total_value / number_of_nodes

Related Documentation

Provider Configuration - Complete provider setup guide
Custom Providers - Creating custom provider integrations
Deployment - Production deployment guides
