Speculative dispatch

Speculative dispatch fires the provider call in parallel with input-guard evaluation. On the allow path (>95% of typical traffic), the user sees max(guards, model) latency instead of guards + model.

For most workloads this is the single biggest latency win available.
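The allow-path arithmetic can be made concrete with a small sketch. The function names and the 300 ms / 900 ms figures below are hypothetical, chosen only for illustration; they are not measurements. The point is that the saving on the allow path equals min(guards, model), which in most deployments is the guard latency.

```python
def sequential_latency_ms(guards_ms: float, model_ms: float) -> float:
    """Guards run first, then the provider call: latencies add up."""
    return guards_ms + model_ms


def speculative_latency_ms(guards_ms: float, model_ms: float) -> float:
    """Guards and provider call run in parallel: the slower one dominates."""
    return max(guards_ms, model_ms)


# Hypothetical figures, for illustration only.
guards_ms, model_ms = 300.0, 900.0

before = sequential_latency_ms(guards_ms, model_ms)
after = speculative_latency_ms(guards_ms, model_ms)

# The difference is always min(guards, model).
saved = before - after
print(f"sequential={before} ms, speculative={after} ms, saved={saved} ms")
```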

Default: off (opt-in). This is a safety tradeoff - see below. Enable it per deployment once you’re comfortable with the semantics.

# Plugin config (preferred, live-reloadable):
{ "speculative_input_guards": true }
# Or env var:
DEEPINTSHIELD_GUARD_SPECULATIVE_INPUT_GUARDS=true

With this enabled, the provider call and input-guard checks run at the same time. The response is only released once the guard verdict is in.

  • On the allow path: total latency is max(guards, model) instead of guards + model.
  • On the deny path: the model’s response is discarded and the user sees the standard guardrail_blocked error. You pay for one wasted provider call - that’s the safety vs. latency tradeoff.

When to use it

  • Latency-sensitive chat and copilot UIs where a 200–800ms reduction is visible.
  • Workloads with low deny rates (typical: <5%) so the wasted-provider-call cost is small.
  • Workloads that don’t rely on input redaction for safety, since the provider call is fired before the guards have finished evaluating the input.

When not to use it

  • Audit-heavy workloads where guard evaluation must complete before the provider call even starts (some regulated environments).
  • Workloads with high deny rates (>20%) where the wasted-call cost is real.
  • Streaming responses - speculative dispatch is automatically disabled for streaming, since there’s no clean way to discard tokens after the first one hits the wire.
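
The dispatch semantics described above (provider call and guard check started together, response released only after an allow verdict) can be sketched with asyncio. This is a minimal illustration, not the product's implementation: `run_input_guards`, `call_provider`, `GuardrailBlocked`, the "forbidden" marker, and the sleep durations are all hypothetical stand-ins.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    allowed: bool


class GuardrailBlocked(Exception):
    """Hypothetical stand-in for the standard guardrail_blocked error."""


async def run_input_guards(prompt: str) -> Verdict:
    # Hypothetical guard: takes 100 ms and denies prompts containing "forbidden".
    await asyncio.sleep(0.1)
    return Verdict(allowed="forbidden" not in prompt)


async def call_provider(prompt: str) -> str:
    # Hypothetical provider call: takes 200 ms and echoes the prompt.
    await asyncio.sleep(0.2)
    return f"echo: {prompt}"


async def speculative_dispatch(prompt: str) -> str:
    # Start the provider call immediately instead of waiting for the guards.
    provider_task = asyncio.create_task(call_provider(prompt))
    try:
        verdict = await run_input_guards(prompt)
    except BaseException:
        # Guard evaluation failed: never release the speculative response.
        provider_task.cancel()
        raise
    if not verdict.allowed:
        # Deny path: discard the response. Cancelling locally is best-effort;
        # a real provider may already have done (and billed) the work.
        provider_task.cancel()
        raise GuardrailBlocked("guardrail_blocked")
    # Allow path: the response is released only now, after the verdict.
    return await provider_task


if __name__ == "__main__":
    print(asyncio.run(speculative_dispatch("hello")))
```

On the allow path this sketch completes in roughly the provider time (the slower of the two stand-ins) rather than the sum of both. On the deny path the caller sees only the error, never the model output.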