Guardrails

Guardrails in DeepIntShield provide content safety, security validation, and policy enforcement for LLM requests and responses. The system validates inputs and outputs in real time against your policies, protecting against harmful content, prompt injection, PII leakage, and policy violations.

Guardrails overview showing rules and profiles management

You set up guardrails with two building blocks. You create Profiles first, then write Rules that point at them:

| Building block | What you do with it |
| --- | --- |
| Profiles | Connect an external guardrail provider (AWS Bedrock, Azure Content Safety, GraySwan, or Patronus AI) once with its credentials and detection settings. A profile is reusable - link it from as many rules as you like. |
| Rules | Decide when and what to check, using a CEL (Common Expression Language) condition. A rule applies to inputs, outputs, or both, and links to one or more profiles. Link several profiles to one rule for layered (defense-in-depth) protection. |

| Feature | Description |
| --- | --- |
| Multi-Provider Support | AWS Bedrock, Azure Content Safety, GraySwan, and Patronus AI integration |
| Dual-Stage Validation | Guard both inputs (prompts) and outputs (responses) |
| Real-Time Processing | Synchronous and asynchronous validation modes |
| CEL-Based Rules | Define custom policies using Common Expression Language |
| Reusable Profiles | Configure providers once, use across multiple rules |
| Sampling Control | Apply rules to a percentage of requests for performance tuning |
| Automatic Remediation | Block, redact, or modify content based on policy |
| Comprehensive Logging | Detailed audit trails for compliance |

Access Guardrails from the DeepIntShield dashboard:

| Page | Path | Description |
| --- | --- | --- |
| Configuration | Guardrails > Configuration | Manage guardrail rules and their settings |
| Providers | Guardrails > Providers | Configure and manage guardrail profiles |

When a request comes in, your input rules check the prompt before it reaches the provider, and your output rules check the response before it returns. A check that fails can block, redact, or modify the content based on the matching profile and policy.


DeepIntShield integrates with leading guardrail providers to offer comprehensive protection:

Amazon Bedrock Guardrails provides enterprise-grade content filtering and safety features with deep AWS integration.

AWS Bedrock Guardrails configuration form

Capabilities:

  • Content Filters: Hate speech, insults, sexual content, violence, misconduct
  • Denied Topics: Block specific topics or categories
  • Word Filters: Custom profanity and sensitive word blocking
  • PII Protection: Detect and redact 50+ PII entity types
  • Contextual Grounding: Verify responses against source documents
  • Prompt Attack Detection: Identify injection and jailbreak attempts
  • Image Content Support: Analyze images in addition to text (PNG, JPEG)

Configuration Fields:

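| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| access_key | string | No* | - | AWS Access Key ID |
| secret_key | string | No* | - | AWS Secret Access Key |
| bedrock_api_key | string | No* | - | Alternative Bedrock API key (Bearer token) |
| guardrail_arn | string | Yes | - | ARN of the Bedrock guardrail |
| guardrail_version | string | Yes | - | Version of the guardrail (e.g., “1”, “DRAFT”) |
| region | string | Yes | - | AWS region |

* None of the credential fields is required on its own: bedrock_api_key is an alternative to the access_key / secret_key pair.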

Authentication Methods:

With static credentials, the profile authenticates through the AWS SDK:

{
  "access_key": "AKIAXXXXXXXXXXXXXXXXXX",
  "secret_key": "your-secret-access-key",
  "guardrail_arn": "arn:aws:bedrock:us-east-1:123456789012:guardrail/abc123",
  "guardrail_version": "1",
  "region": "us-east-1"
}
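Before saving a profile, it can help to check a Bedrock configuration locally. The sketch below is illustrative only: `validate_bedrock_config` is a hypothetical helper, not part of DeepIntShield, and the rule that either the static key pair or `bedrock_api_key` must be present is an assumption drawn from the fields table.

```python
import json

# Fields the configuration table marks as required.
REQUIRED_FIELDS = ("guardrail_arn", "guardrail_version", "region")


def validate_bedrock_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks complete."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not config.get(name)
    ]
    # Assumption: one of the two credential styles must be supplied.
    has_static_keys = bool(config.get("access_key")) and bool(config.get("secret_key"))
    has_api_key = bool(config.get("bedrock_api_key"))
    if not (has_static_keys or has_api_key):
        problems.append("provide access_key and secret_key, or bedrock_api_key")
    return problems


config = json.loads("""
{
  "access_key": "AKIAXXXXXXXXXXXXXXXXXX",
  "secret_key": "your-secret-access-key",
  "guardrail_arn": "arn:aws:bedrock:us-east-1:123456789012:guardrail/abc123",
  "guardrail_version": "1",
  "region": "us-east-1"
}
""")
print(validate_bedrock_config(config))  # prints []
```

An empty list means every required field and one credential style are present; it does not confirm that the credentials or the ARN are valid in AWS.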

Supported AWS Regions:

| Region Code | Region Name |
| --- | --- |
| us-east-1 | US East (N. Virginia) |
| us-east-2 | US East (Ohio) |
| us-west-1 | US West (N. California) |
| us-west-2 | US West (Oregon) |
| ap-south-1 | Asia Pacific (Mumbai) |
| ap-northeast-1 | Asia Pacific (Tokyo) |
| ap-northeast-2 | Asia Pacific (Seoul) |
| ap-southeast-1 | Asia Pacific (Singapore) |
| ap-southeast-2 | Asia Pacific (Sydney) |
| eu-central-1 | Europe (Frankfurt) |
| eu-west-1 | Europe (Ireland) |
| eu-west-2 | Europe (London) |
| eu-west-3 | Europe (Paris) |

Supported Content Types:

  • Text content
  • Images (PNG, JPEG formats)

Usage Metrics Returned:

Bedrock guardrails return detailed usage metrics for cost tracking and monitoring:

| Metric | Description |
| --- | --- |
| content_policy_units | Units consumed by content policy evaluation |
| contextual_grounding_policy_units | Units for grounding checks |
| sensitive_information_policy_units | Units for PII detection |
| topic_policy_units | Units for topic filtering |
| word_policy_units | Units for word filtering |
| automated_reasoning_policy_units | Units for reasoning checks |
| content_policy_image_units | Units for image content analysis |

Supported PII Types:

  • Personal identifiers (SSN, passport, driver’s license)
  • Financial information (credit cards, bank accounts)
  • Contact information (email, phone, address)
  • Medical information (health records, insurance)
  • Device identifiers (IP addresses, MAC addresses)

Azure AI Content Safety provides multi-modal content moderation powered by Microsoft’s advanced AI models.

Azure Content Safety configuration form

Capabilities:

  • Severity-Based Filtering: 4-level severity classification (Safe, Low, Medium, High)
  • Multi-Category Detection: Hate, sexual, violence, self-harm content
  • Prompt Shield: Advanced jailbreak and injection detection
  • Indirect Attack Detection: Identify hidden malicious instructions
  • Protected Material: Detect copyrighted content (output only)
  • Custom Blocklists: Define organization-specific blocked terms

Configuration Fields:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| endpoint | string | Yes | - | Azure Content Safety endpoint URL |
| api_key | string | Yes | - | Azure subscription key |
| analyze_enabled | boolean | No | true | Enable content analysis for Hate, Sexual, Violence, SelfHarm |
| analyze_severity_threshold | enum | No | "medium" | Severity level to trigger: low, medium, or high |
| jailbreak_shield_enabled | boolean | No | false | Enable jailbreak detection (input only) |
| indirect_attack_shield_enabled | boolean | No | false | Enable indirect prompt attack detection (input only) |
| copyright_enabled | boolean | No | false | Enable copyrighted content detection (output only) |
| text_blocklist_enabled | boolean | No | false | Enable custom blocklist filtering |
| blocklist_names | array | No | - | List of Azure blocklist names to apply |

Severity Threshold Levels:

| Threshold | Numeric Value | Behavior |
| --- | --- | --- |
| low | 2 | Most strict - blocks severity 2 and above |
| medium | 4 | Balanced - blocks severity 4 and above |
| high | 6 | Least strict - blocks only severity 6 |
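The threshold table can be read as a simple comparison: a category is blocked when its severity is at or above the numeric value of the configured threshold. The following is a minimal sketch of that reading; the function name and the sample severities are made up for illustration.

```python
# Numeric values taken from the severity threshold table.
SEVERITY_THRESHOLDS = {"low": 2, "medium": 4, "high": 6}


def blocked_categories(severities: dict[str, int], threshold: str = "medium") -> list[str]:
    """Return the categories whose severity meets or exceeds the threshold."""
    limit = SEVERITY_THRESHOLDS[threshold]
    return sorted(name for name, severity in severities.items() if severity >= limit)


# Hypothetical per-category severities for one piece of content.
scores = {"Hate": 0, "Sexual": 2, "Violence": 4, "SelfHarm": 6}

print(blocked_categories(scores, "low"))     # ['SelfHarm', 'Sexual', 'Violence']
print(blocked_categories(scores, "medium"))  # ['SelfHarm', 'Violence']
print(blocked_categories(scores, "high"))    # ['SelfHarm']
```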

Detection Categories:

  • Hate and fairness
  • Sexual content
  • Violence
  • Self-harm

Patronus AI specializes in LLM security and safety with advanced evaluation capabilities.

Capabilities:

  • Hallucination Detection: Identify factually incorrect responses
  • PII Detection: Comprehensive personal data identification
  • Toxicity Screening: Multi-language toxic content detection
  • Prompt Injection Defense: Advanced attack pattern recognition
  • Custom Evaluators: Build organization-specific safety checks
  • Real-Time Monitoring: Continuous safety validation

Advanced Features:

  • Context-aware evaluation
  • Multi-turn conversation analysis
  • Custom policy templates
  • Integration with existing safety workflows

GraySwan Cygnal Monitor provides AI safety monitoring with natural language rule definitions and advanced threat detection capabilities.

GraySwan configuration form

Capabilities:

  • Violation Scoring: Continuous 0-1 scale violation detection with configurable thresholds
  • Custom Natural Language Rules: Define safety rules in plain English without code
  • Policy Management: Use pre-built policies from GraySwan platform or create custom ones
  • Indirect Prompt Injection (IPI) Detection: Identify hidden instructions in user inputs
  • Mutation Detection: Detect attempts to manipulate or alter content
  • Reasoning Modes: Choose from fast (“off”), balanced (“hybrid”), or thorough (“thinking”) analysis

Configuration Fields:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| api_key | string | Yes | - | GraySwan API key |
| violation_threshold | number | No | 0.5 | Score threshold (0-1) for triggering intervention. Lower values are more strict. |
| reasoning_mode | enum | No | "off" | Analysis depth: off (fastest), hybrid (balanced), or thinking (most thorough) |
| policy_id | string | No | - | Single custom policy ID from GraySwan platform |
| policy_ids | array | No | - | Multiple policy IDs for aggregated rule evaluation |
| rules | object | No | - | Custom natural language rules as key-value pairs |

Custom Rules Example:

GraySwan custom rules

Rules are defined as key-value pairs where the key is the rule name and the value is a natural language description:

{
  "rules": {
    "no_profanity": "Do not allow profanity or vulgar language",
    "no_pii": "Do not allow personally identifiable information",
    "professional_tone": "Ensure all responses maintain a professional tone"
  }
}
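As a rough illustration of how the GraySwan fields fit together, the sketch below assembles a configuration with the documented defaults and applies the threshold to a score. Both helper names are hypothetical, and treating a score equal to the threshold as a violation is an assumption; the source only says that lower thresholds are stricter.

```python
VALID_REASONING_MODES = {"off", "hybrid", "thinking"}


def build_grayswan_config(api_key, violation_threshold=0.5, reasoning_mode="off", rules=None):
    """Assemble a GraySwan profile config, rejecting out-of-range values early."""
    if not 0.0 <= violation_threshold <= 1.0:
        raise ValueError("violation_threshold must be between 0 and 1")
    if reasoning_mode not in VALID_REASONING_MODES:
        raise ValueError(f"reasoning_mode must be one of {sorted(VALID_REASONING_MODES)}")
    config = {
        "api_key": api_key,
        "violation_threshold": violation_threshold,
        "reasoning_mode": reasoning_mode,
    }
    if rules:
        config["rules"] = dict(rules)  # rule name -> natural language description
    return config


def should_intervene(violation_score: float, config: dict) -> bool:
    """Assumed comparison: intervene when the score reaches the threshold."""
    return violation_score >= config["violation_threshold"]


strict = build_grayswan_config(
    "your-grayswan-api-key",
    violation_threshold=0.3,
    rules={"no_profanity": "Do not allow profanity or vulgar language"},
)
print(should_intervene(0.4, strict))  # True
```

The same score of 0.4 would pass under the default threshold of 0.5, which is what "lower values are more strict" means in practice.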

Detection Features:

  • Real-time violation scoring
  • Multi-rule evaluation
  • IPI attack detection
  • Content mutation monitoring
  • Detailed violation descriptions with rule attribution

Guardrail Rules are custom policies that define when and how content validation occurs. Rules use CEL (Common Expression Language) expressions to evaluate requests and can be linked to one or more profiles for execution.

Guardrail rules list showing configured rules with status and actions

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| id | integer | Yes | Unique identifier for the rule |
| name | string | Yes | Descriptive name for the rule |
| description | string | No | Explanation of what the rule does |
| enabled | boolean | Yes | Whether the rule is active |
| cel_expression | string | Yes | CEL expression for rule evaluation |
| apply_to | enum | Yes | When to apply: input, output, or both |
| sampling_rate | integer | No | Percentage of requests to evaluate (0-100) |
| timeout | integer | No | Execution timeout in milliseconds |
| provider_config_ids | array | No | IDs of profiles to use for evaluation |
  1. Navigate to Rules
    • Go to Guardrails > Configuration
    • Click Add Rule
  2. Configure Rule Settings

Basic Information:

  • Name: Enter a descriptive name (e.g., “Block PII in Prompts”)
  • Description: Explain the rule’s purpose
  • Enabled: Toggle to activate the rule

Evaluation Settings:

  • Apply To: Select when to apply the rule
    • input - Validate incoming prompts only
    • output - Validate LLM responses only
    • both - Validate both inputs and outputs
  • CEL Expression: Define the validation logic
  • Sampling Rate: Set percentage of requests to evaluate (default: 100%)
  • Timeout: Set maximum execution time in milliseconds
  3. Link Profiles

    • Select one or more profiles to use for evaluation
    • Rules will execute all linked profiles in sequence
  4. Save and Test

    • Click Save Rule
    • Use the Test button to validate with sample content
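The settings collected in these steps map onto the rule properties listed earlier. The sketch below shows one way to represent such a rule and to reason about the sampling rate. Both helpers are hypothetical, the `id` is left out on the assumption that the platform assigns it, and the random draw only illustrates what "a percentage of requests" means; it is not the gateway's documented sampling mechanism.

```python
import random

VALID_APPLY_TO = {"input", "output", "both"}


def make_rule(name, cel_expression, apply_to, provider_config_ids,
              sampling_rate=100, timeout=None, description="", enabled=True):
    """Build a rule dict using the documented property names."""
    if apply_to not in VALID_APPLY_TO:
        raise ValueError("apply_to must be input, output, or both")
    if not 0 <= sampling_rate <= 100:
        raise ValueError("sampling_rate must be between 0 and 100")
    rule = {
        "name": name,
        "description": description,
        "enabled": enabled,
        "cel_expression": cel_expression,
        "apply_to": apply_to,
        "sampling_rate": sampling_rate,
        "provider_config_ids": list(provider_config_ids),
    }
    if timeout is not None:
        rule["timeout"] = timeout  # milliseconds
    return rule


def is_sampled(sampling_rate: int, rng: random.Random) -> bool:
    """Illustrative only: evaluate roughly sampling_rate percent of requests."""
    return rng.random() * 100 < sampling_rate


rule = make_rule(
    name="Block PII in Prompts",
    cel_expression='request.messages.exists(m, m.role == "user")',
    apply_to="input",
    provider_config_ids=[1, 2],  # hypothetical profile IDs
    sampling_rate=25,
    timeout=2000,
)
rng = random.Random(42)
evaluated = sum(is_sampled(rule["sampling_rate"], rng) for _ in range(10_000))
print(f"{evaluated} of 10000 requests would be evaluated")
```

With a sampling rate of 25, roughly a quarter of requests are checked, which trades coverage for lower latency and provider cost on high-traffic endpoints.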

CEL (Common Expression Language) provides a powerful way to define rule conditions. Here are common patterns:

Always Apply Rule:

true

Apply to User Messages Only:

request.messages.exists(m, m.role == "user")

Apply to Messages Containing Keywords:

request.messages.exists(m, m.content.contains("confidential"))

Apply Based on Model:

request.model.startsWith("gpt-4")

Apply to Long Prompts:

request.messages.filter(m, m.role == "user").map(m, m.content.size()).sum() > 1000

Combine Multiple Conditions:

request.model.startsWith("gpt-4") && request.messages.exists(m, m.role == "user" && m.content.size() > 500)
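To sanity-check what a condition will match before saving it, it can help to mirror the expression in ordinary code. The functions below are plain-Python readings of three of the patterns above. They are not a CEL evaluator, and the request shape (a `model` string plus a list of messages with `role` and `content`) is assumed from the expressions themselves.

```python
def has_user_message(request: dict) -> bool:
    # request.messages.exists(m, m.role == "user")
    return any(m["role"] == "user" for m in request["messages"])


def mentions_keyword(request: dict, keyword: str = "confidential") -> bool:
    # request.messages.exists(m, m.content.contains("confidential"))
    return any(keyword in m["content"] for m in request["messages"])


def gpt4_with_long_user_message(request: dict, min_chars: int = 500) -> bool:
    # request.model.startsWith("gpt-4") &&
    #   request.messages.exists(m, m.role == "user" && m.content.size() > 500)
    return request["model"].startswith("gpt-4") and any(
        m["role"] == "user" and len(m["content"]) > min_chars
        for m in request["messages"]
    )


sample = {
    "model": "gpt-4o-mini",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this confidential memo."},
    ],
}
print(has_user_message(sample))             # True
print(mentions_keyword(sample))             # True
print(gpt4_with_long_user_message(sample))  # False
```

The last check is false because the only user message is far shorter than 500 characters, even though the model name matches.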

Rules can be linked to multiple profiles for comprehensive validation:

Rule configuration showing linked profiles

Best Practices:

  • Link PII detection rules to profiles with PII capabilities (Bedrock, Patronus)
  • Link content filtering rules to profiles with content safety features (Azure, Bedrock, GraySwan)
  • Use GraySwan for custom natural language rules when you need flexible, readable policies
  • Use multiple profiles for defense-in-depth (e.g., Bedrock + Patronus for PII, Azure + GraySwan for content)
  • Set appropriate timeouts when using multiple profiles
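Because linked profiles run in sequence, their latencies add up, so a rule's timeout has to cover the sum rather than the slowest single profile. The following is a back-of-the-envelope sketch; the profile names, latency figures, and 20% headroom are all hypothetical, and real numbers should come from your own measurements.

```python
def suggested_timeout_ms(profile_latencies_ms: dict[str, int], headroom_percent: int = 20) -> int:
    """Sum sequential profile latencies and add headroom, rounding up to a whole millisecond."""
    total = sum(profile_latencies_ms.values())
    return (total * (100 + headroom_percent) + 99) // 100


# Hypothetical observed latencies for two profiles linked to one PII rule.
latencies = {"bedrock-pii": 300, "patronus-pii": 450}
print(suggested_timeout_ms(latencies))  # 900
```

A timeout sized for only one of these profiles (for example 500 ms) would regularly expire before the second profile finishes.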

Profiles are reusable configurations for external guardrail providers. Each profile contains provider-specific settings including credentials, endpoints, and detection thresholds.

Guardrail profiles list showing configured providers

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| id | integer | Yes | Unique identifier for the profile |
| provider_name | string | Yes | Provider type: bedrock, azure, grayswan, patronus_ai |
| policy_name | string | Yes | Descriptive name for the policy |
| enabled | boolean | Yes | Whether the profile is active |
| config | object | No | Provider-specific configuration |
  1. Navigate to Providers
    • Go to Guardrails > Providers
    • Click Add Profile
Create guardrail profile form
  2. Select Provider Type

    • Choose from: AWS Bedrock, Azure Content Safety, GraySwan, or Patronus AI
  3. Configure Provider Settings

    • Enter credentials and endpoint information
    • Configure detection thresholds and actions
    • See provider-specific setup sections above for detailed configuration
  4. Save Profile

    • Click Save Profile
    • The profile is now available for linking to rules

Each provider offers different capabilities. Choose profiles based on your validation needs:

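| Capability | AWS Bedrock | Azure Content Safety | GraySwan | Patronus AI |
| --- | --- | --- | --- | --- |
| PII Detection | Yes | No | No | Yes |
| Content Filtering | Yes | Yes | Yes | Yes |
| Prompt Injection | Yes | Yes | Yes | Yes |
| Hallucination Detection | No | No | No | Yes |
| Toxicity Screening | Yes | Yes | Yes | Yes |
| Custom Policies | Yes | Yes | Yes | Yes |
| Custom Natural Language Rules | No | No | Yes | No |
| Image Support | Yes | No | No | No |
| IPI Detection | No | Yes | Yes | No |
| Mutation Detection | No | No | Yes | No |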
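When a rule needs several capabilities at once, the matrix can be treated as a lookup. The sketch below encodes the table as data and returns the providers that cover every requested capability; the helper name is illustrative and the data is copied from the matrix above.

```python
PROVIDERS = ["AWS Bedrock", "Azure Content Safety", "GraySwan", "Patronus AI"]

# One tuple per capability, in the same column order as PROVIDERS.
CAPABILITY_MATRIX = {
    "PII Detection":                 (True,  False, False, True),
    "Content Filtering":             (True,  True,  True,  True),
    "Prompt Injection":              (True,  True,  True,  True),
    "Hallucination Detection":       (False, False, False, True),
    "Toxicity Screening":            (True,  True,  True,  True),
    "Custom Policies":               (True,  True,  True,  True),
    "Custom Natural Language Rules": (False, False, True,  False),
    "Image Support":                 (True,  False, False, False),
    "IPI Detection":                 (False, True,  True,  False),
    "Mutation Detection":            (False, False, True,  False),
}


def providers_with(*capabilities: str) -> list[str]:
    """Return providers that support every capability named."""
    return [
        provider
        for index, provider in enumerate(PROVIDERS)
        if all(CAPABILITY_MATRIX[capability][index] for capability in capabilities)
    ]


print(providers_with("PII Detection"))                    # ['AWS Bedrock', 'Patronus AI']
print(providers_with("IPI Detection", "Mutation Detection"))  # ['GraySwan']
print(providers_with("PII Detection", "IPI Detection"))   # []
```

An empty result, as in the last call, signals that no single provider covers the combination, so the rule should link several profiles.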

Profile Organization:

  • Create separate profiles for different use cases (PII, content filtering, etc.)
  • Use descriptive policy names that indicate the profile’s purpose
  • Keep credentials secure using environment variables

Performance Considerations:

  • Enable only the profiles you need to minimize latency
  • Use sampling rates on rules for high-traffic endpoints
  • Set appropriate timeouts to prevent slow requests

Security:

  • Store API keys and credentials in environment variables or secrets managers
  • Regularly rotate credentials
  • Use least-privilege IAM roles for AWS Bedrock

In production, guardrails are attached to a virtual key in the dashboard (Config → Guardrails) and run automatically on every request made with that key - there is no per-request header to set. Just call the gateway as usual:

curl -X POST https://app.deepintshield.com/v1/chat/completions \
  -H "Authorization: Bearer sk-bf-your-virtual-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      { "role": "user", "content": "Help me with this task" }
    ]
  }'

The input guardrails bound to that key run before the prompt reaches the provider, and any output guardrails run on the response before it is returned.

Note: Passing raw guardrail definitions inline on a request (x-bf-input-guardrails / x-bf-output-guardrails) is restricted to the dashboard test-lab simulation flow. For live traffic, bind guardrails to the virtual key as shown above.

DeepIntShield reports the guardrail result in response headers, not in the response body:

| Header | Values | Meaning |
| --- | --- | --- |
| x-deepintshield-guardrail-status | pass, blocked, redacted, flagged | Outcome of the guardrail evaluation |
| x-deepintshield-guardrail-mode | sync, async, shadow | Execution mode the guardrail ran in |

Passed (HTTP 200) - the normal completion is returned, with x-deepintshield-guardrail-status: pass:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "I'd be happy to help..." },
      "finish_reason": "stop"
    }
  ]
}

Blocked (HTTP 403) - the request is rejected with a guardrail_blocked error and x-deepintshield-guardrail-status: blocked:

{
  "error": {
    "type": "guardrail_blocked",
    "code": "guardrail_blocked",
    "message": "Request blocked by guardrail policy"
  }
}

When a guardrail is configured to redact rather than block, the completion is returned (HTTP 200) with the offending content removed and x-deepintshield-guardrail-status: redacted. In monitor / shadow mode the request is never blocked - the violation is only recorded - and the status header reflects flagged.
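Since the verdict arrives in headers, client code has to read them alongside the HTTP status. The sketch below shows one way to fold the documented outcomes into a small result dict. The function name and the returned keys are hypothetical, and reporting "unknown" when a header is absent is an assumption, as the source does not describe that case.

```python
STATUS_HEADER = "x-deepintshield-guardrail-status"
MODE_HEADER = "x-deepintshield-guardrail-mode"


def read_guardrail_result(http_status: int, headers: dict) -> dict:
    """Summarize the guardrail outcome from the status code and response headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    status = lowered.get(STATUS_HEADER, "unknown")
    return {
        "status": status,
        "mode": lowered.get(MODE_HEADER, "unknown"),
        # A completion is returned for pass, redacted, and flagged (all HTTP 200).
        "content_delivered": http_status == 200 and status != "blocked",
        # Anything other than a clean pass is worth logging or reviewing.
        "needs_review": status in {"blocked", "redacted", "flagged"},
    }


blocked = read_guardrail_result(403, {
    "X-DeepIntShield-Guardrail-Status": "blocked",
    "X-DeepIntShield-Guardrail-Mode": "sync",
})
print(blocked["content_delivered"], blocked["needs_review"])  # False True

shadow = read_guardrail_result(200, {
    "x-deepintshield-guardrail-status": "flagged",
    "x-deepintshield-guardrail-mode": "shadow",
})
print(shadow["content_delivered"], shadow["needs_review"])  # True True
```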

By default the guardrail engine evaluates text, chat, responses and passthrough requests. Multimodal guardrails extend the same policy engine, decision path and x-deepintshield-guardrail-* headers to non-text endpoints - so image, audio and video traffic is governed exactly like text, with one consistent verdict model.

| Endpoint | Request type | What is evaluated |
| --- | --- | --- |
| Image generation | image_generation | The prompt (and negative prompt) |
| Image edit | image_edit | The prompt + text embedded in the source images (PNG/JPEG metadata, OCR-style strings) |
| Speech / TTS | speech | The text to synthesize + voice instructions |
| Transcription | transcription | The produced transcript (output side) |
| Video generation | video_generation | The prompt (and negative prompt) |
| Embeddings | embedding | The input text(s) |
| Rerank | rerank | The query + candidate documents |

The text these requests already carry is guarded by the existing text detectors. Binary artifacts (source images, generated images, audio, video) are forwarded to the guard runtime as attachments, each fingerprinted with a content hash for deduplication. In the guard runtime, a modality-extraction stage resolves the attachments to text (document/OCR text today; OCR/STT/keyframe extractors are pluggable), and optional modality detectors (vision/audio safety models) score them natively.

Multimodal guarding is off by default - when disabled, behavior is identical to text-only guarding and there is no added latency on any request. Enable it with environment variables (Helm: deepintshield.guardrails.*):

| Variable | Component | Effect |
| --- | --- | --- |
| GUARDRAILS_MULTIMODAL | gateway | Gate image/audio/video/embedding/rerank requests into the engine and guard the text they carry; forward artifacts as attachments |
| GUARDRAILS_MODALITY_INLINE_BYTES | gateway | Inline raw artifact bytes into the guard request so the extraction stage can process them (otherwise only metadata + content hash are sent) |
| GUARDRAILS_STREAM_ACCUMULATE | gateway | Guard streamed output on a growing window so a violation split across chunks is caught |
| DEEPINTSHIELD_GUARD_MODALITY_EXTRACT | guard runtime | Run the modality-extraction / detector stage over forwarded attachments |
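The idea behind GUARDRAILS_STREAM_ACCUMULATE can be illustrated with a toy example: checking each chunk alone misses a violation that straddles a chunk boundary, while checking the accumulated text catches it. This is a conceptual sketch only. `guard_stream`, `GuardrailViolation`, and the keyword detector are hypothetical stand-ins, not the gateway's implementation.

```python
from typing import Callable, Iterable, Iterator


class GuardrailViolation(Exception):
    """Raised when the accumulated output trips the detector."""


def guard_stream(chunks: Iterable[str], is_violation: Callable[[str], bool]) -> Iterator[str]:
    """Yield chunks while checking the growing window of everything seen so far."""
    window = ""
    for chunk in chunks:
        window += chunk
        if is_violation(window):
            raise GuardrailViolation("violation detected in accumulated output")
        yield chunk


def contains_password(text: str) -> bool:
    # Toy detector standing in for a real guardrail check.
    return "password" in text


chunks = ["The admin pass", "word is hunter2"]

# Checked one at a time, neither chunk contains the full keyword.
print(any(contains_password(chunk) for chunk in chunks))  # False

emitted = []
try:
    for chunk in guard_stream(chunks, contains_password):
        emitted.append(chunk)
except GuardrailViolation:
    print("stream stopped after", len(emitted), "chunk(s)")  # stream stopped after 1 chunk(s)
```

Note that in this sketch the first chunk has already been emitted by the time the violation is detected; how a real deployment handles text that was already streamed is outside what the source describes.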
  • Zero added latency when off, and near-zero when on for the common case: the text already present (prompts, transcripts) reuses the existing text path.
  • Bounded work - per-request attachment count, per-asset byte size and extracted-text length are all capped, and identical assets are deduplicated by content hash, so a large image or video can never tie up a worker.
  • Scales horizontally - heavy extraction/detection runs in the separately deployable guard runtime, off the gateway hot path.
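The bounded-work behavior described above (caps on attachment count and size, plus deduplication by content hash) can be sketched as follows. SHA-256, the specific cap values, and the choice to skip oversized assets are all assumptions made for illustration; the source does not state which hash or limits are used.

```python
import hashlib

# Hypothetical limits; the real caps are not documented here.
MAX_ATTACHMENTS = 8
MAX_ASSET_BYTES = 5 * 1024 * 1024


def select_attachments(assets: list[bytes],
                       max_attachments: int = MAX_ATTACHMENTS,
                       max_asset_bytes: int = MAX_ASSET_BYTES) -> list[tuple[str, bytes]]:
    """Return (content_hash, data) pairs, deduplicated and bounded in count and size."""
    seen_hashes = set()
    selected = []
    for data in assets:
        if len(selected) >= max_attachments:
            break  # per-request attachment cap reached
        if len(data) > max_asset_bytes:
            continue  # assumption: oversized assets are skipped
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_hashes:
            continue  # identical asset already queued
        seen_hashes.add(digest)
        selected.append((digest, data))
    return selected


images = [b"image-a", b"image-a", b"image-b"]
print(len(select_attachments(images)))  # 2
```

Two of the three inputs are byte-identical, so only two attachments would be processed, and the caps keep the total work per request bounded regardless of input size.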

Multimodal activity appears on the Guardrail Metrics → Multimodal tab (workspace analytics): a decision timeline by modality, request distribution, decision breakdown, and attachment-level findings - using the same filters and date range as the other guardrail dashboards.