Agentic Observability

Overview

Agentic Observability gives you a complete, self-hosted view of what your agents actually did. Each agent run is grouped into a single timeline that captures every LLM call (with tokens and cost) and every tool call (with its allow / deny / mask / approval verdict, policy, obligations, and decision latency). On top of that timeline, the platform attaches per-run quality scores so you can track agent behavior over time.

Everything stays inside your own deployment. Traces are authoritative locally, and you can optionally fan them out to a self-hosted Langfuse instance or any OTLP backend you already run - without shipping raw prompt or completion text off your network. By default, tool arguments are recorded as digests only, so you get full visibility without creating a new data-exfiltration surface.
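To illustrate what recording an argument digest can look like, here is a minimal sketch. The hash function (SHA-256), the canonical JSON serialization, and the attribute names `tool.name` and `tool.args_digest` are assumptions made for illustration; the platform's actual digest scheme is not specified here.

```python
import hashlib
import json


def args_digest(arguments: dict) -> str:
    """Return a SHA-256 hex digest of canonically serialized tool arguments."""
    # Sort keys and drop insignificant whitespace so that equal arguments
    # always produce the same digest, regardless of key order.
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical span attributes: only the digest is stored, never raw values.
span_attributes = {
    "tool.name": "send_email",
    "tool.args_digest": args_digest({"to": "alice@example.com", "body": "hi"}),
}
```

A digest like this lets you tell whether two calls used identical arguments without retaining the arguments themselves.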

Telemetry is emitted asynchronously, off the decision hot path, so turning observability on adds no measurable latency to your agents.

Key benefits

One timeline per agent run - answer “what did this agent actually do?” without standing up extra infrastructure.
Security-native traces - each tool call carries its verdict, policy, enforcement context, and obligations alongside cost and token usage, so you debug agent behavior and policy enforcement in the same place.
Per-run quality scores - deterministic platform scores, plus an optional LLM-as-judge score, surface intent, integrity, latency, and red-team signals.
No vendor lock-in - export byte-identical spans to a self-hosted Langfuse and/or any OTLP backend at the same time.
Zero-data-retention by default - arguments are sent as digests, and a redaction pass strips sensitive attributes before export.

When to use it

Turn on Agentic Observability when you want to:

Investigate why a specific agent run was blocked, masked, or sent for approval, step by step.
Track per-run and per-virtual-key LLM cost and token usage for FinOps reporting.
Monitor agent quality and integrity trends (intent match, tool integrity, latency budget) over time.
Feed agent traces and scores into a Langfuse project or your existing OpenTelemetry backend (Collector, Tempo, Honeycomb, Datadog OTLP) without re-instrumenting your agents.
Visualize which agents call which tools, and where denials cluster, using the service graph.

This is the operational, sampled view of your agents. It complements the immutable, hash-chained agent decision audit used for compliance, which remains a separate store.

Where to view it

Open Agentic Observability from the workspace sidebar. It has three sections, plus a dedicated analytics dashboard.

Overview

The Overview page shows workspace-scoped KPI tiles - total runs, active runs, average latency, total cost, tool calls (with allow / deny counts), and blocked runs - followed by a Recent runs table. Each tile deep-links into a pre-filtered traces list or the analytics dashboard. Click any run to open its timeline.

Agent Traces

The Agent Traces list shows one row per run: when it started, the primary tool and agent chain, the caller (tenant + virtual key name), tool-call count, a verdict breakdown (allow / deny / mask / approval), latency, cost, and run status. Filter the list by status - All, Running, Awaiting, Blocked, Complete, or Error.

Open a trace for its unified execution view (a single screen - no tabs):

Summary header - caller, virtual key, primary tool and model, observation count with verdict tallies, status, latency, cost, tokens.
Execution map - the run’s “agentic fingerprint”: each step is a node labelled with the actual tool / action name (not “Step N”) and colored by its verdict (allow / deny / approval / mask); parallel agents fan out from their shared parent. Toggle to a list view for the indented timeline. Click a node to inspect it.
Step inspector - for the selected step: verdict, risk score (0–100 + band), policy, reason, decision latency, the OWASP finding with its NIST / ISO 42001 / EU AI Act / MITRE ATLAS crosswalk, obligations, args digest, and - when capture is enabled - the masked prompt / input.
Agent identity card - the acting agent’s authn binding (identity provider, virtual key, directory), ownership from the registry (owner, team, creator, version), and its risk score; a Register call-to-action when the agent is a shadow (unregistered).
Audit trail - the run’s hash-chained, tamper-evident decision records (sequence · tool · verdict · policy · prev_hash → hash) with a continuity-verified badge. This is the compliance system-of-record, distinct from the sampled observability timeline.
Langfuse insights - when a Langfuse endpoint is configured, server-computed model, token, cost, tags, and scores, with an Open in Langfuse deep link.
Local scores - per-run scores aggregated by metric, with sample count, average, and range.
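The continuity check behind the Audit trail's hash chain (sequence · tool · verdict · policy · prev_hash → hash) can be sketched as follows. This is an illustration of the general technique only: the `DecisionRecord` type, the all-zero genesis value, the SHA-256 hash, and the exact fields covered by the hash are assumptions, not the platform's record format.

```python
import hashlib
import json
from dataclasses import dataclass

GENESIS = "0" * 64  # assumed prev_hash of the first record


@dataclass(frozen=True)
class DecisionRecord:
    sequence: int
    tool: str
    verdict: str
    policy: str
    prev_hash: str
    hash: str


def record_hash(sequence, tool, verdict, policy, prev_hash) -> str:
    """Hash the record fields together with the previous record's hash."""
    payload = json.dumps([sequence, tool, verdict, policy, prev_hash],
                         separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(chain: list, tool: str, verdict: str, policy: str) -> None:
    """Append a record that links to the current tail of the chain."""
    prev = chain[-1].hash if chain else GENESIS
    seq = len(chain) + 1
    digest = record_hash(seq, tool, verdict, policy, prev)
    chain.append(DecisionRecord(seq, tool, verdict, policy, prev, digest))


def verify_continuity(chain: list) -> bool:
    """Return True only if every record links to and matches its predecessor."""
    prev = GENESIS
    for expected_seq, rec in enumerate(chain, start=1):
        if rec.sequence != expected_seq or rec.prev_hash != prev:
            return False
        recomputed = record_hash(rec.sequence, rec.tool, rec.verdict,
                                 rec.policy, rec.prev_hash)
        if rec.hash != recomputed:
            return False
        prev = rec.hash
    return True
```

Because each hash covers the previous one, editing or removing any earlier record invalidates every record after it, which is what makes the trail tamper-evident.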

Scores

Open Analytics → Agentic Observability and select the Scores tab to see quality scores across all runs in the selected time window. Five metrics are tracked:

Score	What it measures	Source	Tier
`latency_budget`	How well a decision stayed within its per-tool added-latency budget.	Platform (deterministic)	All
`tool_integrity`	`1.0` when a tool call matched its declared contract, decaying as divergence risk rises.	Platform (deterministic)	All
`intent_match`	Whether the call matches the stated task intent.	Platform / LLM	Business
`llm_judge`	Optional response-quality scoring by a sampled, locally-run LLM judge.	LLM-as-judge	Business
`redteam_containment`	Containment against a red-team corpus mapped to the OWASP Agentic Top 10.	External ingest	Business

The Scores tab renders a rolling-average trend, a score distribution, and a Score vs status correlation so you can see how quality relates to blocked or errored runs. The deterministic platform scores (latency_budget, tool_integrity) cost nothing on the hot path and incur no LLM spend.
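If you export scores and want to reproduce a rolling-average trend or a per-metric summary (sample count, average, range) yourself, a minimal sketch follows. The input shapes, a plain list of values and `(metric, value)` pairs, are assumptions for illustration; the window size the dashboard uses is not documented here.

```python
from collections import deque
from statistics import mean


def rolling_average(values, window):
    """Return the mean of the last `window` values at each position."""
    recent = deque(maxlen=window)  # old values fall out automatically
    trend = []
    for value in values:
        recent.append(value)
        trend.append(mean(recent))
    return trend


def summarize(scores):
    """Aggregate (metric, value) pairs into count, average, and range."""
    by_metric = {}
    for metric, value in scores:
        by_metric.setdefault(metric, []).append(value)
    return {
        metric: {
            "count": len(vals),
            "avg": mean(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for metric, vals in by_metric.items()
    }
```

For example, `rolling_average([1.0, 0.5, 0.0], window=2)` smooths a falling `tool_integrity` series into `[1.0, 0.75, 0.25]`.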

Service graph (agent-to-tool map)

Under Agentic Security → Service Graph, you get a workspace-scoped “who-calls-whom” map: agent (caller) nodes connected to tool nodes, plus agent-to-agent delegation hops. Every edge is colored by its dominant verdict, so denials and approval-gated paths stand out at a glance. Pick a time window (last hour, 24 hours, or 7 days), drag nodes to rearrange, click a node to inspect its verdict mix and busiest neighbors, and double-click to jump to the related tool or approvals page. From an agent node you can deep-link straight to its Traces.
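The "dominant verdict" coloring of an edge can be thought of as a per-edge tally. The sketch below assumes decisions arrive as `(caller, tool, verdict)` tuples and that ties go to the verdict seen first; both are illustrative assumptions, and the platform's own tie-breaking rule is not documented here.

```python
from collections import Counter, defaultdict


def dominant_verdicts(decisions):
    """Map each (caller, tool) edge to its most frequent verdict."""
    tallies = defaultdict(Counter)
    for caller, tool, verdict in decisions:
        tallies[(caller, tool)][verdict] += 1
    # most_common(1) yields the highest count; equal counts keep
    # first-seen order, so ties resolve to the earliest verdict.
    return {edge: counts.most_common(1)[0][0] for edge, counts in tallies.items()}


# Hypothetical agent and tool names.
edges = dominant_verdicts([
    ("support-agent", "crm.lookup", "allow"),
    ("support-agent", "crm.lookup", "allow"),
    ("support-agent", "crm.lookup", "deny"),
    ("billing-agent", "refunds.issue", "deny"),
])
```

Here the `support-agent` to `crm.lookup` edge would be colored as allow, while the `billing-agent` to `refunds.issue` edge would stand out as deny.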

The analytics dashboard also includes Volume, Cost & Tokens, Latency, Status, Agents & VKs, and Observations tabs, with server-aggregated charts (top cost traces, top virtual keys, agent chains, sessions, models, step types, and error reasons) that stay accurate at any row count.

Posture, Findings & In-Line Control

The Agentic sidebar group opens on Executions (the runs hub) and carries a sub-tab strip that links to the security surfaces below. All three are derived from the same decision records, are read-only, and run off the decision hot path:

Posture - the agentic equivalent of a posture-gaps dashboard: detected gaps grouped by OWASP Agentic category, risk-scored, with KPI tiles, a severity donut, and a Compliance coverage card that aggregates the distinct OWASP / NIST AI RMF / ISO 42001 / EU AI Act / MITRE ATLAS controls touched by your flagged decisions.
Findings - every flagged action across the workspace as a triage card: severity, OWASP mapping, the compliance crosswalk, decision evidence, and why this is a risk / how to investigate / how to respond. Filter by severity or search by tool / OWASP / agent.
In-Line Control - every action that was blocked or held for approval, as an enforcement table (rule · prevention type · agent · tool · time) with a drill-in taint/integrity flow and the full decision evidence.

These surfaces never add latency: they only aggregate the decision and observation rows that the audit and analytics pipelines already write.

Configuration

Configuration splits into two parts: per-workspace governance, sampling, and score toggles you control in the Web UI, and the deployment-level export transport (Langfuse / OTLP endpoint, keys, and headers) set in the server environment.

Web UI

Go to Agentic Observability → Settings.

Governance & sampling

Mask / redact before ingest - apply masking and redaction before any span leaves the platform.
Args sent as digest only - record tool arguments as digests, never raw values.
Capture prompt preview (opt-in) - off by default (preserving zero-data-retention). When on, a PII-masked, ≤480-character snippet of each step’s prompt / input is stored so the execution view can show what was asked per action. Without it (or a connected Langfuse), prompt text is simply not shown.
Security spans sample rate (%) - what fraction of decision spans to capture. Keep this at 100% so verdicts are never missed.
Payload sample rate (%) - what fraction of full-payload spans to capture (default 20%).
Dual-export to metrics / Jaeger - also emit to your metrics / tracing pipeline.
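To make the two sample rates concrete, here is one common way a fractional rate is applied: deterministic, hash-based sampling keyed on the trace ID, so the same run always gets the same decision. This is an illustration of the general technique, not the platform's documented algorithm; the `sampled` function and the use of SHA-256 are assumptions.

```python
import hashlib


def sampled(trace_id: str, rate: float) -> bool:
    """Deterministically decide whether a trace falls inside `rate` (0.0 to 1.0)."""
    if rate >= 1.0:
        return True  # e.g. security spans at 100%: nothing is dropped
    if rate <= 0.0:
        return False
    # Map the trace ID to a stable bucket in [0, 1) and compare to the rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


capture_security = sampled("run-42", 1.0)  # always captured
capture_payload = sampled("run-42", 0.2)   # captured for roughly 1 in 5 runs
```

With a scheme like this, lowering the payload rate reduces storage while decision spans at 100% still record every verdict.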

Scores & evaluations - toggle which scores attach to traces:

Red-team containment
Intent-match
Latency-budget
LLM-as-judge (custom) - optional; off by default.

AI tool summaries - optionally enable an LLM-written “what this tool does” summary (shown in Approvals and Tools). When enabled, pick a Virtual API key and a Model from that key. The summary call runs asynchronously, off the decision hot path; without a key/model the deterministic, schema-derived summary is always shown.

Use Test connection to probe the configured Langfuse endpoint, then Save. Changes apply on the next decision span this workspace emits.

Server environment

The export transport is set with server environment variables. Per-workspace toggles (sampling, scores, masking) remain editable in the Settings UI.

Self-hosted Langfuse export

# Base URL of your in-cluster Langfuse v3 instance
DEEPINTSHIELD_LANGFUSE_ENDPOINT=http://langfuse-web.langfuse.svc.cluster.local:3000

# Optional: keys used to read back model / token / cost enrichment
DEEPINTSHIELD_LANGFUSE_PUBLIC_KEY=pk-lf-...
DEEPINTSHIELD_LANGFUSE_SECRET_KEY=sk-lf-...

# Optional: pre-encoded Basic auth string for OTel ingestion
DEEPINTSHIELD_LANGFUSE_AUTH_BASIC=...

When DEEPINTSHIELD_LANGFUSE_ENDPOINT is set, agent runs stream to Langfuse asynchronously, and the trace detail page enriches each run with Langfuse-side pricing, tags, and scores. Leave it unset to keep the local trace store as the only copy.
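A pre-encoded Basic auth string is conventionally the Base64 encoding of `username:password`; with Langfuse, the public key typically serves as the username and the secret key as the password. The sketch below shows that encoding. Whether `DEEPINTSHIELD_LANGFUSE_AUTH_BASIC` expects the bare Base64 token or a value prefixed with `Basic ` is an assumption to verify against your deployment.

```python
import base64


def basic_auth_value(public_key: str, secret_key: str) -> str:
    """Base64-encode 'public_key:secret_key' as used by HTTP Basic auth."""
    token = f"{public_key}:{secret_key}".encode("utf-8")
    return base64.b64encode(token).decode("ascii")


# Placeholder keys for illustration only; substitute your real project keys.
encoded = basic_auth_value("pk-lf-example", "sk-lf-example")
```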

Vendor-neutral OTLP fan-out

Point DEEPINTSHIELD_OTLP_ENDPOINT at any OTLP/HTTP backend - an OpenTelemetry Collector, Grafana Tempo, Honeycomb, or Datadog OTLP. When set, spans are exported there in addition to Langfuse (if configured), so traces are portable to whatever you already run.

# OTLP/HTTP collector base URL
DEEPINTSHIELD_OTLP_ENDPOINT=https://otlp.example.com

# Optional comma-separated key=value headers (e.g. auth, tenant scoping)
DEEPINTSHIELD_OTLP_HEADERS=Authorization=Bearer abc123,X-Scope-OrgID=team
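The comma-separated `key=value` format can be checked before deployment with a small parser. The sketch below is a hypothetical helper, not the platform's own parser; it splits each entry on the first `=` only, so values that themselves contain `=` survive intact.

```python
def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict, splitting each pair on the first '='."""
    headers = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas and empty entries
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed header entry: {pair!r}")
        headers[key.strip()] = value.strip()
    return headers


headers = parse_otlp_headers("Authorization=Bearer abc123,X-Scope-OrgID=team")
# headers == {"Authorization": "Bearer abc123", "X-Scope-OrgID": "team"}
```

Note that a header value containing a literal comma cannot be expressed in this simple format without some form of escaping, which this sketch does not attempt.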

Option reference

Per-workspace settings (Settings UI):

Setting	Field	Default	Description
Mask / redact before ingest	`mask_before_ingest`	`true`	Mask and redact before any span is exported.
Args sent as digest only	`args_digest_only`	`true`	Record tool arguments as digests only.
Security spans sample rate	`security_spans_sample_rate`	`1.0` (100%)	Fraction of decision spans captured.
Payload sample rate	`payload_sample_rate`	`0.2` (20%)	Fraction of full-payload spans captured.
Dual-export to metrics / Jaeger	`dual_export_metrics`	`true`	Also emit to your metrics / tracing pipeline.
Red-team containment score	`score_red_team`	`true`	Attach `redteam_containment` to traces.
Intent-match score	`score_intent`	`true`	Attach `intent_match` to traces.
Latency-budget score	`score_latency`	`true`	Attach `latency_budget` to traces.
LLM-as-judge score	`score_llm_judge`	`false`	Attach a custom LLM-judge score.
AI tool summaries	`summary_enabled`	`false`	Upgrade tool summaries to an LLM-written explanation.
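For reviewing or templating these settings outside the UI, the defaults above can be mirrored in a small data class. The field names and defaults come from the table; the `ObservabilitySettings` class, its validation, and the `warnings` helper are hypothetical and do not represent a platform API.

```python
from dataclasses import dataclass


@dataclass
class ObservabilitySettings:
    # Defaults mirror the per-workspace settings table.
    mask_before_ingest: bool = True
    args_digest_only: bool = True
    security_spans_sample_rate: float = 1.0
    payload_sample_rate: float = 0.2
    dual_export_metrics: bool = True
    score_red_team: bool = True
    score_intent: bool = True
    score_latency: bool = True
    score_llm_judge: bool = False
    summary_enabled: bool = False

    def __post_init__(self) -> None:
        # Sample rates are fractions, so they must lie in [0.0, 1.0].
        for name in ("security_spans_sample_rate", "payload_sample_rate"):
            rate = getattr(self, name)
            if not 0.0 <= rate <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0, got {rate}")

    def warnings(self) -> list[str]:
        """Flag choices that weaken the default privacy or coverage posture."""
        found = []
        if self.security_spans_sample_rate < 1.0:
            found.append("security spans sampled below 100%; verdicts may be missed")
        if not self.mask_before_ingest:
            found.append("masking before ingest is off")
        if not self.args_digest_only:
            found.append("raw tool arguments may be recorded")
        return found
```

Calling `ObservabilitySettings().warnings()` returns an empty list, reflecting that the defaults already match the recommended posture.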

Deployment-level export (server environment):

Variable	Description
`DEEPINTSHIELD_LANGFUSE_ENDPOINT`	Base URL of your self-hosted Langfuse instance.
`DEEPINTSHIELD_LANGFUSE_PUBLIC_KEY` / `DEEPINTSHIELD_LANGFUSE_SECRET_KEY`	Keys for Langfuse enrichment read-back.
`DEEPINTSHIELD_LANGFUSE_AUTH_BASIC`	Pre-encoded Basic auth for Langfuse OTel ingestion.
`DEEPINTSHIELD_OTLP_ENDPOINT`	OTLP/HTTP collector base URL for vendor-neutral fan-out.
`DEEPINTSHIELD_OTLP_HEADERS`	Comma-separated `key=value` headers for the OTLP exporter.

Examples

Self-hosted Langfuse only

Send traces to your own Langfuse instance and keep everything in-cluster:

DEEPINTSHIELD_LANGFUSE_ENDPOINT=http://langfuse-web.langfuse.svc.cluster.local:3000
DEEPINTSHIELD_LANGFUSE_PUBLIC_KEY=pk-lf-...
DEEPINTSHIELD_LANGFUSE_SECRET_KEY=sk-lf-...

In Settings, keep security spans at 100%, payloads at 20%, and enable the platform scores you care about. Open any run from Agent Traces to see the timeline plus Langfuse insights.

Fan out to Langfuse and an OTLP backend at once

Send the same spans to Langfuse and to Honeycomb (or a Collector, Tempo, or Datadog OTLP):

DEEPINTSHIELD_LANGFUSE_ENDPOINT=http://langfuse-web.langfuse.svc.cluster.local:3000
DEEPINTSHIELD_OTLP_ENDPOINT=https://api.honeycomb.io
DEEPINTSHIELD_OTLP_HEADERS=x-honeycomb-team=YOUR_API_KEY

Maximum privacy posture

Keep traces authoritative locally with no external export - leave both endpoints unset - and confirm in Settings that Mask / redact before ingest and Args sent as digest only are on. You still get the full local timeline, scores, and service graph.

Next steps

OpenTelemetry (OTel) - full reference for OTLP export formats, collectors, and headers.
Built-in Observability - request-level tracing, metrics, and dashboards for all LLM traffic.
Virtual keys - scope traces, cost, and the AI tool-summary model by virtual key.