
Agentic Observability

Agentic Observability gives you a complete, self-hosted view of what your agents actually did. Each agent run is grouped into a single timeline that captures every LLM call (with tokens and cost) and every tool call (with its allow / deny / mask / approval verdict, policy, obligations, and decision latency). On top of that timeline, the platform attaches per-run quality scores so you can track agent behavior over time.

Everything stays inside your own deployment. Traces are authoritative locally, and you can optionally fan them out to a self-hosted Langfuse instance or any OTLP backend you already run - without shipping raw prompt or completion text off your network. Arguments are recorded as digests only, so you get full visibility without creating a new data-exfiltration surface.
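As a rough illustration of digest-only recording, tool arguments can be reduced to a fixed-length hash before anything is stored. The sketch below assumes SHA-256 over canonical JSON. The platform's actual digest algorithm and serialization are not specified here, and `args_digest` is a hypothetical name.

```python
import hashlib
import json


def args_digest(arguments: dict) -> str:
    """Return a SHA-256 hex digest of canonically serialized tool arguments.

    Assumption: canonical JSON (sorted keys, compact separators) is the
    serialization; the real platform may use a different scheme.
    """
    canonical = json.dumps(
        arguments, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Key order does not affect the digest, but any value change does.
first = args_digest({"path": "/tmp/report.csv", "mode": "read"})
second = args_digest({"mode": "read", "path": "/tmp/report.csv"})
changed = args_digest({"mode": "write", "path": "/tmp/report.csv"})
```

A digest like this lets two calls be compared for equality in a trace without the raw argument values ever being written to it.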

Telemetry is asynchronous, so turning observability on adds no measurable latency to your agents.

  • One timeline per agent run - answer “what did this agent actually do?” without standing up extra infrastructure.
  • Security-native traces - each tool call carries its verdict, policy, enforcement context, and obligations alongside cost and token usage, so you debug agent behavior and policy enforcement in the same place.
  • Per-run quality scores - deterministic platform scores, plus an optional LLM-as-judge, surface intent, integrity, latency, and red-team signals.
  • No vendor lock-in - export byte-identical spans to a self-hosted Langfuse and/or any OTLP backend at the same time.
  • Zero-data-retention by default - arguments are sent as digests, and a redaction pass strips sensitive attributes before export.

Turn on Agentic Observability when you want to:

  • Investigate why a specific agent run was blocked, masked, or sent for approval, step by step.
  • Track per-run and per-virtual-key LLM cost and token usage for FinOps reporting.
  • Monitor agent quality and integrity trends (intent match, tool integrity, latency budget) over time.
  • Feed agent traces and scores into a Langfuse project or your existing OpenTelemetry backend (Collector, Tempo, Honeycomb, Datadog OTLP) without re-instrumenting your agents.
  • Visualize which agents call which tools, and where denials cluster, using the service graph.

This is the operational, sampled view of your agents. It complements the immutable, hash-chained agent decision audit used for compliance, which remains a separate store.

Open Agentic Observability from the workspace sidebar. It has three sections, plus a dedicated analytics dashboard.

The Overview page shows workspace-scoped KPI tiles - total runs, active runs, average latency, total cost, tool calls (with allow / deny counts), and blocked runs - followed by a Recent runs table. Each tile deep-links into a pre-filtered traces list or the analytics dashboard. Click any run to open its timeline.

The Agent Traces list shows one row per run: when it started, the primary tool and agent chain, the caller (tenant + virtual key name), tool-call count, a verdict breakdown (allow / deny / mask / approval), latency, cost, and run status. Filter the list by status - All, Running, Awaiting, Blocked, Complete, or Error.

Open a trace for its unified execution view (a single screen - no tabs):

  • Summary header - caller, virtual key, primary tool and model, observation count with verdict tallies, status, latency, cost, tokens.
  • Execution map - the run’s “agentic fingerprint”: each step is a node labelled with the actual tool / action name (not “Step N”) and colored by its verdict (allow / deny / approval / mask); parallel agents fan out from their shared parent. Toggle to a list view for the indented timeline. Click a node to inspect it.
  • Step inspector - for the selected step: verdict, risk score (0–100 + band), policy, reason, decision latency, the OWASP finding with its NIST / ISO 42001 / EU AI Act / MITRE ATLAS crosswalk, obligations, args digest, and - when capture is enabled - the masked prompt / input.
  • Agent identity card - the acting agent’s authn binding (identity provider, virtual key, directory), ownership from the registry (owner, team, creator, version), and its risk score; a Register call-to-action when the agent is a shadow (unregistered).
  • Audit trail - the run’s hash-chained, tamper-evident decision records (sequence · tool · verdict · policy · prev_hash → hash) with a continuity-verified badge. This is the compliance system-of-record, distinct from the sampled observability timeline.
  • Langfuse insights - when a Langfuse endpoint is configured, server-computed model, token, cost, tags, and scores, with an Open in Langfuse deep link.
  • Local scores - per-run scores aggregated by metric, with sample count, average, and range.
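The audit trail's `prev_hash → hash` linkage can be pictured as a simple hash chain. The following is a minimal sketch of how continuity could be verified, assuming each record's hash is SHA-256 over its sequence, tool, verdict, policy, and prev_hash fields. The real record layout, hash function, and genesis value are not documented here, so `GENESIS`, `record_hash`, `append_record`, and `verify_chain` are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical prev_hash for the first record in a run


def record_hash(record: dict) -> str:
    """Hash the fields that make up one decision record (assumed layout)."""
    body = {k: record[k] for k in ("sequence", "tool", "verdict", "policy", "prev_hash")}
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain: list[dict], tool: str, verdict: str, policy: str) -> None:
    """Append a record whose prev_hash points at the current chain tip."""
    record = {
        "sequence": len(chain) + 1,
        "tool": tool,
        "verdict": verdict,
        "policy": policy,
        "prev_hash": chain[-1]["hash"] if chain else GENESIS,
    }
    record["hash"] = record_hash(record)
    chain.append(record)


def verify_chain(chain: list[dict]) -> bool:
    """Return True only if sequence numbers, links, and hashes all line up."""
    prev = GENESIS
    for expected_seq, record in enumerate(chain, start=1):
        if record["sequence"] != expected_seq or record["prev_hash"] != prev:
            return False
        if record["hash"] != record_hash(record):
            return False
        prev = record["hash"]
    return True


chain: list[dict] = []
append_record(chain, "search_docs", "allow", "default")
append_record(chain, "send_email", "deny", "no-external-mail")
intact = verify_chain(chain)

# Editing any earlier record invalidates the chain from that point on.
tampered = [dict(r) for r in chain]
tampered[0]["verdict"] = "deny"
broken = verify_chain(tampered)
```

Because every record commits to the one before it, changing or removing a single decision breaks the check for that record and everything that follows it.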

Open Analytics → Agentic Observability and select the Scores tab to see quality scores across all runs in the selected time window. Five scores are tracked:

| Score | What it measures | Source | Tier |
| --- | --- | --- | --- |
| latency_budget | How well a decision stayed within its per-tool added-latency budget. | Platform (deterministic) | All |
| tool_integrity | 1.0 when a tool call matched its declared contract, decaying as divergence risk rises. | Platform (deterministic) | All |
| intent_match | Whether the call matches the stated task intent. | Platform / LLM | Business |
| llm_judge | Optional response-quality scoring by a sampled, locally-run LLM judge. | LLM-as-judge | Business |
| redteam_containment | Containment against a red-team corpus mapped to the OWASP Agentic Top 10. | External ingest | Business |

The Scores tab renders a rolling-average trend, a score distribution, and a Score vs status correlation so you can see how quality relates to blocked or errored runs. The deterministic platform scores (latency_budget, tool_integrity) cost nothing on the hot path and incur no LLM spend.
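To make the rolling-average trend concrete, here is a small sketch of a trailing-window mean over per-run scores. The window size and the function name `rolling_average` are assumptions for illustration. How the Scores tab actually buckets and smooths values is not described here.

```python
from collections import deque


def rolling_average(values: list[float], window: int) -> list[float]:
    """Return the mean of the last `window` values at each position."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent: deque[float] = deque(maxlen=window)  # oldest value drops off automatically
    trend = []
    for value in values:
        recent.append(value)
        trend.append(sum(recent) / len(recent))
    return trend


# Hypothetical tool_integrity scores for four consecutive runs.
scores = [1.0, 1.0, 0.4, 0.7]
trend = rolling_average(scores, window=2)
```

A short window reacts quickly to a single low-scoring run, while a longer one smooths it out, which is the usual trade-off when reading a trend line like this.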

Under Agentic Security → Service Graph, you get a workspace-scoped “who-calls-whom” map: agent (caller) nodes connected to tool nodes, plus agent-to-agent delegation hops. Every edge is colored by its dominant verdict, so denials and approval-gated paths stand out at a glance. Pick a time window (last hour, 24 hours, or 7 days), drag nodes to rearrange, click a node to inspect its verdict mix and busiest neighbors, and double-click to jump to the related tool or approvals page. From an agent node you can deep-link straight to its Traces.
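The edge coloring can be thought of as a per-edge tally of verdicts. The sketch below assumes "dominant" means the most frequent verdict on a caller-to-tool edge within the time window. That definition, the tuple layout, and the name `dominant_verdicts` are illustrative assumptions, not the platform's documented behavior.

```python
from collections import Counter, defaultdict


def dominant_verdicts(decisions: list[tuple[str, str, str]]) -> dict[tuple[str, str], str]:
    """Map each (caller, tool) edge to its most frequent verdict.

    Ties resolve to the verdict seen first, since Counter.most_common
    keeps insertion order among equal counts.
    """
    tallies: dict[tuple[str, str], Counter] = defaultdict(Counter)
    for caller, tool, verdict in decisions:
        tallies[(caller, tool)][verdict] += 1
    return {edge: counts.most_common(1)[0][0] for edge, counts in tallies.items()}


# Hypothetical decisions: (caller, tool, verdict)
decisions = [
    ("billing-agent", "send_email", "deny"),
    ("billing-agent", "send_email", "deny"),
    ("billing-agent", "send_email", "allow"),
    ("billing-agent", "search_docs", "allow"),
    ("support-agent", "search_docs", "approval"),
]
edges = dominant_verdicts(decisions)
```

Here the `billing-agent` to `send_email` edge would be drawn as a denial, which is the kind of cluster the graph is meant to surface.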

The analytics dashboard also includes Volume, Cost & Tokens, Latency, Status, Agents & VKs, and Observations tabs, with server-aggregated charts (top cost traces, top virtual keys, agent chains, sessions, models, step types, and error reasons) that stay accurate at any row count.

The Agentic sidebar group opens on Executions (the runs hub) and carries a sub-tab strip that links to the security surfaces. All of them are read-only views derived from the same decisions, computed off the decision hot path:

  • Posture - the agentic equivalent of a posture-gaps dashboard: detected gaps grouped by OWASP Agentic category, risk-scored, with KPI tiles, a severity donut, and a Compliance coverage card that aggregates the distinct OWASP / NIST AI RMF / ISO 42001 / EU AI Act / MITRE ATLAS controls touched by your flagged decisions.
  • Findings - every flagged action across the workspace as a triage card: severity, OWASP mapping, the compliance crosswalk, decision evidence, and why this is a risk / how to investigate / how to respond. Filter by severity or search by tool / OWASP / agent.
  • In-Line Control - every action that was blocked or held for approval, as an enforcement table (rule · prevention type · agent · tool · time) with a drill-in taint/integrity flow and the full decision evidence.

These surfaces never add latency: they aggregate the decision and observation rows the audit and analytics already write.

Configuration splits into two parts: per-workspace governance, sampling, and score toggles you control in the Web UI, and the deployment-level export transport (Langfuse / OTLP endpoint, keys, and headers) set in the server environment.

Go to Agentic Observability → Settings.

Governance & sampling

  • Mask / redact before ingest - apply masking and redaction before any span leaves the platform.
  • Args sent as digest only - record tool arguments as digests, never raw values.
  • Capture prompt preview (opt-in) - off by default (preserving zero-data-retention). When on, a PII-masked, ≤480-character snippet of each step’s prompt / input is stored so the execution view can show what was asked per action. Without it (or a connected Langfuse), prompt text is simply not shown.
  • Security spans sample rate (%) - what fraction of decision spans to capture. Keep this at 100% so verdicts are never missed.
  • Payload sample rate (%) - what fraction of full-payload spans to capture (default 20%).
  • Dual-export to metrics / Jaeger - also emit to your metrics / tracing pipeline.
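The two sample rates can be read as independent keep-or-drop decisions per span. One common way to implement this is deterministic, hash-based sampling keyed on the trace ID, sketched below. Whether the platform samples this way is an assumption, and `is_sampled` and the salt values are hypothetical.

```python
import hashlib


def is_sampled(trace_id: str, rate: float, salt: str) -> bool:
    """Decide deterministically whether a trace falls inside the sample.

    The trace ID is hashed into a 64-bit bucket and compared against the
    rate using integers, so rate=1.0 always keeps and rate=0.0 never does.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(f"{salt}:{trace_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


trace_ids = [f"trace-{i}" for i in range(10_000)]

# Security spans at 100%: every decision span is kept.
security_kept = sum(is_sampled(t, 1.0, "security") for t in trace_ids)

# Payload spans at 20%: roughly one in five traces carries full payloads.
payload_kept = sum(is_sampled(t, 0.2, "payload") for t in trace_ids)
```

Hashing instead of calling a random number generator means the same trace always gets the same decision, so all spans of one run are kept or dropped together.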

Scores & evaluations - toggle which scores attach to traces:

  • Red-team containment
  • Intent-match
  • Latency-budget
  • LLM-as-judge (custom) - optional; off by default.

AI tool summaries - optionally enable an LLM-written “what this tool does” summary (shown in Approvals and Tools). When enabled, pick a Virtual API key and a Model from that key. The summary call runs asynchronously, off the decision hot path; without a key/model the deterministic, schema-derived summary is always shown.

Use Test connection to probe the configured Langfuse endpoint, then Save. Changes apply on the next decision span this workspace emits.

Per-workspace settings (Settings UI):

| Setting | Field | Default | Description |
| --- | --- | --- | --- |
| Mask / redact before ingest | mask_before_ingest | true | Mask and redact before any span is exported. |
| Args sent as digest only | args_digest_only | true | Record tool arguments as digests only. |
| Security spans sample rate | security_spans_sample_rate | 1.0 (100%) | Fraction of decision spans captured. |
| Payload sample rate | payload_sample_rate | 0.2 (20%) | Fraction of full-payload spans captured. |
| Dual-export to metrics / Jaeger | dual_export_metrics | true | Also emit to your metrics / tracing pipeline. |
| Red-team containment score | score_red_team | true | Attach redteam_containment to traces. |
| Intent-match score | score_intent | true | Attach intent_match to traces. |
| Latency-budget score | score_latency | true | Attach latency_budget to traces. |
| LLM-as-judge score | score_llm_judge | false | Attach a custom LLM-judge score. |
| AI tool summaries | summary_enabled | false | Upgrade tool summaries to an LLM-written explanation. |
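For teams that keep a copy of these settings in code (for review or drift checks), the defaults above could be mirrored in a small typed structure. This is only a sketch: the field names and defaults come from the table, but the `ObservabilitySettings` class and its `warnings` helper are hypothetical and not part of any platform API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservabilitySettings:
    """Per-workspace settings with the documented defaults."""

    mask_before_ingest: bool = True
    args_digest_only: bool = True
    security_spans_sample_rate: float = 1.0
    payload_sample_rate: float = 0.2
    dual_export_metrics: bool = True
    score_red_team: bool = True
    score_intent: bool = True
    score_latency: bool = True
    score_llm_judge: bool = False
    summary_enabled: bool = False

    def __post_init__(self) -> None:
        # Sample rates are fractions, not percentages.
        for name in ("security_spans_sample_rate", "payload_sample_rate"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0")

    def warnings(self) -> list[str]:
        """Flag choices that depart from the documented recommendations."""
        found = []
        if self.security_spans_sample_rate < 1.0:
            found.append("security spans below 100%: some verdicts may be missed")
        if not self.mask_before_ingest:
            found.append("masking disabled: spans are exported unredacted")
        if not self.args_digest_only:
            found.append("raw tool arguments will be recorded")
        return found


defaults = ObservabilitySettings()
relaxed = ObservabilitySettings(security_spans_sample_rate=0.5)
```

Calling `relaxed.warnings()` returns one entry about missed verdicts, while `defaults.warnings()` returns an empty list.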

Deployment-level export (server environment):

| Variable | Description |
| --- | --- |
| DEEPINTSHIELD_LANGFUSE_ENDPOINT | Base URL of your self-hosted Langfuse instance. |
| DEEPINTSHIELD_LANGFUSE_PUBLIC_KEY / DEEPINTSHIELD_LANGFUSE_SECRET_KEY | Keys for Langfuse enrichment read-back. |
| DEEPINTSHIELD_LANGFUSE_AUTH_BASIC | Pre-encoded Basic auth for Langfuse OTel ingestion. |
| DEEPINTSHIELD_OTLP_ENDPOINT | OTLP/HTTP collector base URL for vendor-neutral fan-out. |
| DEEPINTSHIELD_OTLP_HEADERS | Comma-separated key=value headers for the OTLP exporter. |
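Two of these values have a format worth checking before deployment. The sketch below shows one way to validate a comma-separated `key=value` header string and to produce a Base64 credential from a key pair. It assumes `DEEPINTSHIELD_LANGFUSE_AUTH_BASIC` expects the Base64 encoding of `public_key:secret_key`, as in standard HTTP Basic auth. Whether the variable also needs a `Basic ` prefix is not stated here, and both function names are hypothetical.

```python
import base64


def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict, rejecting malformed pairs."""
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas and empty input
        key, sep, value = pair.partition("=")  # split on the first '=' only
        if not sep or not key.strip():
            raise ValueError(f"malformed header pair: {pair!r}")
        headers[key.strip()] = value.strip()
    return headers


def basic_auth_value(public_key: str, secret_key: str) -> str:
    """Return base64('public_key:secret_key') as used by HTTP Basic auth."""
    credentials = f"{public_key}:{secret_key}".encode("utf-8")
    return base64.b64encode(credentials).decode("ascii")


# Placeholder values only; real keys should come from a secret store.
headers = parse_otlp_headers("x-honeycomb-team=YOUR_API_KEY, x-example-env=staging")
encoded = basic_auth_value("pk-lf-example", "sk-lf-example")
```

Splitting on the first `=` only keeps header values that themselves contain `=` (such as padded Base64 tokens) intact.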

Run traces into your own Langfuse and keep everything in-cluster:

DEEPINTSHIELD_LANGFUSE_ENDPOINT=http://langfuse-web.langfuse.svc.cluster.local:3000
DEEPINTSHIELD_LANGFUSE_PUBLIC_KEY=pk-lf-...
DEEPINTSHIELD_LANGFUSE_SECRET_KEY=sk-lf-...

In Settings, keep security spans at 100%, payloads at 20%, and enable the platform scores you care about. Open any run from Agent Traces to see the timeline plus Langfuse insights.

Fan out to Langfuse and an OTLP backend at once

Send the same spans to Langfuse and to Honeycomb (or a Collector, Tempo, or Datadog OTLP):

DEEPINTSHIELD_LANGFUSE_ENDPOINT=http://langfuse-web.langfuse.svc.cluster.local:3000
DEEPINTSHIELD_OTLP_ENDPOINT=https://api.honeycomb.io
DEEPINTSHIELD_OTLP_HEADERS=x-honeycomb-team=YOUR_API_KEY

Keep traces authoritative locally with no external export - leave both endpoints unset - and confirm in Settings that Mask / redact before ingest and Args sent as digest only are on. You still get the full local timeline, scores, and service graph.