Why LLM observability isn't enough for agents
Standard LLM observability — logging each model call with its prompt, completion, token count, latency, and cost — is necessary but not sufficient for agent systems. An agent is not a single model call. It's a sequence of model calls, tool invocations, conditional routing decisions, state transitions, and (in multi-agent systems) inter-agent handoffs. Observing each LLM call in isolation tells you what happened inside each node; it tells you nothing about why the agent followed the path it did, which node transition caused a quality failure, or where a cost spiral originated.
**The new failure modes that agents introduce require new observability primitives.** Four stand out: tool call failures (the model calls a tool with malformed arguments or the tool returns an error, and the agent either retries or halts), agent loops (the agent calls the same tool or routes to the same node repeatedly without making progress), context overflow (accumulated context exceeds the model's window, causing silent truncation or an explicit refusal), and inter-agent handoff errors (a subagent misinterprets the task instruction it receives from the orchestrator). None of these shows up clearly in per-LLM-call logging, because each is a property of the sequence of steps rather than of any single call. You need trace-level visibility into multi-step execution.
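To make the agent-loop case concrete, here is a minimal sketch of how a loop could be flagged once the steps of a run are available as one ordered trace. The `Step` schema, the `find_loops` helper, and the threshold of three repeats are all assumptions made for illustration, not any platform's format or API.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Step:
    """One step of an agent run (hypothetical schema for illustration)."""
    node: str
    tool: Optional[str] = None
    args: dict[str, Any] = field(default_factory=dict)


def _key(step: Step) -> tuple:
    # Two steps count as identical if node, tool, and arguments all match.
    return (step.node, step.tool, json.dumps(step.args, sort_keys=True))


def find_loops(steps: list[Step], max_repeats: int = 3) -> list[tuple[int, int]]:
    """Return (start_index, run_length) for every run of identical
    consecutive steps whose length is at least max_repeats."""
    loops = []
    run_start = 0
    for i in range(1, len(steps) + 1):
        same = i < len(steps) and _key(steps[i]) == _key(steps[run_start])
        if not same:
            run_len = i - run_start
            if run_len >= max_repeats:
                loops.append((run_start, run_len))
            run_start = i
    return loops


trace = [
    Step("planner"),
    Step("research", tool="search", args={"q": "pricing"}),
    Step("research", tool="search", args={"q": "pricing"}),
    Step("research", tool="search", args={"q": "pricing"}),
    Step("writer"),
]
print(find_loops(trace))  # [(1, 3)]
```

This check only works because the steps are ordered within a single run. With per-call logs alone, the three identical searches would look like three unrelated, individually successful calls.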
The metrics that matter for agents are fundamentally different from the metrics that matter for simple LLM calls. For a single LLM call, you care about latency, token count, and cost. **For an agent, you care about steps to completion** (how many tool calls and model calls did it take to finish the task?), **tool call success rate** (what fraction of tool calls succeeded rather than failing or returning errors?), **cost per task completion** (total cost for the entire task, not per individual call), and **error rate by node** (which specific nodes in your agent graph fail most often, and why?). Platforms that only provide per-call metrics force you to aggregate these task-level metrics yourself.
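If you do have to aggregate these yourself, the calculation might look like the sketch below. It assumes the spans of one task have already been collected into a list. The `Span` fields and the `task_metrics` function are hypothetical names, not taken from any specific SDK.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """One model call or tool call within a task (hypothetical schema)."""
    node: str
    kind: str            # "llm" or "tool"
    cost_usd: float = 0.0
    error: bool = False


def task_metrics(spans: list[Span]) -> dict:
    """Compute the four task-level metrics for a single agent run."""
    tool_spans = [s for s in spans if s.kind == "tool"]
    tool_ok = sum(1 for s in tool_spans if not s.error)

    totals = defaultdict(int)   # spans seen per node
    errors = defaultdict(int)   # failed spans per node
    for s in spans:
        totals[s.node] += 1
        if s.error:
            errors[s.node] += 1

    return {
        # Every model call and tool call counts as one step.
        "steps_to_completion": len(spans),
        # None when the run made no tool calls at all.
        "tool_call_success_rate": tool_ok / len(tool_spans) if tool_spans else None,
        "cost_per_task_usd": sum(s.cost_usd for s in spans),
        "error_rate_by_node": {n: errors[n] / totals[n] for n in totals},
    }


run = [
    Span("planner", "llm", cost_usd=0.002),
    Span("search", "tool", error=True),
    Span("search", "tool"),
    Span("writer", "llm", cost_usd=0.004),
]
metrics = task_metrics(run)
print(metrics["steps_to_completion"])      # 4
print(metrics["tool_call_success_rate"])   # 0.5
print(metrics["error_rate_by_node"])       # {'planner': 0.0, 'search': 0.5, 'writer': 0.0}
```

Note that the failed search is invisible in the final outcome: the retry succeeded, so only the per-node error rate reveals that the `search` node is unreliable.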
Observability platforms in 2026 have largely converged on the same primitives: a trace (one complete agent run, corresponding to one user task), spans (individual steps within the trace), and observations (individual LLM calls, tool calls, or other events within a span). What differs between platforms is how well they surface agent-specific insights on top of those primitives: run tree visualization for graph-based agents, replay capability for debugging failed runs, automated quality scoring for flagging regressions, and cost-per-task aggregation for budget management.
One practical consequence of the distinction: when evaluating observability platforms, don't just ask 'can it log my LLM calls?' Ask: 'can it show me the full execution graph of a failed agent run and tell me which node caused the failure?' That question distinguishes the platforms that understand agents from the ones that bolted agent support onto a per-call logging product.
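As a rough illustration of what answering that question involves, the sketch below walks a run tree to find the path from the root to the deepest failed step. The `Node` type and `failure_path` function are hypothetical, and the sketch assumes that a failure in a child step is also recorded on its parent, which not every tracing setup guarantees.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One step in an agent's execution tree (hypothetical schema)."""
    name: str
    failed: bool = False
    children: list[Node] = field(default_factory=list)


def failure_path(node: Node) -> list[str]:
    """Return the names from this node down to the deepest failed
    descendant on the first failing branch, or [] if the node did not fail.

    Assumes a child's failure is also marked on its parent."""
    if not node.failed:
        return []
    for child in node.children:
        sub_path = failure_path(child)
        if sub_path:
            return [node.name] + sub_path
    # No failed child: this node is where the failure originated.
    return [node.name]


run = Node("run", failed=True, children=[
    Node("planner"),
    Node("research", failed=True, children=[
        Node("search_tool", failed=True),
    ]),
    Node("writer"),
])
print(" > ".join(failure_path(run)))  # run > research > search_tool
```

A per-call logging product has no parent-child links to walk, so this kind of query is only possible when the platform records the tree structure of the run.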
A note on overhead: observability instrumentation itself adds latency and cost. SDK-based platforms (LangSmith, Langfuse, Braintrust) add a background export call on each trace, with typically negligible overhead (under 5 ms per span). Proxy-based platforms (Helicone, Portkey) route every LLM call through their proxy server, which adds 10–50 ms of network latency per call and makes your application dependent on the availability of their infrastructure. For latency-sensitive applications, this distinction matters.
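For an agent, the per-call proxy figure compounds across the run. The arithmetic below is a back-of-the-envelope sketch that takes the 10–50 ms range quoted above at face value and assumes the LLM calls happen sequentially. The function name and the 20-call example are illustrative, not measurements.

```python
from __future__ import annotations


def proxy_overhead_ms(llm_calls: int, per_call_ms: tuple[int, int] = (10, 50)) -> tuple[int, int]:
    """Estimate (low, high) added latency in ms for one agent run,
    assuming every LLM call is sequential and goes through the proxy."""
    low, high = per_call_ms
    return llm_calls * low, llm_calls * high


low, high = proxy_overhead_ms(20)
print(f"{low}-{high} ms added per run")  # 200-1000 ms added per run
```

Under these assumptions, a 20-call agent run picks up between 0.2 and 1 second of added latency, whereas a background export sits off the critical path of the run.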