By The DDH Team · Digital Dashboard Hub

Agent Observability 2026: State of the Market — LangSmith, Langfuse, AgentOps, Helicone, and Beyond



If you're evaluating observability tooling, you probably already know your framework — and the framework largely determines your best observability option. LangSmith's native LangGraph integration is the strongest argument for the LangChain ecosystem. Langfuse's self-hosted open-source model is the strongest argument for compliance-sensitive teams. AgentOps' agent-first design is the strongest argument for teams building dedicated agent products. Before committing to an observability platform, check LangSmith trace quotas and plan limits so you don't hit the free tier ceiling on day one of a production launch.

The market has converged on a shared vocabulary — trace, span, observation, session — borrowed from distributed systems observability and adapted for LLM-specific concerns. But beneath that shared vocabulary are meaningfully different architectures: SDK-based instrumentation vs proxy-based logging, cloud-only vs self-hostable, LangChain-native vs OpenTelemetry-standard. These architectural differences translate directly to tradeoffs in setup friction, data residency, query capability, and integration depth with your existing ML tooling.

This survey covers the six platforms that appear most frequently in production agent stacks in 2026: LangSmith, Langfuse, Helicone, AgentOps, Arize Phoenix, and Braintrust. For the framework-level architecture decisions that determine which observability tool you'll need, see the agent framework decision matrix. For cost modeling your actual agent workload, use the Claude API cost calculator alongside the trace volume estimates in this guide to budget your total observability spend.


Agent observability platforms — feature matrix, June 2026

| Platform | Free tier | Open source | Agent-specific | LangChain native |
| --- | --- | --- | --- | --- |
| LangSmith | 5k traces/month | No (hosted only) | Yes (LangGraph support) | Yes (native) |
| Langfuse | 50k events/month cloud OR self-host free | Yes (AGPL) | Yes (session/trace) | Partial (integration) |
| Helicone | 10k requests/month free | No (but open-source components) | Partial (proxy-based) | Partial |
| AgentOps | Generous free tier | No | Yes (agent-first design) | Partial |
| Arize Phoenix | Self-host free | Yes (open source) | Yes (spans + traces) | Yes |
| Braintrust | Free tier (100k rows) | No | Partial (eval-focused) | Partial |
| OpenAI Platform logs | Included with OpenAI | No | No (LLM only) | No |
| Weave (W&B) | W&B team plan | Partial | Yes | Partial |
| Traceloop (OpenLLMetry) | Free (open source) | Yes (Apache) | Yes (OpenTelemetry) | Yes |
| Portkey | 10k requests free | No | Partial (gateway-based) | Partial |

Sources, fetched 2026-06-21: https://langfuse.com/docs, https://agentops.ai/, https://docs.helicone.ai/, https://docs.smith.langchain.com/, https://arize.com/llm-observability/, https://www.braintrust.dev/

Why LLM observability isn't enough for agents

Standard LLM observability — logging each model call with its prompt, completion, token count, latency, and cost — is necessary but not sufficient for agent systems. An agent is not a single model call. It's a sequence of model calls, tool invocations, conditional routing decisions, state transitions, and (in multi-agent systems) inter-agent handoffs. Observing each LLM call in isolation tells you what happened inside each node; it tells you nothing about why the agent followed the path it did, which node transition caused a quality failure, or where a cost spiral originated.

**The new failure modes that agents introduce require new observability primitives.** Tool call failures (model calls a tool with malformed arguments, tool returns an error, agent either retries or halts), agent loops (agent calls the same tool or routes to the same node repeatedly without progress), context overflow (accumulated context exceeds window, causing silent truncation or explicit refusal), and inter-agent handoff errors (subagent receives and misinterprets its task instruction from the orchestrator) — none of these are visible in per-LLM-call logging. You need trace-level visibility into multi-step execution.

The metrics that matter for agents are fundamentally different from the metrics that matter for simple LLM calls. For a single LLM call, you care about latency and token count. **For an agent, you care about steps to completion** (how many tool calls and model calls did it take to finish the task?), **tool call success rate** (what fraction of tool calls succeeded vs failed or returned errors?), **cost per task completion** (total cost for the entire task, not per individual call), and **error rate by node** (which specific nodes in your agent graph fail most often and why?). Platforms that only provide per-call metrics force you to aggregate these task-level metrics yourself.
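As a rough illustration of that aggregation, the sketch below rolls per-call records up into the four task-level metrics. The `StepRecord` schema and the sample run are hypothetical, not any platform's export format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StepRecord:
    """One recorded step of an agent run (hypothetical schema)."""
    node: str        # graph node or agent step that produced the event
    kind: str        # "llm" or "tool"
    ok: bool         # False if the call errored
    cost_usd: float  # 0.0 for tool calls that cost nothing


def task_metrics(steps: list[StepRecord]) -> dict:
    """Roll per-call records up into the four task-level metrics."""
    tool_calls = [s for s in steps if s.kind == "tool"]
    per_node = defaultdict(lambda: [0, 0])  # node -> [errors, total]
    for s in steps:
        per_node[s.node][1] += 1
        if not s.ok:
            per_node[s.node][0] += 1
    return {
        "steps_to_completion": len(steps),
        "tool_call_success_rate": (
            sum(s.ok for s in tool_calls) / len(tool_calls) if tool_calls else None
        ),
        "cost_per_task_usd": round(sum(s.cost_usd for s in steps), 6),
        "error_rate_by_node": {n: e / t for n, (e, t) in per_node.items()},
    }


# Hypothetical four-step run: one failed tool call that was retried.
run = [
    StepRecord("plan", "llm", True, 0.004),
    StepRecord("search", "tool", False, 0.0),
    StepRecord("search", "tool", True, 0.0),
    StepRecord("answer", "llm", True, 0.006),
]
metrics = task_metrics(run)
```

If your platform already reports these per trace, you don't need this; it only shows what "aggregate it yourself" amounts to.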

The 2026 observability market has largely solved the primitives question: trace (one complete agent run = one user task), spans (individual steps within the trace), and observations (individual LLM calls, tool calls, or other events within a span). The difference between platforms is how well they surface agent-specific insights on top of those primitives — run tree visualization for graph-based agents, replay capability for debugging failed runs, automated quality scoring for flagging regressions, and cost-per-task aggregation for budget management.
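To make the three primitives concrete, here is a minimal sketch of the hierarchy as plain data classes. The class and field names are illustrative assumptions, not the schema of any particular platform.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    kind: str  # "llm", "tool", or "score"
    name: str


@dataclass
class Span:
    name: str  # one step within the run
    observations: list[Observation] = field(default_factory=list)


@dataclass
class Trace:
    task: str  # one user task = one complete agent run
    spans: list[Span] = field(default_factory=list)

    def observation_count(self) -> int:
        """Total leaf events recorded across all spans."""
        return sum(len(s.observations) for s in self.spans)


# Hypothetical run with two steps and three leaf events.
trace = Trace(
    task="summarize quarterly report",
    spans=[
        Span("retrieve", [Observation("tool", "vector_search")]),
        Span("draft", [Observation("llm", "draft_summary"),
                       Observation("score", "relevance")]),
    ],
)
```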

One practical consequence of the distinction: when evaluating observability platforms, don't just ask 'can it log my LLM calls?' Ask: 'can it show me the full execution graph of a failed agent run and tell me which node caused the failure?' That question distinguishes the platforms that understand agents from the ones that bolted agent support onto a per-call logging product.

A note on overhead: observability instrumentation itself adds latency and cost. SDK-based platforms (LangSmith, Langfuse, Braintrust) add a background export call on each trace, which is typically negligible (under 5ms per span). Proxy-based platforms (Helicone, Portkey) route every LLM call through their proxy server, which adds 10-50ms of network latency per call and introduces a dependency on their infrastructure availability. For latency-sensitive applications, this distinction matters.
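A quick back-of-the-envelope calculation shows how the per-call proxy figure compounds over a multi-step task. The 12-call agent is a hypothetical workload, and the sketch assumes the calls run sequentially so the added latency sums.

```python
def added_latency_ms(llm_calls: int, per_call_ms: float) -> float:
    """Total latency added to one task when every LLM call pays per_call_ms."""
    return llm_calls * per_call_ms


calls_per_task = 12  # hypothetical agent: 12 sequential LLM calls per task

proxy_best = added_latency_ms(calls_per_task, 10)   # low end of the 10-50ms range
proxy_worst = added_latency_ms(calls_per_task, 50)  # high end of the 10-50ms range
```

Under these assumptions the proxy adds between 120ms and 600ms per task, which is small next to multi-second model calls but not next to a sub-second latency budget.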


LangSmith: the LangChain and LangGraph native choice

LangSmith is the observability layer built by the LangChain team specifically for LangChain and LangGraph applications. The integration depth is unmatched: LangSmith automatically instruments every LangChain LCEL chain and LangGraph node without any additional code — you set the LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY environment variables, and every run is automatically traced. The run tree visualization maps exactly to your LangGraph graph topology, with per-node latency, token counts, and model parameters visible in a single view. See full documentation at https://docs.smith.langchain.com/.
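A minimal sketch of that setup, using the two variable names mentioned above. The key value is a placeholder, and the `tracing_configured` helper is a hypothetical sanity check, not part of any SDK.

```python
import os

# Variable names are the ones named in the text; the key value is a placeholder.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ.setdefault("LANGCHAIN_API_KEY", "<your-langsmith-api-key>")


def tracing_configured() -> bool:
    """Return True when tracing is switched on and an API key is present."""
    return (
        os.environ.get("LANGCHAIN_TRACING_V2", "").lower() == "true"
        and bool(os.environ.get("LANGCHAIN_API_KEY"))
    )
```

In a real deployment you would normally set these in the process environment or a secrets manager rather than in code; the snippet only shows which values need to be present before the application starts.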

**LangSmith's run tree visualization is the strongest agent debugging tool in the market as of 2026.** When a LangGraph agent fails 8 steps into a 15-step execution, LangSmith shows you exactly which node failed, what input state it received, what tool call it made, and what error it returned — all in a visual graph that matches the LangGraph topology you defined in code. This is not achievable with generic log aggregation or per-call logging. Teams that migrate from print-based debugging to LangSmith consistently report 60-70% reduction in debugging time for complex agent failures.

The evaluation features are substantial: LLM-as-judge evaluators (define a scoring rubric, LangSmith automatically evaluates agent outputs against a test dataset), human annotation (route outputs to a human review queue, collect labels), and dataset versioning (track your golden test set over time and see quality trends as your agent evolves). These are the table-stakes features for running a rigorous prompt engineering loop in production.

**LangSmith's pricing and limitations:** Free tier is 5,000 traces/month, enough for development and small-scale testing but not for production traffic. The Plus plan at $39/seat/month includes 10,000 traces/month plus overage pricing. There is no self-hosted option: all trace data is sent to LangSmith's cloud infrastructure, which is a blocker for teams with data sovereignty requirements. The 14-day retention on the free tier means you can't retrospectively analyze failures that occurred more than 2 weeks ago without upgrading.

For teams not using LangChain or LangGraph, LangSmith's value proposition drops significantly. The auto-instrumentation magic only applies to LangChain primitives. For Pydantic AI, AutoGen, or raw API calls, you need to manually wrap your calls with LangSmith's run tree API — it works, but the integration depth is not comparable to the LangGraph native experience. Evaluate LangSmith against Langfuse or AgentOps if your framework isn't LangChain.

Our recommendation: if your production agent runs on LangGraph, LangSmith is not optional — it's the debugging tool that makes LangGraph development tractable at scale. The 5k free trace limit is hit within days on any meaningful production traffic, so budget for the Plus plan from day one.


Langfuse: the open-source power user pick

Langfuse is the most frequently recommended alternative to LangSmith for teams with compliance requirements, high trace volume, or a preference for self-hosted infrastructure. The AGPL self-hosted version provides unlimited traces at no per-trace cost: you pay only for the infrastructure you run it on (a single $20/month VPS can handle moderate-volume production traffic). The Cloud free tier provides 50,000 events/month, nominally 10× LangSmith's 5,000 traces/month, though events and traces are different billing units, so compare the two against your own trace shape rather than by headline number. Full documentation at https://langfuse.com/docs.

**Langfuse's hierarchy of primitives — trace → span → observation — maps cleanly onto agent execution.** A trace represents one user task (one agent run). Spans represent sub-units of that run (individual agent steps, sub-agent invocations). Observations are the leaf-level events (LLM calls, tool calls, scores). This hierarchy enables the same run-tree visualization as LangSmith, and because it's based on an open schema, you can write custom instrumentation for any framework — Pydantic AI, AutoGen, raw API calls — without being limited to LangChain primitives.

Langfuse's prompt management feature deserves specific mention. It provides a version-controlled registry for your production prompts: you deploy prompt versions from Langfuse's UI, your application fetches the current version at runtime, and Langfuse tracks which prompt version was used in every trace. This makes A/B testing prompt changes in production possible without code deploys. In 2026, when prompt engineering is a continuous process rather than a one-time setup, this feature is high-value.
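The pattern itself is simple enough to sketch in memory. The `PromptRegistry` class below is a hypothetical stand-in for a hosted registry and does not reflect Langfuse's SDK; it only illustrates publish, promote, fetch-at-runtime, and recording the version on the trace.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """In-memory stand-in for a hosted prompt registry (hypothetical)."""
    versions: dict[str, list[str]] = field(default_factory=dict)
    production: dict[str, int] = field(default_factory=dict)

    def publish(self, name: str, text: str) -> int:
        """Store a new version and return its 1-based version number."""
        self.versions.setdefault(name, []).append(text)
        return len(self.versions[name])

    def promote(self, name: str, version: int) -> None:
        """Mark an existing version as the one production should fetch."""
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise ValueError(f"unknown version {version} for prompt {name!r}")
        self.production[name] = version

    def get_production(self, name: str) -> tuple[int, str]:
        """Return (version, template) for the current production prompt."""
        v = self.production[name]
        return v, self.versions[name][v - 1]


registry = PromptRegistry()
registry.publish("summarize", "Summarize: {text}")
v2 = registry.publish("summarize", "Summarize in three bullets: {text}")
registry.promote("summarize", v2)

# At runtime the application fetches the live version and tags the trace with it.
version, template = registry.get_production("summarize")
trace_metadata = {"prompt_name": "summarize", "prompt_version": version}
```

Because the version travels with the trace, quality scores can later be grouped by prompt version without a code deploy in between.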

**LLM-as-judge scoring in Langfuse** works by defining a scoring function (a prompt + model combination that evaluates an output) and attaching it to traces automatically. You can score 100% of production traces for simple quality signals (is this answer relevant? is this summary accurate?) or run targeted evaluations on a sample. The scoring results are queryable via Langfuse's API and appear in the trace view alongside the raw output.
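A scoring function of this kind reduces to a rubric prompt plus a model call plus a parser. The sketch below uses a stub in place of the judge model so it runs offline; `RUBRIC`, `judge`, and `fake_model` are illustrative names, and a real setup would pass a function that calls your chosen model.

```python
from statistics import mean
from typing import Callable

RUBRIC = (
    "Given the task input and the agent output, rate the output from 1 to 5 "
    "for relevance. Reply with a single digit.\n\nInput: {task}\nOutput: {output}"
)


def parse_score(reply: str) -> int:
    """Extract the first digit 1-5 from the judge's reply."""
    for ch in reply:
        if ch in "12345":
            return int(ch)
    raise ValueError(f"no score found in judge reply: {reply!r}")


def judge(task: str, output: str, call_model: Callable[[str], str]) -> int:
    """Score one output by sending the filled-in rubric to the judge model."""
    return parse_score(call_model(RUBRIC.format(task=task, output=output)))


def fake_model(prompt: str) -> str:
    """Stub judge for the demo: rewards outputs that mention 'refund'."""
    return "5" if "refund" in prompt.split("Output:")[1] else "2"


samples = [
    ("How do I get my money back?", "Open Orders and request a refund."),
    ("How do I get my money back?", "Our office is open 9 to 5."),
]
scores = [judge(task, output, fake_model) for task, output in samples]
average = mean(scores)
```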

The LangChain integration is via Langfuse's callback handler: you pass the Langfuse CallbackHandler to your LangChain/LangGraph run and traces appear in Langfuse. This works well but is one layer of abstraction removed from LangSmith's native integration. For LangGraph specifically, LangSmith's run tree visualization is more detailed because it has access to LangGraph's internal graph structure. If you're choosing between them for a LangGraph application, LangSmith wins on integration depth; Langfuse wins on compliance, self-host, and volume.

Dataset versioning and evaluation pipelines in Langfuse support the full quality improvement loop: collect a golden test dataset, run your agent against it on schedule, track quality scores over time, and alert when a score drops below a threshold. This is the operational backbone of a production agent quality program. Most teams that take agent quality seriously end up building this infrastructure — Langfuse provides it out of the box.


Helicone: the proxy-first simplicity play

Helicone's architectural premise is different from every other platform in this matrix: instead of instrumenting your code with an SDK, you route your LLM API calls through Helicone's proxy server. Change one line of code — your API base URL — and you immediately get logging, cost tracking, latency monitoring, and rate limiting for every LLM call, with zero SDK installation and zero code changes to your existing call patterns. For teams that want immediate production visibility with minimal engineering investment, this is genuinely appealing. Full documentation at https://docs.helicone.ai/.
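In configuration terms, the switch looks roughly like the sketch below. The proxy URL and the `Helicone-Auth` header are assumptions based on Helicone's OpenAI integration and should be checked against https://docs.helicone.ai/ before use; `client_config` is a hypothetical helper that only builds settings and makes no network call.

```python
from typing import Optional

DIRECT_BASE_URL = "https://api.openai.com/v1"
PROXY_BASE_URL = "https://oai.helicone.ai/v1"  # assumption: verify in Helicone's docs


def client_config(provider_key: str, helicone_key: Optional[str] = None) -> dict:
    """Build base URL and headers for a direct or a proxied client."""
    headers = {"Authorization": f"Bearer {provider_key}"}
    if helicone_key is None:
        return {"base_url": DIRECT_BASE_URL, "headers": headers}
    # Assumption: the proxy identifies your account via this extra header.
    headers["Helicone-Auth"] = f"Bearer {helicone_key}"
    return {"base_url": PROXY_BASE_URL, "headers": headers}


direct = client_config("sk-placeholder")
proxied = client_config("sk-placeholder", helicone_key="hk-placeholder")
```

The request bodies and the rest of the calling code stay the same in both cases, which is what makes the proxy route quick to adopt.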

**The zero-code-change value proposition is real and fast.** Most teams can get Helicone logging production calls within an hour of deciding to try it. The dashboard shows cost per API key, cost per model, latency distributions, error rates, and a log of every prompt and completion. For a team that currently has zero observability into their LLM spend, this is a high-value upgrade from nothing.

The free tier is 10,000 requests/month — reasonable for development but tight for production. Paid plans start at $20/month for 100,000 requests. Helicone also provides caching (returns cached responses for identical prompts, reducing cost and latency for repetitive queries), rate limiting (protect against abuse or runaway agent loops), and prompt templates (version and track prompts through the proxy layer).

**The proxy architecture has real tradeoffs.** First, latency: every LLM call routes through Helicone's servers, adding 10-50ms of network latency. For applications where LLM calls are already the dominant latency (2-5 seconds), this is often negligible. For applications with strict latency requirements (under 500ms), it may be a concern. Second, availability: your production LLM calls have a dependency on Helicone's proxy availability. If Helicone experiences an outage, your LLM calls fail unless you've implemented a fallback bypass. Third, agent-specific visibility: because Helicone observes individual LLM calls (not agent execution graphs), it doesn't provide the run-tree visualization that makes LangSmith or Langfuse valuable for debugging complex agent failures.
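One way to implement the fallback bypass mentioned above is to wrap the call so that a connection-level failure on the proxy route retries directly against the provider. This is a sketch under stated assumptions: `via_proxy` and `direct` are hypothetical callables you supply, and the stubs below simulate an outage rather than making network calls.

```python
from typing import Callable


def call_with_fallback(
    prompt: str,
    via_proxy: Callable[[str], str],
    direct: Callable[[str], str],
) -> tuple[str, str]:
    """Try the proxy route first; on a connection-level failure, go direct."""
    try:
        return via_proxy(prompt), "proxy"
    except (ConnectionError, TimeoutError):
        # Calls made on this path are not logged by the proxy.
        return direct(prompt), "direct"


def broken_proxy(prompt: str) -> str:
    """Stub that simulates a proxy outage."""
    raise ConnectionError("proxy unreachable")


def provider(prompt: str) -> str:
    """Stub standing in for a direct provider call."""
    return f"echo: {prompt}"


reply, route = call_with_fallback("ping", broken_proxy, provider)
```

The tradeoff is a gap in your logs for the duration of the outage, so it is worth counting how often the direct route is taken.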

Helicone is best positioned as a complement to a full observability platform, not a replacement. Use Helicone for immediate cost monitoring and caching across all your LLM calls, and use LangSmith or Langfuse for deep agent debugging. The two aren't mutually exclusive — running both (Helicone as the proxy cost layer, LangSmith as the agent trace layer) is a valid production stack that avoids the per-call overhead of logging every call twice through an SDK.

In 2026, Helicone has added several gateway-level features (model load balancing, automatic fallback between providers, A/B routing between model versions) that push it toward the 'AI gateway' category rather than pure observability. These gateway features are valuable for teams running multi-model architectures (routing between Claude and GPT-5 based on cost or availability), but they also increase the operational dependency on Helicone's availability.


AgentOps: designed specifically for agents

AgentOps distinguishes itself from every other platform in this matrix with a simple claim: it was designed from the ground up for agents, not retrofitted from an LLM logging product. The core primitive is the agent session — a complete run of an agent system from task initiation to completion, with all LLM calls, tool calls, and agent events recorded as first-class objects within that session. For teams building products where agent runs are the primary unit of value (autonomous task completion, long-horizon research, code generation), this session-first design matches the mental model of the product. See https://agentops.ai/ for documentation.

**AgentOps' replay feature is unique in the market.** For any recorded agent session, you can replay the entire execution — stepping through each agent decision, tool call, and LLM response in sequence — to understand exactly how the agent reached a particular state. This is the observability equivalent of a debugger with time-travel capability. For debugging complex failure modes where the failure is a consequence of multiple prior decisions, replay is irreplaceable.

Cost tracking in AgentOps is aggregated at the session level: each agent run shows its total cost, broken down by LLM call, with the ability to filter by agent type, task type, and time period. This session-level cost aggregation maps directly to the business question most agent teams care about: 'how much does it cost us per completed task?' rather than 'how much did this individual LLM call cost?'

**Error tracking in AgentOps captures agent-specific failure events** that generic logging tools miss: tool call failures (including the error message and the agent state at the time of failure), LLM refusals (when the model declines to execute a tool call or follow an instruction), and infinite loops (detected automatically via repeated identical tool calls within a session). These are the failure patterns that are invisible in per-call logging but are the most common causes of agent production incidents.
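The loop heuristic described here, repeated identical tool calls within a session, can be sketched in a few lines. This is an illustration of the idea, not AgentOps' implementation; the threshold of three repeats is an arbitrary assumption.

```python
import json
from collections import Counter
from typing import Optional


def detect_loop(
    tool_calls: list[tuple[str, dict]], threshold: int = 3
) -> Optional[tuple[str, dict]]:
    """Return the first (tool, args) pair seen `threshold` times, else None."""
    seen: Counter = Counter()
    for name, args in tool_calls:
        # Serialize arguments with sorted keys so equal dicts compare equal.
        key = (name, json.dumps(args, sort_keys=True))
        seen[key] += 1
        if seen[key] >= threshold:
            return name, args
    return None


# Hypothetical session: the same search is issued three times.
session = [
    ("search", {"q": "agent observability"}),
    ("fetch", {"url": "https://example.com"}),
    ("search", {"q": "agent observability"}),
    ("search", {"q": "agent observability"}),
]
looping_call = detect_loop(session)
```

A check like this can also run inside the agent itself to halt a run early, rather than only flagging it after the fact.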

The generous free tier makes AgentOps accessible for teams earlier in their development cycle than LangSmith or Langfuse — you won't hit the ceiling during development or early-stage production. The paid tiers add higher retention, more sessions per month, and team collaboration features. One limitation: AgentOps is not open-source, which rules it out for teams with strict data residency or self-hosted requirements.

Our recommendation: if your product is an agent product — the core value your users experience is the result of agent task completion — AgentOps' session-first design and replay capability are worth evaluating seriously. It pairs well with Helicone (Helicone for cost gateway, AgentOps for agent session replay and error tracking) for teams that want coverage at both the call level and the session level.


Arize Phoenix: open-source with enterprise backing

Arize Phoenix is the open-source observability tool from Arize AI, a company with deep roots in traditional ML observability. Phoenix brings a machine learning engineer's perspective to LLM observability: spans and traces are implemented via the OpenInference specification (an OpenTelemetry extension for LLM/agent observability), making Phoenix interoperable with any OpenTelemetry-compatible backend. This standards-based approach is Phoenix's strongest differentiator — your instrumentation is not vendor-locked. See documentation and source at https://arize.com/llm-observability/.

**Phoenix is fully self-hostable at no cost** (Apache 2.0 license), with Arize's hosted SaaS tier available for enterprises that want managed infrastructure and additional ML monitoring capabilities. The self-hosted path is genuinely straightforward — a single Docker command launches a Phoenix server with a local SQLite store, suitable for development and low-volume production. For high-volume production, Phoenix supports PostgreSQL as the backend store.

The LlamaIndex integration is Phoenix's tightest framework binding — Arize and LlamaIndex collaborated on the OpenInference spec, and Phoenix receives richer instrumentation data from LlamaIndex RAG pipelines than most other platforms. For teams running LlamaIndex-based retrieval-augmented agents, Phoenix is the natural observability choice.

In 2026, Phoenix significantly expanded its agent-specific tracing support. Multi-agent spans (each agent invocation as a distinct span within the trace), tool call visualization (tool inputs, outputs, and errors visible inline in the trace), and agent graph reconstruction (inferring the agent's execution graph from span parentage) are all available in the current release. The agent graph reconstruction is less polished than LangSmith's native LangGraph visualization, but it's impressive given that it is extracted from standard OpenTelemetry spans.
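Reconstructing an execution graph from span parentage is, at its core, a parent-to-children grouping. The sketch below shows that idea on hypothetical span records with `span_id`, `parent_id`, and `name` fields; it is not Phoenix's code and ignores timing and attributes.

```python
from collections import defaultdict
from typing import Optional


def build_tree(spans: list[dict]) -> dict:
    """Group span ids by parent id; root spans have parent_id None."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span["span_id"])
    return dict(children)


def render(children: dict, names: dict, node: Optional[str] = None,
           depth: int = 0) -> list[str]:
    """Return an indented outline of the tree, depth-first."""
    lines = []
    for child in children.get(node, []):
        lines.append("  " * depth + names[child])
        lines.extend(render(children, names, child, depth + 1))
    return lines


# Hypothetical multi-agent run: an orchestrator with two sub-agents.
spans = [
    {"span_id": "a", "parent_id": None, "name": "orchestrator"},
    {"span_id": "b", "parent_id": "a", "name": "researcher"},
    {"span_id": "c", "parent_id": "b", "name": "tool:web_search"},
    {"span_id": "d", "parent_id": "a", "name": "writer"},
]
tree = build_tree(spans)
names = {s["span_id"]: s["name"] for s in spans}
outline = render(tree, names)
```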

**OpenInference as a standard is Phoenix's long-term bet.** If OpenInference gains broad adoption (it already has integrations in LlamaIndex, LangChain, AutoGen, and several commercial platforms), teams that instrument once in OpenInference format can switch between Phoenix, Langfuse (via the OpenTelemetry bridge), and Traceloop without re-instrumenting. For teams that prioritize future portability over immediate integration depth, this standards-based approach is architecturally sound.

Phoenix is most compelling for teams that: (a) have compliance requirements requiring self-hosted infrastructure, (b) are already using Arize for traditional ML model monitoring and want to extend that into LLM/agent monitoring, (c) are running LlamaIndex-based systems, or (d) prioritize OpenTelemetry interoperability for future portability. It is less compelling for LangChain/LangGraph teams where LangSmith's integration depth is a clear winner.


Braintrust: evaluation-first observability

Braintrust occupies a distinct position in the observability market: it is primarily an evaluation and experimentation platform that includes production monitoring, not primarily a monitoring platform that includes evaluation. This framing matters because it shapes the feature set and the ideal user profile. Teams that run rapid prompt iteration cycles — changing prompts frequently, measuring quality impact, and shipping improvements on a tight loop — will find Braintrust's design philosophy deeply congruent. Teams whose primary need is production incident response and debugging will find other platforms better suited. See https://www.braintrust.dev/.

**Braintrust's core workflow is: dataset → experiment → scores → production.** You maintain a golden test dataset (input + expected output pairs, from real user queries), run experiments (different prompts, models, or agent configurations against the dataset), collect scores (from LLM-as-judge, human annotation, or automated metrics), and compare results across experiments. When an experiment beats the current production baseline, you promote it. This is the right workflow for teams that treat prompt engineering as a continuous improvement discipline.
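The dataset, experiment, scores, promote loop can be sketched without any platform at all. Everything below is illustrative: the three-row dataset, the `exact_match` scorer, and the two candidate functions are hypothetical stand-ins for real prompts or agent configurations, and none of it uses Braintrust's SDK.

```python
from statistics import mean
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """Automated metric: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(
    dataset: list[tuple[str, str]],
    candidate: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> float:
    """Average score of one candidate over the golden dataset."""
    return mean(scorer(candidate(x), expected) for x, expected in dataset)


dataset = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]


def baseline(x: str) -> str:
    """Stand-in for the current production configuration."""
    return {"2+2": "4", "capital of France": "Lyon"}.get(x, "")


def challenger(x: str) -> str:
    """Stand-in for the new prompt or model under test."""
    return {"2+2": "4", "capital of France": "paris", "3*3": "9"}.get(x, "")


baseline_score = run_experiment(dataset, baseline, exact_match)
challenger_score = run_experiment(dataset, challenger, exact_match)
promote = challenger_score > baseline_score
```

In practice you would also want a minimum margin or a significance check before promoting, since small datasets make score differences noisy.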

The free tier provides 100,000 rows — generous enough for substantial experimentation work. The dataset versioning system tracks the evolution of your test set alongside the evolution of your prompts, so you can see whether a quality improvement was genuine or just an artifact of your test set changing. This longitudinal tracking of the evaluation process itself is uncommon in the market.

**Braintrust's production monitoring is real** — logs of every production inference, latency and cost metrics, and the ability to sample production outputs into your evaluation dataset for continuous quality monitoring. But the interface is optimized around experiments, not real-time alerts. If your primary concern is catching production incidents fast (a 10-minute loop on agent error rates), LangSmith or AgentOps will serve you better. If your primary concern is steadily improving agent quality over weeks and months, Braintrust's experiment-first design is more aligned.

LLM-as-judge in Braintrust is well-developed — you can define custom evaluators as prompts (which model call do you make to score an output?) and attach them to experiments. Braintrust automatically handles the logistics of running the scorer at scale, aggregating scores, and displaying distributions. For teams that want to catch quality regressions before they hit production (by running evals on every code commit), Braintrust's CI/CD integration (via its SDK and a GitHub Action) makes this straightforward.

One practical note on Braintrust vs LangSmith for LangGraph teams: LangSmith's run tree visualization is significantly better for real-time agent debugging. Braintrust's experiment tracking is significantly better for longitudinal quality improvement. Several mature teams run both — LangSmith for production monitoring and incident response, Braintrust for systematic quality improvement experiments. The two tooling philosophies are complementary, not competitive.


2026 market trends: what's changing in agent observability

**OpenTelemetry standardization is the most significant structural trend in agent observability in 2026.** The OpenInference specification (an OpenTelemetry extension for LLM and agent traces, maintained by Arize and now contributed to by LlamaIndex, LangChain, and others) is gaining broad adoption. Traceloop's OpenLLMetry, Arize Phoenix, and several commercial platforms now support it. The implication: instrumentation written against the OpenTelemetry standard can be exported to multiple backends without re-instrumentation. Teams that instrument in OpenInference format today are not locked into a single observability vendor.

**LLM-as-judge has become table-stakes for production agent quality programs.** In 2025, LLM-as-judge was a novel technique described in research papers. In 2026, it's the default automated evaluation mechanism in LangSmith, Langfuse, Braintrust, and AgentOps. The market has effectively agreed that automated scoring of agent outputs against quality rubrics is not optional for teams that want to catch regressions before users do. The remaining competition is on the quality and customizability of the scoring templates.

Cost-per-task is emerging as the primary business metric for agent observability, displacing per-call cost and per-call latency as the metrics that matter most to engineering and product leadership. An agent that completes tasks in 8 steps is measurably better than one that completes the same tasks in 15 steps — both in direct cost and in latency. Platforms that aggregate cost and latency to the task level (AgentOps, LangSmith with run trees) are better positioned for 2026's business requirements than platforms that only surface per-call metrics.

**Multi-model tracing** — agents that mix Claude, GPT-5, and Gemini calls within a single task — is a production reality for cost-optimizing teams in 2026. Most observability platforms now support multi-provider logging within a single trace. The quality of that cross-provider support varies: LangSmith and Langfuse handle it well because their abstractions are model-agnostic. Helicone handles it well because the proxy layer normalizes all providers. OpenAI Platform logs obviously cannot.

Self-host vs SaaS tension is growing as enterprise teams grapple with data sovereignty and AI governance requirements. Sending every production LLM call (including all user inputs and model outputs) to a third-party observability service raises compliance questions under GDPR, HIPAA, and emerging AI governance frameworks. Langfuse (AGPL self-host), Arize Phoenix (Apache self-host), and Traceloop (Apache self-host) are benefiting from this tension. LangSmith (no self-host option) and hosted-only platforms face this headwind as their enterprise sales cycles deepen.

**Convergence on a standard hierarchy** (session → trace → span → observation) is simplifying multi-platform evaluation. Teams running multiple observability tools (one for production monitoring, one for evals) can map between them using this shared vocabulary. Langfuse and Phoenix both implement this hierarchy; LangSmith uses run trees but the mapping is clean. This convergence makes it feasible to pipe data from one platform to another — for example, sampling production traces from LangSmith into a Braintrust evaluation dataset — which is a real workflow for mature teams.

Setting up production agent observability from scratch


    Step 1: Pick your platform tier

    Match your primary constraint to the right platform before evaluating features. LangChain or LangGraph stack → start with LangSmith (deepest integration, best run tree visualization, see https://docs.smith.langchain.com/). Compliance, high volume, or self-hosted requirement → Langfuse (AGPL, unlimited on self-hosted, see https://langfuse.com/docs). Zero-code-change for immediate visibility → Helicone proxy (one-line setup, see https://docs.helicone.ai/). Building an agent product where sessions are the unit of value → AgentOps (agent-first design, session replay, see https://agentops.ai/). Eval-first quality program → Braintrust. OpenTelemetry / interoperability requirement → Arize Phoenix. Don't add a second platform until you've saturated the value of the first.


    Step 2: Instrument at the right granularity

    Define your hierarchy before you write a line of instrumentation code: trace = one user task (one complete agent run); span = one sub-unit of that task (one agent step, one subgraph execution); observation = one leaf event (one LLM call, one tool call, one score). Don't over-instrument — tracing every internal Python function call inflates trace size, exhausts your free tier quota in days, and makes traces unreadable. The right granularity is the level at which you'd want to start debugging a failure. For most agents, that's the agent step (span level) and the LLM call (observation level) — not every internal function.


    Step 3: Define your key agent metrics from day one

    Before you start collecting data, define the four metrics you'll use to measure agent health: (1) steps to completion — how many tool calls and model calls per completed task; (2) tool call success rate — what fraction of tool calls succeed vs return errors; (3) cost per completed task — total token cost for the full trace; (4) error rate by node — which specific steps in your agent graph fail most often. Set baseline targets for each before launch so you have a reference line for detecting regressions. Without pre-defined targets, anomaly detection is guesswork.


    Step 4: Set up at least one automated evaluation

    Build an LLM-as-judge evaluator on a 20-example golden set and run it at least weekly. The evaluator needs two things: a scoring prompt ('Given this task input and agent output, rate the output quality from 1-5 on this rubric: ...') and a model (Sonnet 4.6 at $3/M input is cost-effective for scoring). Also run it automatically on every code merge to catch quality regressions before production. A quality score trend line, even a rough weekly average, gives you early warning of degradation before users report it. This single evaluation is worth more than any other observability investment for catching problems early.


    Step 5: Build a cost alerting loop

    Calculate: cost per task × daily task volume = daily LLM spend baseline. Set an alert at 2× your expected baseline cost per day. Agent loops, unexpected multi-agent spawning, and model routing failures all appear as sudden cost spikes before they appear as user complaints — cost is a leading indicator for agent failures, not a lagging one. In LangSmith, use the cost tracking in run trees. In Langfuse, use the cost aggregation API. In AgentOps, the cost-per-session dashboard provides this natively. However you implement it, the alert must fire before you hit your monthly API budget cap — not after.
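The baseline and the 2× threshold from Step 5 come down to two lines of arithmetic. The sketch below uses hypothetical figures ($0.05 per task, 2,000 tasks per day); the function names are illustrative, and the actual spend figure would come from whichever platform's cost aggregation you use.

```python
def daily_baseline_usd(cost_per_task_usd: float, tasks_per_day: int) -> float:
    """Expected daily LLM spend: cost per task times daily task volume."""
    return cost_per_task_usd * tasks_per_day


def should_alert(spend_today_usd: float, baseline_usd: float,
                 multiplier: float = 2.0) -> bool:
    """Fire when today's spend reaches the multiple of the expected baseline."""
    return spend_today_usd >= multiplier * baseline_usd


# Hypothetical workload: $0.05 per task, 2,000 tasks per day.
baseline = daily_baseline_usd(cost_per_task_usd=0.05, tasks_per_day=2_000)

normal_day = should_alert(150.0, baseline)  # below 2x the baseline
spike_day = should_alert(250.0, baseline)   # above 2x the baseline
```

Evaluating the check on running spend during the day, rather than once at midnight, is what lets the alert fire while a runaway loop is still in progress.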

Frequently Asked Questions

What is the best agent observability tool in 2026?

It depends on your framework and priorities. For LangGraph-based agents, LangSmith is the strongest choice: its native LangGraph integration provides run tree visualization that no other tool matches. For teams with compliance requirements or high trace volume, Langfuse self-hosted (MIT-licensed core, unlimited traces) is the most cost-effective option. For teams building agent-first products where agent sessions are the primary product unit, AgentOps' session-first design and replay capability are unique. For teams running a rigorous prompt improvement loop (eval-driven development), Braintrust is the most aligned tool. There is no single best answer, so match the platform to your primary constraint.

Is Langfuse really free for self-hosting?

Yes. The self-hosted version of Langfuse (via Docker Compose or Kubernetes) is free, with no trace volume limit. The core is MIT-licensed; only the features in the repository's enterprise directories carry a separate commercial license. You pay only for the infrastructure you run it on. A single VPS at $20-40/month can handle moderate production traffic. The Cloud free tier is also generous at 50,000 events/month (10× LangSmith's free tier). Paid cloud tiers start at approximately $59/month for higher volume, longer retention, and SSO. Because MIT is a permissive license, modifying Langfuse for an internal deployment carries no obligation to publish your changes.

How does Helicone differ from LangSmith in practice?

The architectural difference is fundamental: Helicone is proxy-based (you route your LLM API calls through their proxy, with no SDK and no changes to your agent logic), while LangSmith is SDK-based (you instrument your LangChain/LangGraph code with LangSmith's callback system). In practice, Helicone gets you logging in under an hour by changing only the client's base URL and adding an auth header; LangSmith gets you deep agent trace visualization that Helicone can't match. Helicone adds 10-50ms of proxy latency per LLM call; LangSmith adds negligible background export overhead. For simple per-call logging and cost tracking, Helicone wins on setup speed. For complex agent debugging and run tree visualization, LangSmith wins on capability.
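
To make the proxy model concrete, the sketch below builds the client settings that route OpenAI-style calls through Helicone. It assumes the gateway URL (`https://oai.helicone.ai/v1`) and the `Helicone-Auth` header from Helicone's OpenAI integration docs; `helicone_client_kwargs` is a hypothetical helper, and both values should be confirmed against the current docs before you rely on them.

```python
def helicone_client_kwargs(openai_api_key: str, helicone_api_key: str) -> dict:
    """Build keyword arguments for an OpenAI-compatible client routed via Helicone.

    Assumption: the gateway URL and header name follow Helicone's OpenAI
    integration docs; verify them before use.
    """
    return {
        "api_key": openai_api_key,
        "base_url": "https://oai.helicone.ai/v1",  # proxy instead of the provider URL
        "default_headers": {"Helicone-Auth": f"Bearer {helicone_api_key}"},
    }


kwargs = helicone_client_kwargs("sk-placeholder", "sk-helicone-placeholder")
# With the openai package installed: client = OpenAI(**kwargs)
print(kwargs["base_url"])
```

The agent logic itself is untouched; only the client configuration changes, which is why proxy-based setup is fast.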

What metrics should I track for agents that I wouldn't track for a simple LLM chatbot?

Four agent-specific metrics beyond standard per-call latency and token count: (1) steps to completion — how many tool calls and model calls per completed task (rising steps = quality regression or prompt drift); (2) tool call success rate — fraction of tool calls that succeed vs return errors or malformed arguments; (3) cost per completed task — total cost for the full trace, not per-call; (4) error rate by node — which nodes in your agent graph fail most often (identifies where to focus debugging and testing). These task-level metrics are invisible in per-call logging and are the most actionable signals for improving production agent quality.
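
As a rough sketch, the four metrics above can be computed from exported trace records. The record shape here (a `completed` flag plus a list of observations with `type`, `node`, `ok`, and `cost_usd`) is a hypothetical export format, not any platform's schema, and cost per completed task is taken as the mean trace cost over completed tasks.

```python
from collections import defaultdict


def agent_metrics(traces: list[dict]) -> dict:
    """Compute the four task-level metrics from exported trace records."""
    completed = [t for t in traces if t["completed"]]
    steps = [len(t["observations"]) for t in completed]
    costs = [sum(o["cost_usd"] for o in t["observations"]) for t in completed]

    tool_calls = [o for t in traces for o in t["observations"] if o["type"] == "tool"]
    tool_ok = sum(1 for o in tool_calls if o["ok"])

    per_node = defaultdict(lambda: [0, 0])  # node -> [errors, total]
    for t in traces:
        for o in t["observations"]:
            per_node[o["node"]][1] += 1
            if not o["ok"]:
                per_node[o["node"]][0] += 1

    return {
        "steps_to_completion": sum(steps) / len(steps) if steps else None,
        "tool_call_success_rate": tool_ok / len(tool_calls) if tool_calls else None,
        "cost_per_completed_task": sum(costs) / len(costs) if costs else None,
        "error_rate_by_node": {n: e / total for n, (e, total) in per_node.items()},
    }


# Hypothetical sample: one completed run and one run that failed at the search tool.
traces = [
    {"completed": True, "observations": [
        {"type": "llm", "node": "plan", "ok": True, "cost_usd": 0.01},
        {"type": "tool", "node": "search", "ok": True, "cost_usd": 0.0},
        {"type": "llm", "node": "answer", "ok": True, "cost_usd": 0.02},
    ]},
    {"completed": False, "observations": [
        {"type": "llm", "node": "plan", "ok": True, "cost_usd": 0.01},
        {"type": "tool", "node": "search", "ok": False, "cost_usd": 0.0},
    ]},
]
metrics = agent_metrics(traces)
print(metrics)
```

Note that none of these values can be derived from a single LLM call record; each one needs the full trace as the unit of aggregation.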

Does Arize Phoenix require Arize's hosted service?

No. Phoenix is self-hostable with a single Docker command, and its source code is public. Note that the Phoenix server is published under the Elastic License 2.0 (source-available) rather than Apache 2.0, while the OpenInference instrumentation libraries are Apache 2.0, so check the license terms if you plan to offer Phoenix to third parties as a managed service. Arize's hosted tier (arize.com) is entirely optional and adds enterprise features like managed infrastructure, team collaboration, and traditional ML model monitoring alongside LLM monitoring. The self-hosted Phoenix server supports local SQLite for development and PostgreSQL for production. For teams with data sovereignty requirements, the fully self-hosted path requires no communication with Arize's cloud infrastructure.

What is OpenInference and why should I care?

OpenInference is an OpenTelemetry extension standard for LLM and agent observability traces, maintained by Arize and contributed to by LlamaIndex, LangChain, and other ecosystem players. It defines a standard schema for spans (LLM calls, embedding calls, retrieval calls, agent steps) that any compliant backend can ingest. Why it matters: if you instrument your agent in OpenInference format, you can export traces to Phoenix, Langfuse (via the OpenTelemetry bridge), Traceloop, or any future OpenTelemetry-compatible backend without re-instrumenting. It reduces vendor lock-in risk for teams that instrument now and may change observability providers later.
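
For a sense of what the shared schema looks like, the sketch below builds the flat attribute dictionary for one LLM span. The attribute keys are written from the OpenInference semantic conventions as best recalled and should be checked against the current spec; `llm_span_attributes` and the example values are hypothetical.

```python
def llm_span_attributes(model: str, prompt: str, completion: str,
                        prompt_tokens: int, completion_tokens: int) -> dict:
    """Return a flat attribute dict in the style of an OpenInference LLM span.

    Assumption: key names follow the OpenInference semantic conventions;
    verify them against the published spec before relying on them.
    """
    return {
        "openinference.span.kind": "LLM",
        "llm.model_name": model,
        "input.value": prompt,
        "output.value": completion,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
    }


attrs = llm_span_attributes("example-model", "What is 2+2?", "4", 12, 1)
# With the OpenTelemetry SDK, these would be attached via span.set_attributes(attrs).
print(attrs["llm.token_count.total"])
```

Because the attributes are plain OpenTelemetry span attributes, any backend that understands the convention can read them without a vendor-specific SDK.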

Can I use multiple observability tools at the same time?

Yes. Most agent frameworks support multiple tracing backends simultaneously. In LangChain/LangGraph, you can pass multiple callbacks in a list, routing traces to both LangSmith and Langfuse concurrently. In practice, teams often run Helicone (as a lightweight cost/latency proxy layer) alongside LangSmith (for deep agent trace visualization); the two serve different functions and the overhead is manageable. However, duplicate tracing means every run is ingested once per platform, so your total ingestion volume doubles and each platform's quota is consumed at the full rate, which matters for free tier limits. Only run multiple platforms when one fills a specific capability gap that the other doesn't.
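
The quota effect of fanning out to two backends is easy to see in a toy model. The sketch below is not the LangChain callback API; `CountingBackend` and `export_to_all` are hypothetical stand-ins that only count what each backend receives.

```python
class CountingBackend:
    """Stand-in for an observability backend that counts ingested traces."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ingested = 0

    def export(self, trace: dict) -> None:
        self.ingested += 1


def export_to_all(trace: dict, backends: list) -> None:
    """Send the same trace to every configured backend."""
    for backend in backends:
        backend.export(trace)


backend_a = CountingBackend("backend_a")
backend_b = CountingBackend("backend_b")
for run_id in range(100):
    export_to_all({"run_id": run_id}, [backend_a, backend_b])

print(backend_a.ingested, backend_b.ingested)  # 100 100
```

Each backend sees all 100 runs, so 100 agent runs consume 100 traces of quota on both platforms.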

What does Braintrust do that LangSmith doesn't, and vice versa?

Braintrust is optimized for prompt experimentation and eval result tracking over time — its core workflow is dataset → experiment → scores → compare → promote. It excels at systematic quality improvement: tracking how quality metrics evolve across prompt versions, model changes, and agent architecture changes over weeks and months. LangSmith is optimized for production agent trace monitoring, run tree visualization, and real-time debugging of agent failures. LangSmith is better when you need to answer 'why did this specific agent run fail?' Braintrust is better when you need to answer 'is my agent's quality trending up or down over the last 30 days?' Mature teams often run both for different stages of their quality loop.
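
The dataset → experiment → scores → compare → promote loop can be sketched without any vendor SDK. Everything here is hypothetical: the two-example dataset, the lambda "versions" standing in for real agent runs, and the `exact_match` scorer, which a real setup might replace with an LLM judge.

```python
from statistics import mean
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """Toy scorer: 1.0 on exact match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_experiment(dataset: list[dict], generate: Callable[[str], str]) -> float:
    """Run one prompt/agent version over the dataset and return its mean score."""
    return mean(exact_match(generate(ex["input"]), ex["expected"]) for ex in dataset)


def should_promote(baseline: float, candidate: float, min_gain: float = 0.0) -> bool:
    """Promote only when the candidate beats the baseline by more than min_gain."""
    return candidate - baseline > min_gain


# Hypothetical golden set and two stand-in "versions" of an agent.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]
answers_v2 = {"2+2": "4", "3+3": "6"}

baseline_score = run_experiment(dataset, lambda q: "4")             # 0.5
candidate_score = run_experiment(dataset, lambda q: answers_v2[q])  # 1.0
print(should_promote(baseline_score, candidate_score))  # True
```

Storing each experiment's score alongside the prompt or model version is what turns one-off comparisons into the 30-day trend line described above.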
