By The DDH Team · Digital Dashboard Hub

Agent Eval with Langfuse (2026): 5-Step Eval Pipeline for LLM Agents


Most agent builders start evaluating their agents too late — after a quality regression surfaces in production. The right time to build an eval pipeline is before you ship the first version: instrument traces from day one, build a golden dataset from the first week of production data, and run regression tests before every model or prompt change. Langfuse makes this easier than any alternative in 2026 — it combines trace collection, dataset management, LLM-as-judge scoring, and a quality dashboard in one tool with a generous free tier. For the agent frameworks this eval pipeline integrates with, see our LangGraph tutorial and CrewAI tutorial.

The eval pipeline has five stages. Traces capture the full agent execution — every LLM call, every tool invocation, every input/output — and send them to Langfuse for storage and analysis. A dataset is a curated set of representative inputs with expected outputs — your golden test suite. Scoring annotates traces with quality metrics — either human labels or LLM-as-judge automated scoring. Dashboards aggregate scores over time to surface regressions and trends. Replay reruns specific traces through a new model version and compares scores — the core regression testing workflow. Source: Langfuse documentation.

Below: step-by-step setup for each stage, full code examples, and the operational playbook for running evals in CI/CD. Pricing: Langfuse Hobby is free forever (cloud, up to 50K traces/month), Pro is $59/month (unlimited), Enterprise is custom. Self-hosted is free for all tiers. For cost tracking within traces (which Langfuse captures automatically), see our agent loop cost calculator and tool use overhead cost calculator.


Langfuse feature reference and pricing, June 2026

| Feature | Hobby (free) | Pro ($59/mo) | Self-hosted |
| --- | --- | --- | --- |
| Traces per month | 50,000 | Unlimited | Unlimited |
| Trace retention | 30 days | 90 days | Your choice |
| Dataset runs | Unlimited | Unlimited | Unlimited |
| LLM-as-judge evals | Manual only | Automated | Automated |
| Team members | 1 | Unlimited | Unlimited |
| Annotation queues | No | Yes | Yes |
| API access | Yes | Yes | Yes |
| Prompt management | Yes | Yes | Yes |
| Dashboard / analytics | Basic | Advanced | Advanced |
| SSO / RBAC | No | Yes | Yes |
| CI/CD integration | Yes (API) | Yes (API + webhooks) | Yes |
| Model cost tracking | Yes (manual) | Yes (auto) | Yes (auto) |

Sources, fetched 2026-06-21: Langfuse pricing (https://langfuse.com/pricing), Langfuse documentation (https://langfuse.com/docs). Hobby plan is free forever with no credit card required. Pro plan is $59/month per organization (not per seat). Self-hosted is free — deploy via Docker, Kubernetes, or Railway. LLM-as-judge automated eval requires Pro or self-hosted because it uses Langfuse's managed eval pipeline; manual scoring via API is available on all plans. Model cost tracking is automatic when using Langfuse's LangChain callback integration — it reads token usage from model responses.

Step 1: instrument your agent with traces

**Traces are the foundation of every eval pipeline.** A trace is a hierarchical record of one complete agent execution — the root trace captures the full run, child spans capture individual LLM calls, tool calls, and retrieval steps. Langfuse stores traces with full input/output, token counts, latency, and cost metadata. Without traces, you're debugging production issues from symptoms (user complaints, cost anomalies) rather than from the actual execution data. Source: Langfuse tracing docs.

**The fastest integration: LangChain callback handler.** If your agent is built on LangChain, LangGraph, or CrewAI (all LangChain-compatible), the Langfuse callback captures traces automatically: `from langchain_core.messages import HumanMessage; from langfuse.callback import CallbackHandler; langfuse_handler = CallbackHandler(public_key='pk-lf-...', secret_key='sk-lf-...', host='https://cloud.langfuse.com'); result = graph.invoke({'messages': [HumanMessage(content=query)]}, config={'callbacks': [langfuse_handler]})`. The `langfuse.callback` import path is the one used by the v2 Python SDK; check the SDK reference if you are on a different major version. Every LLM call, tool call, and node traversal in the agent is captured as a nested span in Langfuse. Zero code changes to your agent logic required.

**Manual instrumentation with the Langfuse SDK.** For agents not built on LangChain, use the Python SDK directly: `from langfuse import Langfuse; langfuse = Langfuse(); trace = langfuse.trace(name='research-agent', user_id=user_id, session_id=session_id, input={'query': query}); span = trace.span(name='web-search', input={'query': query}); result = search_web(query); span.end(output={'result': result}); trace.update(output={'answer': final_answer})`. This gives you explicit control over what each span captures — useful for custom agent architectures where the automatic callback can't see the full execution structure.

**Enrich traces with metadata for downstream filtering.** Tag each trace with metadata that enables filtering in the Langfuse dashboard: `trace = langfuse.trace(name='agent-run', metadata={'topic': topic, 'agent_version': '1.2', 'model': 'claude-sonnet-4-6', 'crew_type': '3-agent-research'})`. Good metadata fields: user_id (for per-user quality analysis), session_id (for conversation-level filtering), agent version (for version comparison), model name (for model A/B analysis), task type (for per-task-type quality breakdown). Without metadata, your traces are a single undifferentiated stream — hard to filter, hard to analyze. Source: Langfuse trace metadata docs.

**Token cost tracking.** Langfuse automatically captures token usage from LangChain-instrumented calls (it reads the usage data LangChain attaches to `LLMResult.llm_output`). For direct SDK instrumentation, record usage on a generation, the observation type that carries model and token data, rather than on a plain span: `generation = trace.generation(name='claude-call', model='claude-sonnet-4-6', input=prompt, output=response, usage={'input': input_tokens, 'output': output_tokens, 'unit': 'TOKENS'})`. Langfuse uses its model cost table (updated with Anthropic and OpenAI pricing periodically) to compute dollar cost from token counts. View cumulative cost per trace, per user, and per day in the analytics dashboard. Alert if daily cost exceeds your budget. Pricing reference: Anthropic pricing, OpenAI pricing.
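The token-to-dollar conversion described above is plain arithmetic. A minimal sketch, assuming a hypothetical price table: the model name and prices in `PRICES_PER_MTOK` are placeholders for illustration, not Langfuse's cost table or current vendor pricing.

```python
# Hypothetical per-million-token prices in USD; replace with current vendor pricing.
PRICES_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def token_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Convert token counts into a dollar cost using the price table."""
    price = PRICES_PER_MTOK[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# One call with 1,500 input tokens and 300 output tokens.
print(f"${token_cost_usd('example-model', 1_500, 300):.4f}")
```

Keeping the price table in one place makes it easy to re-price historical token counts when vendor pricing changes.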

**Verify tracing is working.** After running a test query through your agent, open cloud.langfuse.com, navigate to Traces, and look for your trace. Click into it to see the full execution tree: spans for each LLM call, tool call, and retrieval step, with inputs, outputs, latency, and token counts for each. If you see the trace, instrumentation is working. If you see no traces, check: LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY environment variables are set, the Langfuse host URL is correct, and the callback handler is being passed to every agent invocation. Source: Langfuse quickstart guide.


Step 2: build a golden dataset

**A dataset is your agent's test suite.** It's a curated list of input/expected_output pairs that represent the range of inputs your agent should handle well. The golden dataset is what you run regression tests against — 'does this model/prompt change break anything on the test suite?' Without a dataset, every change to your agent is a leap of faith. With a 30-50 example dataset, every change is a measurable comparison. Source: Langfuse datasets docs.

**Create a dataset via the API.** `langfuse = Langfuse(); dataset = langfuse.create_dataset(name='research-agent-golden-v1', description='30 representative research queries with expected output quality criteria')`. Add items: `langfuse.create_dataset_item(dataset_name='research-agent-golden-v1', input={'query': 'What is the market size of the autonomous vehicle market in 2026?'}, expected_output={'required_elements': ['market size in dollars', 'CAGR', 'at least 1 credible source', 'date-stamped data'], 'quality_score_threshold': 0.8})`. The `expected_output` format is flexible — it can be a full expected response string (for exact-match scoring) or a schema of required elements (for LLM-as-judge scoring).

**Source your dataset from production traces.** After running the agent in production for 1-2 weeks, identify the 30-50 most representative traces — the ones that cover your task distribution (not just the easy cases). In the Langfuse UI: Traces → filter by session metadata → select representative examples → 'Add to dataset'. This is more reliable than synthesizing test cases from scratch because real production inputs catch edge cases you wouldn't think to invent. **The 30 most representative real production traces make a better test suite than 100 synthetic examples.**

**Include edge cases and failure modes.** A good dataset includes not just the typical cases but also: ambiguous queries (where the model might hallucinate), long queries (that stress context window limits), multi-part queries (that require multiple tool calls), queries with recent events (that require current information), and queries that previously caused failures (regression tests for fixed bugs). Label each item with its type: `metadata={'category': 'edge_case', 'failure_mode': 'hallucination'}`. This lets you filter results by category and understand which types of queries have the most quality issues.

**Dataset versioning.** Create a new dataset version when you make significant changes to what good output looks like: `dataset_v2 = langfuse.create_dataset(name='research-agent-golden-v2')`. Keep old versions — they're the baseline for historical comparisons. When you change your quality criteria (new required elements, different score threshold), run both old and new criteria against the same traces to understand the impact. Never modify a golden dataset in place; always create a new version. Source: Langfuse dataset versioning docs.

**Minimum viable dataset size.** 30 examples is a practical minimum for regression testing: enough to catch a large quality drop, though how small a regression you can reliably detect depends on the run-to-run variance of your agent's outputs. Run the same dataset twice against an unchanged configuration to measure that noise before you trust small score differences. 50-100 examples is the production standard. Don't try to create 200 examples at once: build to 30, ship to production, and add the most interesting new inputs from production weekly. A dataset grows best incrementally from real usage data, not from a one-time brainstorming session. Once your dataset exceeds 50 examples, begin prioritizing the 30 that discriminate most between good and bad model versions.


Step 3: set up automated LLM-as-judge scoring

**LLM-as-judge is the standard eval approach for open-ended agent outputs.** Because agent outputs are long-form text (research briefs, code, analyses) rather than structured labels, automated exact-match scoring doesn't work well. Instead, use a separate LLM call to score each output against a rubric. Langfuse provides a built-in eval pipeline that runs LLM-as-judge scoring on all traces automatically. Source: Langfuse evals docs.

**Configure a Langfuse eval template.** In the Langfuse UI: Evals → Templates → New Template. Define the scoring prompt: `Name: factual_accuracy; Template: 'Score the following research output on factual accuracy (0-10). Required criteria: (1) all statistics include a source and date (2) no contradictions between facts (3) no statements that appear to be hallucinated. Output JSON: {score: int, reasoning: str}. Input: {{input}}, Output: {{output}}, Expected: {{expected_output}}'`. The `{{input}}`, `{{output}}`, and `{{expected_output}}` placeholders are substituted with the trace's actual data. The LLM scores each trace and Langfuse stores the score alongside the trace.
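If you run the judge yourself (for example on the Hobby plan, where automated evals are not available), the placeholder substitution and JSON parsing are easy to reproduce. A minimal sketch under that assumption; `render` and `parse_judge_reply` are hypothetical helper names, and the call to the judge model itself is left out.

```python
import json

TEMPLATE = (
    "Score the following research output on factual accuracy (0-10). "
    "Output JSON: {score: int, reasoning: str}. "
    "Input: {{input}}, Output: {{output}}, Expected: {{expected_output}}"
)


def render(template: str, **values: str) -> str:
    """Replace each {{name}} placeholder with the matching value."""
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", value)
    return template


def parse_judge_reply(reply: str) -> tuple[int, str]:
    """Parse the judge's JSON reply and reject scores outside 0-10."""
    data = json.loads(reply)
    score = int(data["score"])
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return score, str(data["reasoning"])


prompt = render(TEMPLATE, input="query text", output="agent answer", expected_output="criteria")
print(prompt)
print(parse_judge_reply('{"score": 7, "reasoning": "two statistics lack a source"}'))
```

Validating the score range at parse time keeps a malformed judge reply from silently skewing your averages.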

**Define multiple evaluation dimensions.** A single 'quality' score is insufficient for debugging — you need dimension-specific scores to know whether a regression is in factual accuracy, structure, completeness, or tone. Define separate eval templates for each dimension: `factual_accuracy` (0-10), `structure_compliance` (0-10, does the output have the required sections?), `completeness` (0-10, are all required elements present?), `source_quality` (0-10, are sources credible and recent?). Run all four on every dataset run. When a regression appears, the dimension scores tell you exactly what changed.

**Run automated evals on your dataset.** `langfuse.run_dataset_evaluation(dataset_name='research-agent-golden-v1', run_name='sonnet-4-6-v1.2', trace_metadata={'model': 'claude-sonnet-4-6', 'agent_version': '1.2'}, eval_template_names=['factual_accuracy', 'structure_compliance', 'completeness', 'source_quality'])`. Treat the `run_dataset_evaluation` name as illustrative and verify the exact call against the SDK reference for your version; the operation it stands for re-runs all dataset items through your agent, links the resulting traces to a named dataset run, and scores each trace with the specified eval templates. A dataset run for 30 items with 4 eval dimensions on GPT-5.4-mini as the judge model costs approximately: 30 × 4 = 120 judge calls; 120 × 1,500 input tokens × $2.50 per 1M = $0.45, plus 120 × 200 output tokens × $15 per 1M = $0.36, for about $0.81 per dataset run. At that price you can run it on every pull request. Source: Langfuse LLM-as-judge docs.
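The per-run judge cost is the product of call count, tokens per call, and per-token price, with input and output priced separately. A minimal sketch of that arithmetic using the inputs quoted above (30 items, 4 dimensions, 1,500 input and 200 output tokens per call, $2.50 and $15 per 1M tokens); `judge_run_cost` is a hypothetical helper name.

```python
def judge_run_cost(
    items: int,
    dimensions: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate the judge cost in USD for one dataset run."""
    calls = items * dimensions  # one judge call per item per dimension
    input_cost = calls * input_tokens * input_price_per_mtok / 1_000_000
    output_cost = calls * output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


cost = judge_run_cost(30, 4, 1_500, 200, 2.50, 15.00)
print(f"${cost:.2f} per dataset run")
```

Plugging in your own token averages and judge pricing gives a quick budget check before you wire the eval into CI.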

**Human scoring via annotation queues.** For high-stakes quality decisions (model upgrades, major prompt changes), supplement LLM-as-judge with human scoring via Langfuse's annotation queue. In the UI: Annotations → New Queue → add the dataset run's traces → assign to reviewers. Reviewers see the input and output side-by-side and rate on each dimension. Human scores are stored alongside automated scores and can be used to calibrate the LLM judge — compare human scores vs judge scores to tune the judge's rubric. A 30-item human annotation pass takes 1-2 hours per reviewer and provides the ground truth for all future automated comparisons.

**Choose the judge model carefully.** The judge model should be more capable than the agent model for reliable scoring. If your agent runs on Sonnet 4.6, use Opus 4.7 or GPT-5.5 as the judge. If your agent runs on GPT-5.4, use GPT-5.5 as the judge. Using the same model as both agent and judge creates a self-referential eval — the judge has the same failure modes as the agent and misses the same mistakes. Using a weaker model as judge introduces systematic blind spots. The cost of running Opus 4.7 as judge on 30 traces is approximately: 30 × (1,500 + 300) tokens × $15/1M = $0.81 — cheap for the confidence it provides.


Step 4: read quality dashboards and set up alerts

**Langfuse's analytics dashboard** aggregates trace data and eval scores across time, model versions, and task types. The key views: (1) Score over time — line chart of average quality scores per day. A downward trend is a quality regression; look for the model or prompt change that coincides with the drop. (2) Score by model — compare average scores across model versions. Shows the quality impact of model upgrades. (3) Score by trace metadata — filter scores by task type, user cohort, or agent version. Shows whether regressions are global or isolated to specific inputs. (4) Cost over time — daily LLM spend. Spikes correlate with agent behavior changes, not just traffic growth. Source: Langfuse analytics docs.

**The daily quality metric to track.** Define a single top-level quality metric: `composite_score = 0.4 × factual_accuracy + 0.3 × completeness + 0.2 × structure + 0.1 × source_quality`. Compute this for every dataset run and plot it over time. A composite score < 7.0 triggers a code freeze — no new model or prompt changes until the score recovers. A score < 6.0 triggers a rollback. Set these thresholds based on your acceptable quality floor, not arbitrary round numbers. Review and recalibrate thresholds quarterly as your agent's capability and your users' expectations evolve.
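The composite metric and its two thresholds can be sketched in a few lines. This is an illustration of the formula above, not part of Langfuse; the function names are hypothetical, and the result is rounded to two decimals so floating-point noise cannot flip a borderline score across a threshold.

```python
# Weights from the composite formula above; they sum to 1.0.
WEIGHTS = {
    "factual_accuracy": 0.4,
    "completeness": 0.3,
    "structure": 0.2,
    "source_quality": 0.1,
}


def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of the four dimension scores, rounded to 2 decimals."""
    return round(sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS), 2)


def gate_action(score: float, freeze_below: float = 7.0, rollback_below: float = 6.0) -> str:
    """Map a composite score to the action described in the text."""
    if score < rollback_below:
        return "rollback"
    if score < freeze_below:
        return "freeze"
    return "ok"


run = {"factual_accuracy": 8.0, "completeness": 7.0, "structure": 6.0, "source_quality": 5.0}
score = composite_score(run)
print(score, gate_action(score))
```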

**Set up alerts for quality regressions.** Langfuse Pro supports webhook alerts when scores drop below a threshold. Configure: `Settings → Alerts → New Alert → Metric: factual_accuracy, Condition: < 6.5, Action: POST https://your-webhook/alert`. The webhook payload includes the eval run name, score, and a link to the Langfuse dashboard. Route to your team's Slack channel or PagerDuty. For Hobby plan (no webhooks), schedule a daily script that reads scores via the Langfuse API and sends a Slack alert: `from langfuse import Langfuse; langfuse = Langfuse(); scores = langfuse.get_dataset_run_scores(run_name='daily-eval'); if scores['factual_accuracy'] < 6.5: slack_alert('Quality regression detected!')`. Here `get_dataset_run_scores` and `slack_alert` are illustrative names; confirm the score-fetching call against the API reference for your SDK version. Source: Langfuse API reference.

**Cost anomaly detection.** Langfuse tracks token cost per trace. Build a daily cost alert: compute the 7-day rolling average cost per trace and alert if today's average is more than 50% above the rolling average. A sudden cost spike without a traffic increase means an agent behavior change (more tool calls per query, longer outputs, broken caching). This alert catches runaway costs before they accumulate. `daily_avg_cost = langfuse.get_traces_aggregate(from_timestamp=yesterday, metrics=['cost'])['avg_cost']; rolling_avg = compute_7d_rolling_avg(); if daily_avg_cost > rolling_avg * 1.5: alert_team()`.
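The rolling-average check itself needs no SDK calls once you have daily average costs. A minimal sketch, assuming you have already collected one average cost-per-trace value per day; `cost_spike` is a hypothetical helper name.

```python
from statistics import mean


def cost_spike(
    daily_avg_costs: list[float],
    today_avg_cost: float,
    window: int = 7,
    threshold: float = 1.5,
) -> bool:
    """Return True when today's average cost per trace exceeds the
    rolling average of the last `window` days by the threshold factor."""
    recent = daily_avg_costs[-window:]
    if len(recent) < window:
        return False  # not enough history to judge
    return today_avg_cost > mean(recent) * threshold


history = [0.10, 0.11, 0.09, 0.10, 0.10, 0.12, 0.08]  # last 7 days, USD per trace
print(cost_spike(history, 0.16))  # well above 1.5x the rolling average
print(cost_spike(history, 0.12))  # within the normal range
```

Returning `False` on short history avoids false alarms during the first week of tracing.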

**Per-user quality analysis.** If your agent serves multiple users or tenants, track quality per user cohort. Filter traces by `user_id` metadata and compare score distributions. Users with low quality scores (< 6.0 average) may have unusual query patterns that your agent handles poorly — or they may be hitting a specific edge case. Surface the top 5 lowest-quality user cohorts weekly and review their traces manually. This is how you find systematic failure modes that your dataset doesn't yet cover — add representative examples from low-quality users to your dataset. Source: Langfuse user segmentation docs.
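The weekly "lowest five cohorts" report is a group-by over scored traces. A minimal sketch, assuming you have exported `(user_id, score)` pairs from your scored traces; `lowest_scoring_users` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def lowest_scoring_users(
    scored_traces: list[tuple[str, float]],
    top_n: int = 5,
    threshold: float = 6.0,
) -> list[tuple[str, float]]:
    """Return up to top_n users whose average score is below the threshold,
    lowest average first."""
    by_user: dict[str, list[float]] = defaultdict(list)
    for user_id, score in scored_traces:
        by_user[user_id].append(score)
    flagged = [(user, mean(scores)) for user, scores in by_user.items() if mean(scores) < threshold]
    return sorted(flagged, key=lambda pair: pair[1])[:top_n]


traces = [("alice", 5.0), ("alice", 6.0), ("bob", 9.0), ("carol", 4.0)]
print(lowest_scoring_users(traces))
```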

**Production quality gates.** Once your eval pipeline is running, integrate it into your deployment workflow: before any production deployment, run the dataset evaluation and require a minimum composite score of 7.0 to proceed. This gates deployments on quality, not just on unit tests passing. Implement it as a CI/CD step that exits non-zero when the score is below the floor, for example `python scripts/run_eval.py --min-score 7.0`, where the script is your own and calls `sys.exit(1)` when the composite score falls under the threshold. A failed eval blocks the deployment. Pair with the trace replay workflow (Step 5) for detailed debugging of why the score dropped. Source: Langfuse CI/CD integration docs.


Step 5: trace replay for regression testing and debugging

**Trace replay runs a saved trace's input through a new agent configuration and compares scores.** It's the core debugging workflow for regressions: you changed the model or prompt, the score dropped, and you need to know exactly which inputs regressed and why. Replay the dataset against both the old and new configuration, compare trace-by-trace scores, and read the traces where the score dropped to understand the failure mode. Source: Langfuse trace comparison docs.

**The replay workflow in Python.** Load dataset items, run each through the new configuration, and link every resulting trace to a named dataset run. With the v2 Python SDK, the run is created implicitly the first time an item is linked under a new `run_name`: `dataset = langfuse.get_dataset('research-agent-golden-v1'); run_name = 'gpt-5.5-v1.3'; for item in dataset.items: with item.observe(run_name=run_name) as trace_id: output = agent.run(item.input['query']); langfuse.score(trace_id=trace_id, name='factual_accuracy', value=score_trace(item.input, output, item.expected_output))`. Each `item.observe()` context manager yields the ID of a trace linked to the dataset item, enabling direct comparison between runs; decorate the agent function with `@observe()` so its input and output are recorded on that trace. Call `langfuse.flush()` at the end of the script so queued events are sent before the process exits.

**Comparing two runs side by side.** Langfuse's dataset run comparison view shows both runs simultaneously: for each dataset item, the old run's output and score on the left, the new run's output and score on the right. Sort by score delta (old - new) descending to see the biggest regressions first. For each regressed item, read the full trace: which tool calls were different? Did the model skip a required reasoning step? Did the output length change? The trace comparison usually reveals the failure mode within 5 minutes — much faster than trying to reason about aggregate score changes. Source: Langfuse dataset runs docs.
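If you export per-item scores for both runs, the same "biggest regressions first" ordering is easy to reproduce offline. A minimal sketch, assuming two dictionaries that map dataset item IDs to scores; `biggest_regressions` is a hypothetical helper name.

```python
def biggest_regressions(
    old_scores: dict[str, float],
    new_scores: dict[str, float],
    limit: int = 10,
) -> list[tuple[str, float]]:
    """Return (item_id, old - new) for items whose score dropped,
    largest drop first. Items missing from either run are skipped."""
    shared = old_scores.keys() & new_scores.keys()
    deltas = [(item_id, old_scores[item_id] - new_scores[item_id]) for item_id in shared]
    regressed = [pair for pair in deltas if pair[1] > 0]
    return sorted(regressed, key=lambda pair: pair[1], reverse=True)[:limit]


old = {"item-1": 8.0, "item-2": 9.0, "item-3": 7.0}
new = {"item-1": 5.0, "item-2": 9.0, "item-3": 6.0}
print(biggest_regressions(old, new))
```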

**Automated regression testing in CI/CD.** Run trace replay automatically on every pull request: `# .github/workflows/eval.yml; name: Eval; on: [pull_request]; jobs: eval: runs-on: ubuntu-latest; steps: - uses: actions/checkout@v3; - run: pip install langfuse; - run: python scripts/run_eval.py; - run: python scripts/compare_scores.py --baseline main --candidate HEAD --min-score 7.0`. The `compare_scores.py` script loads the baseline (main branch eval run) and candidate (PR eval run) scores, computes the delta, and exits 1 if any key dimension drops more than 0.5 points. This blocks PRs that introduce quality regressions before they merge.
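The core of a script like `compare_scores.py` is a per-dimension comparison that turns into an exit code. A minimal sketch of that logic, assuming both runs have already been reduced to a dictionary of average scores per dimension; how you load those averages is up to your own script, and the function names are hypothetical.

```python
import sys


def regressed_dimensions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.5,
) -> list[str]:
    """List dimensions whose score dropped by more than max_drop.
    A dimension missing from the candidate counts as a score of 0."""
    return sorted(
        dim for dim, base in baseline.items()
        if base - candidate.get(dim, 0.0) > max_drop
    )


def exit_code(baseline: dict[str, float], candidate: dict[str, float], max_drop: float = 0.5) -> int:
    """Return 1 (block the PR) when any dimension regressed, else 0."""
    return 1 if regressed_dimensions(baseline, candidate, max_drop) else 0


if __name__ == "__main__":
    main_run = {"factual_accuracy": 8.0, "completeness": 7.5}
    pr_run = {"factual_accuracy": 7.2, "completeness": 7.4}
    print("regressed:", regressed_dimensions(main_run, pr_run))
    sys.exit(exit_code(main_run, pr_run))
```

Treating a missing dimension as a regression is a deliberate choice here: a PR that silently drops an eval dimension should not pass the gate.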

**Debugging with trace inspector.** When a trace shows a quality regression, open it in the Langfuse trace inspector: you see the full execution tree with each LLM call's exact input and output, tool call arguments and results, latency per step, and token counts. For LangGraph agents, you see the node-by-node traversal. For CrewAI crews, you see each agent's thoughts, tool calls, and task completion message. This is the fastest way to answer 'why did the quality drop?' — look at the exact inputs the model saw and the exact outputs it produced. No guessing, no log parsing, no reproduction needed. Source: Langfuse trace viewer docs.

**The eval flywheel.** The full workflow creates a virtuous cycle: instrument → collect traces → build dataset from best examples → run evals → monitor dashboards → find regressions via trace replay → fix the agent → re-run dataset → confirm score recovers → repeat. Each iteration makes the dataset more representative, the scores more reliable, and the agent more robust. Teams that run this flywheel weekly ship higher-quality agents than teams that eval quarterly. The infrastructure cost is low (Langfuse Hobby is free; LLM judge costs ~$0.50-$1.00 per dataset run); the quality cost of skipping eval is high. See Langfuse pricing for tier details.


Integrating Langfuse with LangGraph and CrewAI

**LangGraph + Langfuse integration.** The LangChain callback handler captures LangGraph execution automatically. One change: pass the handler to every graph invocation: `result = graph.invoke(input, config={'callbacks': [langfuse_handler], 'configurable': {'thread_id': thread_id}})`. Langfuse creates one root trace per `invoke()` call with nested spans for each LangGraph node traversal. Node names in Langfuse match the node names you set in `graph_builder.add_node()` — use descriptive names ('researcher', 'writer', 'tool_executor') rather than defaults ('node1', 'node2'). Source: LangGraph observability docs.

**CrewAI + Langfuse integration.** CrewAI is LangChain-compatible. Pass the Langfuse callback to each agent's LLM configuration: `llm_with_langfuse = ChatAnthropic(model='claude-sonnet-4-6', callbacks=[langfuse_handler]); researcher = Agent(..., llm=llm_with_langfuse)`. The callback captures each agent's LLM calls as separate spans. To group all spans from one crew run under a single root trace with the v2 SDK, create the trace first and derive the handler from it before building the LLMs and agents: `trace = langfuse.trace(name='research-crew', input=inputs); langfuse_handler = trace.get_langchain_handler()`. After the run, record the result on the same trace: `result = crew.kickoff(inputs=inputs); trace.update(output=result.raw)`. This nests all agent spans under one trace for clean per-run analysis.

**Prompt management via Langfuse.** Store your agent prompts in Langfuse's prompt management system instead of hardcoding them: `prompt = langfuse.get_prompt('researcher-system-prompt'); system_message = prompt.compile(topic=topic)`. When you change the researcher's system prompt, create a new version in Langfuse and deploy via the `langfuse.get_prompt()` call — no code deployment required. Each trace automatically records which prompt version was used, so you can filter eval results by prompt version and see exactly how a prompt change affected quality. This is the production pattern for prompt iteration without code releases.

**Pydantic AI and Langfuse.** If you're using Pydantic AI (a strongly typed agent framework), Langfuse receives its traces over OpenTelemetry. Pydantic AI emits OpenTelemetry spans once instrumentation is switched on, for example with `Agent.instrument_all()`, and Langfuse ingests those spans through its OpenTelemetry support. Follow the Langfuse Pydantic AI integration guide for the exporter configuration that matches your SDK version, since the import paths differ between releases. Traces then appear in Langfuse alongside traces from other frameworks. Pydantic AI's type-validated tool calls and agent outputs make LLM-as-judge scoring easier: expected outputs can be Pydantic models with exact field validation. Source: Pydantic AI instrumentation docs.

**Cost attribution across agent frameworks.** Langfuse's model cost tracking works the same way regardless of which framework you use: it reads token usage from each LLM call's metadata. Set `model` on each generation for correct cost attribution when using custom or fine-tuned models: `generation.update(model='claude-sonnet-4-6', usage={'input': tokens_in, 'output': tokens_out})`. Run the Langfuse cost dashboard weekly and review the cost-per-trace trend. If average cost per trace is rising without a quality improvement, investigate which agent step is consuming more tokens; the usual cause is a prompt change that made system prompts longer or an increase in tool result size. See tool use overhead calculator for the components to profile.

**AgentOps as an alternative.** AgentOps is an alternative observability platform that focuses on agent-specific metrics: agent replay, cost tracking, and error monitoring. It integrates with CrewAI natively (`agentops.init(api_key='...')` before crew instantiation — one line) and with LangChain via callback. The main tradeoff vs Langfuse: AgentOps has better CrewAI-native support and a simpler setup, but fewer eval/dataset features. Langfuse has a more complete eval pipeline and prompt management. For teams where CrewAI is the primary framework, AgentOps is worth evaluating alongside Langfuse.


Advanced eval techniques: evals at scale and cost optimization

**Sampling instead of full-dataset evals.** At 100K+ queries/month, running LLM-as-judge on every trace is expensive. Use statistical sampling: eval a random 2-3% of production traces (2,000-3,000 traces per month at 100K queries, or roughly 65-100 per day), aggregated into weekly trend reports. Whether that sample is large enough to detect a given regression depends on the variance of your scores, so check the week-to-week noise of the sampled average before relying on it. The traces API is paginated, so collect trace IDs page by page rather than requesting 100,000 in one call, then sample from the collected list: `import random; sampled = random.sample(all_traces, int(len(all_traces) * 0.02))`. This reduces eval LLM costs by 98% vs scoring all traces. Source: Langfuse API reference.

**Tiered eval: fast cheap scores first, expensive deep evals on failures.** Run a fast, cheap first-pass judge (GPT-5.4-mini, $0.00075 per eval call) on every sampled trace. For traces scoring below 7.0 on the fast pass, run the expensive deep eval (Opus 4.7 as judge, $0.081 per eval call). This tiered approach applies expensive evaluation only where it adds signal: traces that score well on the fast pass are unlikely to be deep failures. `fast_score = fast_judge.score(trace); if fast_score < 7.0: deep_score = deep_judge.score(trace)`. At 2,000 sampled traces/day (2% of roughly 100K production traces per day) with 15% falling below 7.0: 2,000 × $0.00075 + 300 × $0.081 = $1.50 + $24.30 = $25.80/day, about 84% cheaper than scoring every sampled trace with Opus at $162/day.
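The tiered flow and its daily cost can be sketched with the judges passed in as plain callables. This is an illustration under the figures quoted above; the function names are hypothetical and the judges here are stand-ins, not real model calls.

```python
from typing import Callable, Optional


def tiered_score(
    trace: str,
    fast_judge: Callable[[str], float],
    deep_judge: Callable[[str], float],
    threshold: float = 7.0,
) -> tuple[float, Optional[float]]:
    """Score with the fast judge; escalate to the deep judge only below the threshold."""
    fast = fast_judge(trace)
    deep = deep_judge(trace) if fast < threshold else None
    return fast, deep


def tiered_daily_cost(traces: int, escalation_rate: float, fast_cost: float, deep_cost: float) -> float:
    """Daily judge cost: every trace gets a fast call, a fraction also gets a deep call."""
    return traces * fast_cost + traces * escalation_rate * deep_cost


print(tiered_score("trace text", lambda t: 6.0, lambda t: 5.5))
print(f"${tiered_daily_cost(2_000, 0.15, 0.00075, 0.081):.2f} per day")
```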

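**Eval caching for stable inputs.** Some inputs in your golden dataset are identical across every eval run (static test cases). When the agent's output for such an item is also unchanged, the judge's score can be reused. Build the cache key from everything the score depends on (input, expected output, agent output, and rubric version), and use a stable digest rather than Python's built-in `hash()`, which is salted per process for strings and so changes between CI runs: `cache_key = hashlib.sha256(json.dumps({'input': item.input, 'expected': item.expected_output, 'output': output, 'rubric': rubric_version}, sort_keys=True).encode()).hexdigest(); if cache_key in score_cache: score = score_cache[cache_key]`. With this key, an item is re-scored only when the judge rubric changes or the agent's output changes. At 30 static dataset items, a fully cached run saves 30 × $0.027 (Opus judge cost) = $0.81 per eval run, which is modest but adds up to $10-$30/month of savings on frequent CI/CD evals.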
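A persistent score cache needs a key that is stable across processes and that changes whenever anything the score depends on changes. Python's built-in `hash()` is salted per process for strings, so a digest such as SHA-256 is the safer choice. A minimal sketch; the function names and the `rubric_version` field are assumptions for illustration.

```python
import hashlib
import json
from typing import Callable


def cache_key(item_input: dict, expected_output: dict, agent_output: str, rubric_version: str) -> str:
    """Stable digest over everything the judge score depends on."""
    payload = json.dumps(
        {"input": item_input, "expected": expected_output,
         "output": agent_output, "rubric": rubric_version},
        sort_keys=True,  # key order must not affect the digest
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_score(cache: dict[str, float], key: str, judge: Callable[[], float]) -> float:
    """Return the cached score, calling the judge only on a cache miss."""
    if key not in cache:
        cache[key] = judge()
    return cache[key]


cache: dict[str, float] = {}
key = cache_key({"query": "market size"}, {"required": ["CAGR"]}, "answer text", "v1")
print(cached_score(cache, key, lambda: 8.0))
print(cached_score(cache, key, lambda: 0.0))  # served from cache, judge not called
```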

**Bias detection in LLM-as-judge scoring.** LLM judges have systematic biases: verbosity bias (prefer longer outputs), self-agreement bias (Sonnet scoring Sonnet outputs too favorably), recency bias (prefer outputs that mention recent events). Detect these in your data: `verbose_correlation = correlation(output_length, score)`. If Pearson r > 0.4, your judge has verbosity bias. Fix by explicitly instructing the judge: 'Score only based on accuracy and completeness. Longer responses should not receive higher scores than shorter responses that cover the same information.' Re-calibrate quarterly by computing Pearson correlations across your scored traces. Source: Langfuse evaluation best practices.

**Online eval vs offline eval.** Offline eval is what we've described — scoring traces after the fact, on datasets. Online eval scores production traces in real-time as part of the agent pipeline. Online eval adds latency (1-2 seconds for the judge call) but catches quality issues before the output is delivered to the user. For high-stakes outputs (contract drafts, financial reports, medical summaries), online eval acts as a quality gate: if score < 7.0, regenerate before delivering. Online eval cost: every production call becomes 1 agent call + 1 judge call. On Sonnet 4.6 ($0.137/task) + fast judge ($0.00075/eval) = $0.138/task — a 0.5% cost increase for a quality gate on every output. Justified for high-stakes outputs; overkill for routine content generation.
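The "regenerate before delivering" gate is a bounded retry loop around the agent and judge. A minimal sketch with both passed in as callables; `generate_with_gate` is a hypothetical helper name, and the cap on attempts is an added assumption so that a consistently low-scoring task cannot loop forever.

```python
from typing import Callable


def generate_with_gate(
    generate: Callable[[], str],
    judge: Callable[[str], float],
    min_score: float = 7.0,
    max_attempts: int = 3,
) -> tuple[str, float]:
    """Regenerate until the judge score reaches min_score or attempts run out.
    Falls back to the best-scoring attempt if none passes."""
    best_output, best_score = "", float("-inf")
    for _ in range(max_attempts):
        output = generate()
        score = judge(output)
        if score >= min_score:
            return output, score
        if score > best_score:
            best_output, best_score = output, score
    return best_output, best_score


drafts = iter(["weak draft", "strong draft"])
result = generate_with_gate(lambda: next(drafts), lambda text: 8.0 if "strong" in text else 4.0)
print(result)
```

Each retry adds one agent call and one judge call, so the worst-case cost and latency scale with `max_attempts`.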

**Eval-driven prompt optimization.** Use eval scores as the objective function for prompt optimization. Run the eval dataset with 3-5 prompt variants (different instruction phrasings, different context structures). Compare scores. The highest-scoring variant is the better prompt, provided its lead is larger than the run-to-run noise of your eval. This turns prompt engineering from intuition-based to data-driven: `variants = [prompt_v1, prompt_v2, prompt_v3]; scores = [run_dataset_eval(variant) for variant in variants]; best_prompt = variants[scores.index(max(scores))]`. At about $0.81 per dataset run (30 items × 4 dimensions with a GPT-5.4-mini judge at $2.50/$15 per 1M input/output tokens, 1,500 input and 200 output tokens per call), comparing 5 prompt variants costs about $4.05, which is cheap for the confidence of data-driven prompt selection. Integrate into your prompt iteration workflow: every prompt change gets a dataset eval score before shipping. Source: Langfuse prompt management.


Production eval playbook

**Week 1: instrument everything, collect baseline traces.** Add the Langfuse callback to all agent invocations. Deploy to a small production cohort (5-10% of traffic). Let traces accumulate for 5-7 days. Don't analyze yet — you need enough volume to get a representative sample. Check that traces are appearing in Langfuse and that token counts + costs are being captured correctly.

**Week 2: build the golden dataset from production traces.** Select 30-50 representative traces from the first week. Prioritize: diverse query types, both easy and hard cases, any traces that produced user complaints (filter by user_id, cross-reference with support tickets). Add them to a Langfuse dataset. For each item, write the expected_output as a list of required elements, not a full expected response string — this makes the LLM judge more robust to valid variation in phrasing.

**Week 3: run the first eval and calibrate the judge.** Run the dataset against your current agent configuration. Score with your LLM-as-judge templates. Compare judge scores vs manual review of 10 items — do the judge scores match your human judgment? If not, revise the judge rubric. A judge that consistently disagrees with human raters is worse than no judge. Calibrate until the judge's p50 and p90 scores align with human ratings within ±1 point on a 0-10 scale.
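The p50/p90 calibration check can be computed with the standard library. A minimal sketch, assuming paired lists of human and judge scores for the same items; the function names are hypothetical, and `statistics.quantiles` needs at least two data points.

```python
from statistics import quantiles


def percentile(values: list[float], pct: int) -> float:
    """pct-th percentile (1-99) using inclusive interpolation."""
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    return quantiles(values, n=100, method="inclusive")[pct - 1]


def judge_is_calibrated(human: list[float], judge: list[float], tolerance: float = 1.0) -> bool:
    """True when the judge's p50 and p90 are both within tolerance of the human ratings."""
    return all(
        abs(percentile(human, pct) - percentile(judge, pct)) <= tolerance
        for pct in (50, 90)
    )


human_scores = [6, 7, 8, 9, 7, 8, 6, 9, 8, 7]
judge_scores = [6.5, 7.5, 8.5, 9.5, 7.5, 8.5, 6.5, 9.5, 8.5, 7.5]
print(judge_is_calibrated(human_scores, judge_scores))
```

Matching percentiles is a distribution-level check; it is worth also reading the individual items where human and judge disagree most, since two distributions can match while individual scores do not.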

**Week 4+: integrate into deployment workflow.** Add the eval run to your CI/CD pipeline. Set minimum score thresholds. Run the full dataset eval before every model or prompt change. Treat a score drop > 0.5 points on any dimension as a blocking regression. Review the failing traces before merging the change. Add new items to the dataset as new edge cases surface in production. The eval flywheel is now running — iterate continuously.
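
The blocking rule above (a drop of more than 0.5 points on any dimension) might be enforced in CI with a gate like the one below. It is a sketch only: `find_regressions` is a hypothetical helper, the scores are invented, and fetching the real baseline and candidate means from Langfuse is left out.

```python
MAX_DROP = 0.5  # blocking threshold from the playbook


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = MAX_DROP,
) -> dict[str, float]:
    """Return {dimension: drop} for each dimension that fell by more than max_drop."""
    return {
        dim: round(baseline[dim] - candidate[dim], 3)
        for dim in baseline
        if baseline[dim] - candidate[dim] > max_drop
    }


# Invented mean scores per dimension for the baseline and the candidate change.
baseline = {"factual_accuracy": 8.2, "completeness": 7.9, "structure_compliance": 8.8}
candidate = {"factual_accuracy": 8.3, "completeness": 7.1, "structure_compliance": 8.6}

regressions = find_regressions(baseline, candidate)
exit_code = 1 if regressions else 0  # a CI script would pass this to sys.exit()
print(regressions, exit_code)
```

Returning the offending dimensions, and not just a pass/fail flag, tells the reviewer where to start reading traces.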

**Ongoing: monthly eval retrospectives.** Review the full dataset monthly: remove items that are no longer representative, add items from new failure modes, recalibrate thresholds based on the score distribution. Track 3 metrics: dataset coverage (does the dataset cover your current production input distribution?), judge calibration (does the judge match human raters on new traces?), and score trend (is quality improving, stable, or declining over the past 30 days?). These three metrics tell you whether your eval pipeline is still useful or needs renovation. Source: Langfuse evaluation best practices.

Set up your agent eval pipeline with Langfuse in 5 steps

  1. Instrument your agent with the Langfuse callback

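    Install `langfuse`. For LangChain/LangGraph/CrewAI: `from langfuse.callback import CallbackHandler; langfuse_handler = CallbackHandler(public_key='pk-lf-...', secret_key='sk-lf-...')`. This is the Python SDK v2 import path; SDK v3 exposes the handler as `from langfuse.langchain import CallbackHandler` and reads keys from the `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` environment variables or from the `Langfuse` client. Pass the handler to every agent invocation: `graph.invoke(input, config={'callbacks': [langfuse_handler]})`. Verify that traces appear in cloud.langfuse.com. Add trace metadata (`user_id`, `session_id`, `agent_version`, `model`) to every trace for downstream filtering. Enable token cost tracking by ensuring the model name is set on each LLM call span.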

  2. Build a 30-50 item golden dataset from production traces

    After 1-2 weeks of production tracing, open Langfuse Traces and select 30-50 representative examples. Prioritize diverse query types and any inputs that produced user complaints or unusual outputs. For each item, define `expected_output` as a list of required elements (not a full response string). Add via the UI ('Add to dataset') or API. Include edge cases: ambiguous queries, recent-event queries, multi-part queries. This dataset is your regression test suite — build it before you need it.

  3. Set up LLM-as-judge scoring with multiple dimensions

    In Langfuse UI: Evals → Templates → New Template. Create 3-4 dimension templates: `factual_accuracy`, `completeness`, `structure_compliance`, and `source_quality`. Each template has a scoring prompt with `{{input}}`, `{{output}}`, `{{expected_output}}` placeholders and outputs `{score: int (0-10), reasoning: str}`. Use a more capable model than your agent as the judge (Opus 4.7 or GPT-5.5 to judge Sonnet 4.6 or GPT-5.4 outputs). Calibrate by comparing judge scores to your manual ratings on 10 items before running at scale.

  4. Run dataset evaluations and monitor the quality dashboard

    Run `langfuse.run_dataset_evaluation(dataset_name='your-dataset', run_name='v1.0-baseline', eval_template_names=['factual_accuracy', 'completeness', 'structure_compliance'])`; confirm the exact method name and arguments against your installed SDK version, because the dataset-run API differs between Langfuse SDK releases. Open the Langfuse dashboard after the run completes and check composite scores, score distributions, and per-dimension breakdowns. Set alert thresholds: composite < 7.0 means investigation required; composite < 6.0 means rollback. Add the eval run to CI/CD: run it on every PR and block the merge if any score drops more than 0.5 from the baseline run. The dataset run costs ~$0.54 with GPT-5.4-mini as judge for 30 items, so it is cheap enough to run on every PR.

  5. Replay traces to debug regressions

    When a score drops after a model or prompt change, open the dataset run comparison view in Langfuse. Sort items by score delta (old - new) in descending order. Open the top 5 regressed items and read the full trace: input, each agent step's output, tool calls, tool results. The regression cause is almost always visible in the trace within 5 minutes. Common causes: the model didn't call required tools, the tool result size changed, or the new prompt is longer (which shifts the context and changes the cached prefix). Fix the cause, re-run the dataset, and confirm the score recovers before merging.
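
The triage in step 5 (sort by old minus new, then open the worst five) can also be done offline on exported scores. The sketch below assumes you already have per-item scores for two runs as plain dictionaries; `top_regressions` and the item ids are hypothetical, and exporting the scores from Langfuse is not shown.

```python
def top_regressions(
    old_scores: dict[str, float],
    new_scores: dict[str, float],
    n: int = 5,
) -> list[tuple[str, float]]:
    """Return up to n items with the largest positive (old - new) score drop."""
    deltas = [
        (item_id, old_scores[item_id] - new_scores[item_id])
        for item_id in old_scores
        if item_id in new_scores  # only compare items present in both runs
    ]
    deltas.sort(key=lambda pair: pair[1], reverse=True)
    return [(item_id, delta) for item_id, delta in deltas[:n] if delta > 0]


# Invented per-item scores for a baseline run and a run after a prompt change.
old_run = {"item-01": 9, "item-02": 8, "item-03": 7, "item-04": 9,
           "item-05": 6, "item-06": 8, "item-07": 9}
new_run = {"item-01": 9, "item-02": 4, "item-03": 8, "item-04": 6,
           "item-05": 5, "item-06": 8, "item-07": 2}

worst = top_regressions(old_run, new_run)
print(worst)
```

Items that improved or stayed flat are filtered out, so the list contains only traces worth opening.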

Frequently Asked Questions

What is Langfuse and why use it for agent evaluation?

Langfuse is an open-source LLM observability and evaluation platform. It captures traces (full agent execution records with LLM calls, tool calls, inputs, outputs, latency, and cost), manages eval datasets, runs LLM-as-judge scoring, and provides quality dashboards. It's the most complete eval stack for LLM agents in 2026, with native integrations for LangChain, LangGraph, CrewAI, and Pydantic AI. Hobby plan is free forever (50K traces/month); Pro is $59/month for unlimited traces. Source: Langfuse docs (https://langfuse.com/docs), pricing (https://langfuse.com/pricing).

How do I integrate Langfuse with LangGraph?

Pass the Langfuse callback handler to every graph invocation: `from langfuse.callback import CallbackHandler; handler = CallbackHandler(public_key='...', secret_key='...'); result = graph.invoke(input, config={'callbacks': [handler]})`. This import path is from the Python SDK v2; in SDK v3 the handler is imported with `from langfuse.langchain import CallbackHandler`, and keys come from environment variables or the `Langfuse` client. Langfuse automatically captures every LangGraph node traversal, LLM call, and tool execution as nested spans under one root trace. Node names in Langfuse match your `add_node()` names, so use descriptive names. No changes to graph logic are required. Source: Langfuse LangChain integration docs at langfuse.com/docs.

What is LLM-as-judge scoring and how accurate is it?

LLM-as-judge uses a separate, more capable LLM to score agent outputs against a rubric. Accuracy depends on rubric quality and judge model capability. A well-calibrated judge (rubric validated against 10+ human ratings) correlates with human judgment at 0.75-0.90 Pearson correlation, which is good enough for reliable regression detection. Common failure modes: vague rubrics produce inconsistent scores, using the same model as both agent and judge creates self-preference blind spots, and judge models can show verbosity bias (they prefer longer outputs regardless of quality). Calibrate by comparing judge scores to human ratings on a sample before using the judge at scale.
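
The judge-to-human correlation mentioned above can be computed directly from paired ratings. The sketch below implements Pearson's r by hand to stay dependency-free; the scores are invented, and on Python 3.10+ `statistics.correlation` should give the same value.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Invented ratings for the same 8 outputs on a 0-10 scale.
human_ratings = [9, 7, 4, 8, 6, 3, 10, 5]
judge_ratings = [8, 7, 5, 9, 6, 4, 9, 6]

r = pearson(judge_ratings, human_ratings)
print(f"judge vs human Pearson r = {r:.3f}")
```

With only 8 to 10 rated items the estimate is noisy, so treat it as a smoke test and rate more items before trusting the number.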

How many examples do I need in my eval dataset?

30 examples is the minimum for reliable regression testing: roughly enough statistical power (about 80%) to detect a 0.5-point quality regression on a 0-10 scale when per-item score differences have a standard deviation near 1 point. 50-100 is the production standard. Draw items from real production traces (not synthetic examples) after 1-2 weeks of tracing. Include: typical inputs (60%), edge cases (25%), previously-failed inputs (15%). Don't exceed 100 items in your first version; prioritize representativeness over size. A 30-item dataset from real production data outperforms a 200-item synthetic dataset for catching real regressions.

How do I use Langfuse for A/B testing two model versions?

Run the same dataset against both model configurations and compare run scores in Langfuse's dataset run comparison view. Procedure: (1) create run A with the first model (`run_name='claude-sonnet-4-6-v1'`), (2) create run B with the second model (`run_name='gpt-5.5-v1'`), (3) open Langfuse → Datasets → your dataset → compare runs. The comparison view shows side-by-side scores per item and aggregate score distributions. Items where A > B identify tasks the first model handles better; items where B > A identify improvements. Because both runs score the same items, use a paired test (a paired t-test or the Wilcoxon signed-rank test) on the per-item score differences to check statistical significance before deciding to switch models.
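
As a rough illustration of the significance check, the sketch below runs a paired sign-flip permutation test on per-item differences. This is a dependency-free stand-in for the paired t-test or Wilcoxon test, not the same procedure; with SciPy installed, `scipy.stats.ttest_rel` or `scipy.stats.wilcoxon` would be the more usual choice. The function name and the scores are invented for the example.

```python
import random


def paired_permutation_pvalue(
    run_a: list[float],
    run_b: list[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided sign-flip permutation test on per-item score differences."""
    diffs = [b - a for a, b in zip(run_a, run_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)  # seeded so the result is reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Invented per-item scores for the same 12 dataset items under two models.
scores_a = [7, 6, 8, 5, 7, 6, 9, 4, 7, 6, 8, 5]
scores_b = [8, 7, 8, 7, 8, 7, 9, 6, 8, 7, 9, 6]

p_value = paired_permutation_pvalue(scores_a, scores_b)
print(f"p = {p_value:.4f}")
```

A small p-value says the difference is unlikely to be noise; whether the improvement is large enough to justify the switch is still a judgment call.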

What is trace replay in Langfuse?

Trace replay reruns a saved trace's input through a new agent configuration and compares the output and score to the original trace. It's the primary debugging tool for regressions: you changed something, the score dropped, and you need to know which specific inputs regressed. Load the regressed dataset run, sort items by score drop, and open the top failures in the trace inspector to see exactly what the agent did differently. No reproduction needed — the exact input is saved in the dataset item and the trace shows every step the agent took. Source: Langfuse trace comparison docs at langfuse.com/docs.

How much does running Langfuse cost?

Langfuse itself is free (Hobby: 50K traces/month, 30-day retention; self-hosted: unlimited, free). The only cost is the LLM judge model calls for automated eval scoring. At 30 dataset items × 4 eval dimensions × (1,500 input + 200 output tokens) on GPT-5.4-mini ($0.75/$4.50 per 1M): 30 × 4 × ($0.75 × 0.0015 + $4.50 × 0.0002) = 120 × $0.002025 ≈ $0.24 in judge LLM costs per dataset run. Using Opus 4.7 as judge: 30 × 4 × ($15 × 0.0015 + $75 × 0.0002) = 120 × $0.0375 = $4.50 per run. Source: Langfuse pricing (https://langfuse.com/pricing), Anthropic pricing (https://docs.anthropic.com/en/docs/about-claude/pricing).
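
The per-run judge cost follows from items × dimensions × per-call token cost, so it is easy to recompute for your own dataset size or judge model. The helper below is a sketch with a hypothetical name; the token counts and per-million prices are the ones quoted in the answer above and should be replaced with current values.

```python
def judge_cost_usd(
    items: int,
    dimensions: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Judge cost for one dataset run: one judge call per item per dimension."""
    calls = items * dimensions
    per_call = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    return calls * per_call


# Prices in USD per 1M tokens, taken from the figures quoted in the article.
mini_cost = judge_cost_usd(30, 4, 1500, 200, 0.75, 4.50)
opus_cost = judge_cost_usd(30, 4, 1500, 200, 15.0, 75.0)
print(f"GPT-5.4-mini judge: ${mini_cost:.3f}  Opus 4.7 judge: ${opus_cost:.2f}")
```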

Can I use Langfuse with CrewAI?

Yes. Pass the Langfuse callback handler to each agent's LLM: `from langfuse.callback import CallbackHandler; handler = CallbackHandler(...); llm = ChatAnthropic(model='claude-sonnet-4-6', callbacks=[handler]); agent = Agent(..., llm=llm)`. For clean per-crew-run traces with the SDK v2 API, create the trace first and derive the handler from it, since the object returned by `langfuse.trace()` is not a context manager: `trace = langfuse.trace(name='research-crew', input=inputs); handler = trace.get_langchain_handler()`. Build the agents with that handler, then run `result = crew.kickoff(inputs=inputs); trace.update(output=result.raw)`. This groups all agent spans under one trace. Alternatively, use AgentOps for simpler CrewAI integration (one-line setup) with fewer eval features. Source: CrewAI docs (https://docs.crewai.com/), Langfuse docs (https://langfuse.com/docs).
