Step 1: instrument your agent with traces
**Traces are the foundation of every eval pipeline.** A trace is a hierarchical record of one complete agent execution — the root trace captures the full run, child spans capture individual LLM calls, tool calls, and retrieval steps. Langfuse stores traces with full input/output, token counts, latency, and cost metadata. Without traces, you're debugging production issues from symptoms (user complaints, cost anomalies) rather than from the actual execution data. Source: Langfuse tracing docs.
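To make the hierarchy concrete, here is a minimal sketch of a trace as a tree of spans. The `Trace` and `Span` classes and their field names are illustrative assumptions for this article, not Langfuse's actual schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Span:
    """One step inside a run (hypothetical model, not the Langfuse schema)."""
    name: str
    kind: str  # "llm", "tool", or "retrieval"
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0
    children: List["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Tokens used by this span plus everything nested under it.
        own = self.input_tokens + self.output_tokens
        return own + sum(child.total_tokens() for child in self.children)


@dataclass
class Trace:
    """The root record for one complete agent execution."""
    name: str
    spans: List[Span] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(span.total_tokens() for span in self.spans)

    def walk(self) -> Iterator[Tuple[int, Span]]:
        # Depth-first traversal yielding (depth, span) pairs.
        stack = [(0, span) for span in reversed(self.spans)]
        while stack:
            depth, span = stack.pop()
            yield depth, span
            stack.extend((depth + 1, c) for c in reversed(span.children))


trace = Trace(
    name="research-agent",
    spans=[
        Span("plan", "llm", 820.0, input_tokens=350, output_tokens=120),
        Span("web-search", "tool", 1400.0,
             children=[Span("rerank", "retrieval", 90.0)]),
        Span("answer", "llm", 1900.0, input_tokens=900, output_tokens=400),
    ],
)

for depth, span in trace.walk():
    print("  " * depth + f"{span.name} [{span.kind}] {span.latency_ms} ms")
```

Aggregates such as total tokens or total latency fall out of a walk over the tree, which is the same kind of roll-up a tracing backend performs when it shows per-trace totals.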
**The fastest integration: LangChain callback handler.** If your agent is built on LangChain or LangGraph, or on another framework that accepts LangChain callbacks (CrewAI only in releases that still run on LangChain components), the Langfuse callback captures traces automatically: `from langfuse.callback import CallbackHandler; langfuse_handler = CallbackHandler(public_key='pk-lf-...', secret_key='sk-lf-...', host='https://cloud.langfuse.com'); result = graph.invoke({'messages': [HumanMessage(content=query)]}, config={'callbacks': [langfuse_handler]})`. Every LLM call, tool call, and node traversal in the agent is captured as a nested span in Langfuse. No changes to your agent logic are required. The `langfuse.callback` import path matches the v2 Python SDK; newer SDK releases expose the handler from `langfuse.langchain`, so check the path against your installed version.
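The one-liner above can be laid out as a small module that keeps credentials out of source code. This is a sketch under assumptions: the helper names `langfuse_handler_kwargs` and `invoke_with_tracing` are hypothetical, `graph` is assumed to be a compiled LangGraph graph (or any object with a compatible `invoke`), and the constructor arguments follow the v2-style call shown in the text.

```python
import os
from typing import Any, Mapping, Optional


def langfuse_handler_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read Langfuse credentials from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        "public_key": env["LANGFUSE_PUBLIC_KEY"],
        "secret_key": env["LANGFUSE_SECRET_KEY"],
        "host": env.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
    }


def invoke_with_tracing(graph: Any, query: str,
                        env: Optional[Mapping[str, str]] = None) -> Any:
    """Run one agent invocation with the Langfuse callback attached."""
    # Imported lazily so this module loads even where the packages are absent.
    # v2-style import path, as in the text; adjust for your SDK version.
    from langfuse.callback import CallbackHandler
    from langchain_core.messages import HumanMessage

    handler = CallbackHandler(**langfuse_handler_kwargs(env))
    return graph.invoke(
        {"messages": [HumanMessage(content=query)]},
        config={"callbacks": [handler]},
    )
```

Routing every invocation through one function like this also makes it harder to forget the handler on a code path, which is a common reason traces go missing.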
**Manual instrumentation with the Langfuse SDK.** For agents not built on LangChain, use the Python SDK directly: `from langfuse import Langfuse; langfuse = Langfuse(); trace = langfuse.trace(name='research-agent', user_id=user_id, session_id=session_id, input={'query': query}); span = trace.span(name='web-search', input={'query': query}); result = search_web(query); span.end(output={'result': result}); trace.update(output={'answer': final_answer})`. This gives you explicit control over what each span captures — useful for custom agent architectures where the automatic callback can't see the full execution structure.
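With manual spans, a tool call that raises can leave its span open. One way to guard against that is a small context manager. This is a sketch: `traced_span` is a hypothetical helper, and it assumes the span object accepts `output`, `level`, and `status_message` in `end()`, which you should verify against your SDK version.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator, Optional


@contextmanager
def traced_span(trace: Any, name: str,
                input: Optional[Any] = None) -> Iterator[Dict[str, Any]]:
    """Open a child span on `trace` and always close it.

    The caller stores its result under the "output" key of the yielded dict.
    """
    span = trace.span(name=name, input=input)
    result: Dict[str, Any] = {}
    try:
        yield result
    except Exception as exc:
        # Record the failure on the span, then let the error propagate.
        # `level` and `status_message` are assumed end() parameters.
        span.end(level="ERROR", status_message=str(exc))
        raise
    else:
        span.end(output=result.get("output"))
```

Usage would look like `with traced_span(trace, 'web-search', input={'query': query}) as out: out['output'] = {'result': search_web(query)}`, where `search_web` is the tool function from the example above.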
**Enrich traces with metadata for downstream filtering.** Tag each trace with metadata that enables filtering in the Langfuse dashboard: `trace = langfuse.trace(name='agent-run', metadata={'topic': topic, 'agent_version': '1.2', 'model': 'claude-sonnet-4-6', 'crew_type': '3-agent-research'})`. Useful fields to record: user_id (for per-user quality analysis), session_id (for conversation-level filtering), agent version (for version comparison), model name (for model A/B analysis), and task type (for per-task-type quality breakdown). Note that user_id and session_id are passed as top-level trace arguments, as in the manual instrumentation example, rather than inside the `metadata` dict. Without this context, your traces are a single undifferentiated stream that is hard to filter and hard to analyze. Source: Langfuse trace metadata docs.
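Filtering only works if every trace carries the same keys. A minimal sketch of a helper that enforces a consistent shape follows; `build_trace_kwargs` and the choice of required fields are assumptions made for illustration, not part of the Langfuse SDK.

```python
from typing import Any, Dict, Optional


def build_trace_kwargs(
    name: str,
    user_id: str,
    session_id: str,
    agent_version: str,
    model: str,
    task_type: str,
    extra: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a trace with a fixed metadata schema."""
    metadata: Dict[str, Any] = {
        "agent_version": agent_version,
        "model": model,
        "task_type": task_type,
    }
    if extra:
        # Refuse silent overwrites of the required keys.
        overlap = set(extra) & set(metadata)
        if overlap:
            raise ValueError(f"extra metadata overrides required keys: {sorted(overlap)}")
        metadata.update(extra)
    return {
        "name": name,
        "user_id": user_id,        # top-level trace field
        "session_id": session_id,  # top-level trace field
        "metadata": metadata,
    }
```

The result can be splatted into the call from the text, for example `langfuse.trace(**build_trace_kwargs(...))`, so every code path produces traces with the same filterable fields.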
**Token cost tracking.** Langfuse automatically captures token usage from LangChain-instrumented calls (it reads the usage data LangChain attaches to the `LLMResult`). For direct SDK instrumentation, pass token usage explicitly. Usage and model belong on a generation, the observation type for LLM calls, rather than on a plain span: `generation = trace.generation(name='claude-call', model='claude-sonnet-4-6', input=prompt, output=response, usage={'input': input_tokens, 'output': output_tokens, 'unit': 'TOKENS'})`. You supply the token counts, not the dollar cost: Langfuse uses its model cost table (updated with Anthropic and OpenAI pricing periodically) to compute dollar cost from token counts. View cumulative cost per trace, per user, and per day in the analytics dashboard, and monitor daily cost against your budget so you are alerted when it is exceeded. Pricing reference: Anthropic pricing, OpenAI pricing.
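The cost computation itself is simple arithmetic: tokens times the per-token price, summed over input and output. The sketch below shows that calculation plus a daily budget check. The model name and prices in `PRICES_PER_MILLION` are placeholders, not real vendor pricing, and the function names are hypothetical.

```python
from decimal import Decimal
from typing import Dict, Iterable

# Placeholder USD prices per one million tokens. NOT real vendor pricing;
# substitute values from the provider's pricing page.
PRICES_PER_MILLION: Dict[str, Dict[str, Decimal]] = {
    "example-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


def usage_cost(model: str, input_tokens: int, output_tokens: int,
               prices: Dict[str, Dict[str, Decimal]] = PRICES_PER_MILLION) -> Decimal:
    """Dollar cost of one LLM call from its token counts."""
    price = prices[model]
    total = price["input"] * input_tokens + price["output"] * output_tokens
    return total / Decimal(1_000_000)


def over_budget(call_costs: Iterable[Decimal], daily_budget: Decimal) -> bool:
    """True when the summed cost of today's calls exceeds the budget."""
    return sum(call_costs, Decimal(0)) > daily_budget


cost = usage_cost("example-model", input_tokens=1200, output_tokens=300)
print(cost)  # 0.0081 with the placeholder prices above
```

With the placeholder prices, 1,200 input tokens cost $0.0036 and 300 output tokens cost $0.0045, for $0.0081 in total. `Decimal` is used instead of floats so that many small costs sum without rounding drift.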
**Verify tracing is working.** After running a test query through your agent, open cloud.langfuse.com, navigate to Traces, and look for your trace. Click into it to see the full execution tree: spans for each LLM call, tool call, and retrieval step, with inputs, outputs, latency, and token counts for each. If you see the trace, instrumentation is working. If you see no traces, check that the LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY environment variables are set, that the Langfuse host URL is correct, that the callback handler is being passed to every agent invocation, and, for short-lived scripts, that buffered events are flushed before the process exits (for example with `langfuse.flush()`). Source: Langfuse quickstart guide.
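The configuration part of this checklist can be automated before the first test run. The following is a minimal sketch: `tracing_config_problems` is a hypothetical helper, and it only inspects environment variables. It does not contact Langfuse or confirm that the keys are valid.

```python
import os
from typing import List, Mapping, Optional
from urllib.parse import urlparse


def tracing_config_problems(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return human-readable problems with the Langfuse environment setup."""
    env = os.environ if env is None else env
    problems: List[str] = []

    for key in ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"):
        if not env.get(key):
            problems.append(f"{key} is not set")

    # Fall back to the cloud URL used elsewhere in this article.
    host = env.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
    parsed = urlparse(host)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append(f"LANGFUSE_HOST does not look like a URL: {host!r}")

    return problems


if __name__ == "__main__":
    for problem in tracing_config_problems():
        print("config problem:", problem)
```

An empty list means the variables are present and well-formed. Whether the callback handler reaches every invocation still has to be confirmed by looking for the trace in the dashboard.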