Skip to contentNew: Does ChatGPT recommend your brand? Free 60-second AI visibility check →
By The DDH Team · Digital Dashboard Hub

AI Agent Cost Calculator 2026: Per-Loop $ Math for LangGraph, Claude Agent, and Friends

By The DDH Team at Digital Dashboard HubUpdated

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

14 days, no card. Cancel in 2 clicks.

AI agents are LLM calls that consult tools (web search, code execution, database queries, custom APIs) over multiple turns before producing a final answer. As of June 2026, a typical agent loop bills 5-15x the input tokens and 8-25x the output tokens of a single direct-answer call — because the conversation history grows with each tool call result, and every tool result gets replayed as input on the next turn.

Most teams underestimate agent cost by 5-10x at planning time and overshoot the budget within the first month of production. The fix is straightforward: model the loop properly, cache the stable system prompt, batch what can wait, and pick the right model tier for each agent role. Below is the per-loop cost formula, worked examples across frameworks and models, and the patterns that cut agent bills 50-80%. For base-model cost comparison, see our GPT vs Claude vs Gemini cost calculator, or grab the free agent-cost cheat sheet PDF.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card.

AI agent loop cost per 100 loops — June 2026 (typical 6-tool-call loop, 8k context buildup)

Feature
Per-loop input tokens
Per-loop output tokens
Per-loop $ (standard rate)
Per 1,000 loops
Claude Sonnet 4.6 (typical agent)~28,000~3,500$0.137$137
Claude Opus 4.8 (deep agent)~28,000~3,500$0.228$228
Claude Haiku 4.5 (lightweight agent)~28,000~3,500$0.046$46
Claude Fable 5 (reasoning agent)~28,000~7,000 (incl. reasoning)$0.630$630
OpenAI gpt-5.5 (typical agent)~28,000~3,500$0.245$245
OpenAI gpt-5.4 (typical agent)~28,000~3,500$0.123$123
OpenAI gpt-5.4-mini (high-volume agent)~28,000~3,500$0.037$37
OpenAI o4-reasoning (research agent)~28,000~10,000 (incl. reasoning)$1.020$1,020
Google Gemini 2.5 Pro (typical agent)~28,000~3,500$0.070$70
Google Gemini 2.5 Flash (high-volume agent)~28,000~3,500$0.017$17
Gemini 2.5 Pro + 80% cached prefix~28,000 (5,600 base + 22,400 cached)~3,500$0.029 (caching saves 59%)$29
Claude Sonnet 4.6 + 80% cached prefix~28,000 (5,600 base + 22,400 cached)~3,500$0.076 (caching saves 45%)$76
OpenAI gpt-5.4 + 80% cached prefix~28,000 (5,600 base + 22,400 cached)~3,500$0.072 (caching saves 42%)$72

Sources, as of June 2026: model pricing from OpenAI (https://developers.openai.com/api/docs/pricing), Anthropic (https://claude.com/pricing), Google Gemini (https://ai.google.dev/gemini-api/docs/pricing). Per-loop token estimates from a typical 6-tool-call agent with a 2,000-token system prompt + tool definitions, 6 tool results averaging 800 tokens each, and 3,500 total output tokens including tool-call arguments. Cached prefix assumes the system prompt and tool definitions are cache-eligible and remain stable across loops; cache hits bill at roughly 10% of base input on Claude and OpenAI.

Why agents cost 10x what a single call costs

An agent loop is a sequence of LLM calls within a single user-facing request. Each turn passes the full conversation history — system prompt + tool definitions + every prior message + every prior tool result — back to the model as input. The history grows with each turn.

Worked decomposition for a typical 6-tool-call agent:

Turn 1: 2,000 token system prompt + 200 token user query → 2,200 input → 200 output (tool call request)

Turn 2: 2,200 + 200 (turn 1 output) + 800 (tool result) → 3,200 input → 200 output (next tool call)

Turn 3: 3,200 + 200 + 800 → 4,200 input → 200 output

Turn 4: 4,200 + 200 + 800 → 5,200 input → 200 output

Turn 5: 5,200 + 200 + 800 → 6,200 input → 200 output

Turn 6: 6,200 + 200 + 800 → 7,200 input → 200 output

Turn 7 (final answer): 7,200 + 200 + 800 → 8,200 input → 1,500 output (the answer to the user)

Total: input tokens summed across 7 turns = 36,400. Output tokens = 7 × ~300 = ~2,100 — though the final answer adds 1,500 more, so ~3,500 total output. The same task answered without an agent would cost ~2,200 input + 1,500 output = 3,700 total tokens. The agent costs roughly 11x more on input and 2.3x more on output.

Numbers above are rounded toward the table. Real loops vary based on tool result size, number of tools, and whether the model reasons aloud between tools.


Worked example 1: 100 agent loops at typical model tiers

Reference workload: 100 user requests, each spawning a 6-tool-call agent loop. Per-loop totals: ~28,000 input + ~3,500 output (rounded for table fit; matches the schema above).

Claude Sonnet 4.6: 100 × (28k × $3/1M + 3.5k × $15/1M) = 100 × ($0.084 + $0.053) = 100 × $0.137 = $13.70.

Claude Haiku 4.5: 100 × ($0.028 + $0.018) = 100 × $0.046 = $4.60.

OpenAI gpt-5.5: 100 × ($0.14 + $0.105) = 100 × $0.245 = $24.50.

OpenAI gpt-5.4-mini: 100 × ($0.021 + $0.016) = 100 × $0.037 = $3.70.

Google Gemini 2.5 Pro: 100 × ($0.035 + $0.035) = 100 × $0.070 = $7.00.

Google Gemini 2.5 Flash: 100 × ($0.0084 + $0.00875) = 100 × $0.017 = $1.74.

For 100 loops, the spread runs $1.74 (Gemini Flash) to $24.50 (gpt-5.5) — a 14x range on identical workload. Quality varies — Gemini Flash will fail more loops than Sonnet 4.6 on harder reasoning — but for high-volume simpler agent tasks the difference is real money.


Worked example 2: 1,000 loops/day with caching

Reference workload: 1,000 agent loops per day, system prompt + tool definitions (2,000 tokens) cached. Cache write paid once per cache window; the rest are cache reads.

Without caching (Claude Sonnet 4.6 @ standard): 1,000 loops × $0.137 = $137/day = ~$4,100/month.

With 80% input caching (system prompt + tool defs cached, conversation history not cached because it grows per loop): cached portion bills at $0.30/1M, uncached at $3/1M. Per loop: 22,400 cached × $0.30/1M = $0.0067 + 5,600 uncached × $3/1M = $0.017. Plus output unchanged at $0.053. Per loop: $0.076. Daily: $76. Monthly: ~$2,280. A 44% reduction.

Stack with the Batch API where eligible (offline analysis agents, not user-facing). 50% off both input and output on batched loops. If 30% of daily loops are batchable: 700 sync loops × $0.076 + 300 batch loops × $0.038 = $53.20 + $11.40 = $64.60/day. Monthly: ~$1,940. A 53% reduction overall.

Drop one tier: same 1,000 loops on Haiku 4.5 with caching: $0.013/loop × 1,000 = $13/day = ~$390/month. A 90% reduction from the uncached Sonnet baseline. Worth it only if eval shows Haiku matches required accuracy on this agent's tool-use pattern.

Audit the agent's per-loop cost early. Most teams discover their agents cost 5-10x more than projected; the fix is almost always caching + tier drop, not refactoring the framework.


Tool-call size: the single biggest cost lever

The factor most teams overlook is tool result size. A web search that returns 4,000 tokens of content costs more on every subsequent turn because each turn replays that result as input. A 6-tool loop with 4k-token results costs roughly 2.5x what a 6-tool loop with 800-token results costs.

Compress tool results before returning them to the model. Extract the relevant snippets, summarize long responses, trim verbose JSON. A web search tool that returns 'top 3 results, 150 words each' costs far less than one returning full page content — and usually gives better agent behavior because the model is not distracted by noise.

Limit tool count. Every tool definition in the system prompt costs input tokens on every loop. A 30-tool agent has ~6,000 tokens of tool definitions; a 5-tool agent has ~1,000. If you can scope the available tools per agent role, do it. The model also reasons better with fewer choices.

Use tool selection. Some frameworks (LangGraph, OpenAI Assistants) let you dynamically restrict the available tools per turn. Provide only the relevant subset based on context. Cuts input tokens and improves selection accuracy.

For prompt-quality strategies that produce tighter tool definitions, our code prompt builder helps compress technical schemas without losing precision.


Framework-specific cost gotchas

LangGraph: state passes through each node, growing as nodes append. If your state includes the full intermediate output of each tool, the input size compounds per turn. Use state trimming nodes that summarize old context before passing to the next node — a common pattern is summarizing turn-5+ context into a 500-token recap before turn 8.

Claude Agent SDK / Anthropic Tool Use: tool results are appended to the message history exactly as returned. Anthropic's prompt caching is well-suited for this pattern — mark the system prompt + tool definitions as cache-eligible and the conversation history grows on top of cached prefix. Typical savings: 40-60% on input across multi-turn agents.

OpenAI Assistants API: maintains conversation state server-side via thread + message objects. Convenient but billed identically to passing the history yourself — there is no magic. The Assistants API does support cached threads on long-running conversations.

AutoGen: multi-agent patterns (one model orchestrating other models) multiply costs by agent count. A 3-agent AutoGen team running 6 turns each = 18 LLM calls minimum. Use the smallest competent model for the worker agents and reserve the strong tier for the orchestrator.

CrewAI: similar multi-agent multiplier. Useful pattern: use Haiku 4.5 or gpt-5.4-mini for the worker agents (search, summarize, verify), Sonnet 4.6 or gpt-5.5 for the orchestrator. Total cost typically 3-5x a single-agent loop, not 10x.


Caching for agents: the canonical setup

Step 1: identify the stable portion of your agent prompt. System prompt, tool definitions, persona, and any reference documents that do not change across turns. This is the cache-eligible prefix.

Step 2: structure the message order so the stable prefix sits first. Conversation history and tool results come after. Variable user input comes last.

Step 3: enable caching. On Claude: add cache_control: {type: 'ephemeral'} to the last cacheable message block. On OpenAI: caching is opportunistic — long stable prefixes cache automatically as of June 2026. On Gemini: explicit context caching via the Caches API; cached content has a configurable TTL.

Step 4: measure the cache-hit rate. On Anthropic, the response includes usage.cache_read_input_tokens and usage.cache_creation_input_tokens. Aim for 70-90% cache hits on agent loops with stable prefixes.

Step 5: amortize cache writes. The first call to a new prefix bills at 1.25x base input (5-minute TTL) or 2x base input (1-hour TTL). It pays off after roughly 3 reads. For agents that loop many times within a single user session, this is trivial. For agents that fire once per user session, choose the 1-hour TTL to maximize hit rate across users in the same product flow.

Caching is the highest-impact lever on agent cost. Most teams that have not enabled it are paying 2-3x more than necessary.


Picking the right model tier for each agent role

Multi-agent setups benefit from mixed-tier deployment. Use a strong model only where it matters; cheap models everywhere else.

Orchestrator (the agent that plans tool calls and synthesizes the final answer): Claude Sonnet 4.6 or OpenAI gpt-5.5. The orchestrator's quality directly drives final answer quality. Do not skimp here.

Tool-use workers (agents that execute specific tools and return results): Claude Haiku 4.5 or OpenAI gpt-5.4-mini. These usually follow tight schemas (run this query, summarize this page, parse this JSON); the strong model is overkill.

Critic / verifier (agent that checks the orchestrator's work): Claude Sonnet 4.6. Quality matters here too; mistakes by the critic cascade.

Final-answer formatter: Claude Haiku 4.5 or gpt-5.4-mini. The orchestrator has already done the reasoning; the formatter just produces the response shape.

Worked math on a typical 4-agent setup (1 orchestrator + 2 workers + 1 critic) at 1,000 loops/day: all-Sonnet 4.6 ≈ $548/day. Mixed-tier (Sonnet + 2 Haiku + 1 Sonnet) ≈ $228/day — 58% cheaper at similar end-to-end quality on most workloads. The savings compound monthly.


Sub-agent delegation patterns: how to chain cheap and strong agents for 80% cost reduction

Single-agent loops hit a ceiling. Past 8-10 tool calls, the context window fills with stale tool results, the orchestrator's reasoning quality degrades, and the per-turn cost climbs quadratically because each new turn replays everything that came before. The fix that has emerged across 2026 production deployments is the orchestrator-worker pattern: one strong agent (Sonnet 4.6, gpt-5.5, or Opus 4.8) decides what work needs doing and delegates discrete tasks to a fleet of cheaper sub-agents (Haiku 4.5, gpt-5.4-mini, Gemini 2.5 Flash), each of which operates in its own fresh context window. The orchestrator never sees the raw tool output — only the worker's compressed summary. Done well, this cuts the bill 60-80% versus a single Sonnet loop at equal or better answer quality. Done badly, it triples the bill because every worker reload pays its own system-prompt tax.

Worked comparison on a research workload (find and synthesize five sources on a technical question). Single Sonnet 4.6 loop: 12 tool calls, ~62,000 cumulative input tokens, ~5,000 output. Bill: $0.261 per query. Orchestrator-worker version: Sonnet 4.6 orchestrator runs a 4-call planning loop (~12,000 input, 1,200 output = $0.054), spawns 5 parallel Haiku 4.5 search workers each with a 1,500-token scoped prompt and 3 tool calls returning a 400-token summary (~8,000 input + 600 output per worker × 5 = $0.032 + $0.006 = $0.038 total), then a final Sonnet 4.6 synthesizer takes the 5 summaries (~4,500 input + 1,500 output = $0.036). Grand total: $0.128 per query — a 51% cut. End-to-end latency drops too because the 5 workers run concurrently rather than sequentially in one loop.

The sub-agent count is a real tradeoff, not a free lever. Too few workers and the orchestrator still does most of the reasoning itself, which means strong-tier tokens get spent on grunt work; the cost barely moves. Too many workers and three problems compound: each worker pays its own ~1,500-token system-prompt-plus-tool-definitions setup cost (which is not amortized across the swarm), the orchestrator burns tokens reading and merging N summaries, and coordination failures (workers redoing the same work, missing the brief) drag down quality. The sweet spot for most production agents is 3-6 workers per orchestrator turn. Above 8 workers, the per-worker setup tax overtakes the tier-drop savings and the bill starts climbing again.

Map-reduce is the workhorse pattern when the input divides cleanly. The orchestrator partitions the work (5 documents, 12 log shards, 30 product reviews), spawns one cheap worker per chunk to extract or score, then merges the structured outputs. Cost profile: linear in chunk count, no history accumulation per worker because each worker sees only its chunk. Real numbers on a 30-document classification task: single Sonnet loop replaying all 30 docs in context = ~$0.84 per run; map-reduce with 30 Haiku workers + Sonnet merger = ~$0.19 per run, a 77% cut. Worth the orchestration code when chunk count exceeds 5 and chunks fit in worker context.

Critic-loop pairs a generator with a verifier. The generator (often cheap — Haiku 4.5 or gpt-5.4-mini) drafts an answer; the critic (strong — Sonnet 4.6 or Opus 4.8) inspects it for errors and either approves or returns specific corrections. Each loop costs the sum of one cheap call and one strong call, typically $0.04-$0.08 per iteration, and 1-3 iterations resolves most tasks. Net cost is comparable to a single Sonnet call but with measurably higher accuracy on tasks where mistakes are easy to spot but hard to avoid (code generation, structured extraction, factual claims). Skip this pattern when the critic cannot reliably distinguish good answers from bad — debugging a broken critic burns money without improving quality.

Planner-executor splits the strong-model reasoning from the bulk execution. A Sonnet 4.6 or Opus 4.8 planner produces a structured 5-15 step plan in one call ($0.02-$0.06), then a Haiku 4.5 or gpt-5.4-mini executor runs each step with tight scope and no need to re-plan. The executor never sees the full problem — only the current step plus relevant tool results — which keeps its context window small. Useful when steps are independent or only loosely coupled. Debate (N independent models propose answers, a judge picks the best) is the most expensive pattern in this family and worth the cost only when answer correctness has high downstream stakes (legal review, medical triage, financial decisions). Three-model debate at Sonnet 4.6 + Sonnet 4.6 + Opus 4.8 with an Opus 4.8 judge runs roughly $0.85 per decision — reserve for cases where a wrong answer costs much more than $0.85.

Decision rule: stay with a single-agent loop until you measure a concrete problem — context bloat past 40,000 tokens per loop, quality degradation past 8 tool calls, or per-loop cost above $0.20 on a high-volume workload. Then pick the pattern that matches the failure: map-reduce for cleanly chunked input, critic-loop for accuracy issues, planner-executor for long deterministic workflows, debate only when stakes justify it. The cost discipline that matters most is keeping every worker's prompt scoped tight enough that the per-worker setup tax stays under 25% of that worker's total token spend.


How to project agent cost before you build

Step 1: count tools. List the tools the agent will use. Typical agents use 3-10 tools; one tool definition is ~100-200 tokens depending on schema.

Step 2: estimate tool result sizes. Bytes-per-result and tokens-per-result. Web search ≈ 400-2,000 tokens. Database query ≈ 200-800 tokens. Code execution ≈ 100-500 tokens. Custom API ≈ 100-1,000 tokens depending on payload.

Step 3: estimate loop depth. How many tool calls before the agent reaches the final answer? Typical: 4-8 calls. Long-running research agents: 10-30 calls.

Step 4: sum input tokens across the loop. Start with system prompt + tool definitions (~2,000-6,000 tokens). Each turn adds its prior output (200-500 tokens) + tool result (200-2,000 tokens). After N turns, input cumulative ≈ N × N/2 × average_per_turn (the quadratic growth from history accumulation).

Step 5: multiply by daily loop volume and model rate. Compare against the cached version, batched version, and a tier-drop version. Pick the cheapest that holds quality.

If projected cost > $1,000/day at launch, run a cost optimization pass before launch, not after. Caching + tier drop + tool result compression typically cuts the bill 60-80% with negligible quality impact when done thoughtfully.

Frequently Asked Questions

How much does an AI agent cost per loop?

A typical 6-tool-call agent loop costs $0.02-$0.25 depending on model tier — roughly 10x the cost of a single direct-answer call. Caching can cut this 40-60%; batching can cut another 50% on top. Worked $ math for every major model is in the table above.

Why are agents so much more expensive than chat completions?

Because each turn replays the full conversation history (system prompt + prior messages + prior tool results) as input. After 6 tool calls, input tokens are 10-15x what they would be for a single direct-answer call. Caching the stable system prompt is the canonical fix.

Which model is cheapest for production agents in 2026?

Gemini 2.5 Flash at ~$0.017 per typical loop is the cheapest mainstream tier. Claude Haiku 4.5 at ~$0.046 is the cheapest among Anthropic models. gpt-5.4-mini at ~$0.037 is the cheapest OpenAI option. Match tier to required reasoning depth — most production agents do fine on the cheaper tiers if tools and prompts are well-structured.

How much does prompt caching save on agent loops?

40-60% on input bills when system prompt + tool definitions are cache-eligible and stable across loops. On a $137/day Sonnet 4.6 agent at 1,000 loops/day, caching drops the bill to ~$76/day — a $1,800/month savings. Higher cache-hit rates yield bigger savings.

Should I use LangGraph, Claude Agent SDK, or OpenAI Assistants?

Cost-wise they are similar — all bill on the underlying LLM calls. Choose by ecosystem fit: LangGraph for graph-based multi-agent orchestration, Claude Agent SDK for Anthropic-native tool use with caching, OpenAI Assistants for server-managed threads and integrated retrieval. Pricing differences are in the LLM, not the framework.

How do I cut my agent cost 50% this week?

Step 1: enable prompt caching on the stable system prompt + tool definitions (typically 40-60% input savings). Step 2: drop tool result sizes by summarizing or extracting before returning (typically 20-30% additional input savings). Step 3: drop one model tier on tool-execution sub-agents while keeping the orchestrator on a strong model. Combined: 50-70% savings on most agents.

What is the tool-call multiplier?

Roughly 10-15x more input tokens and 2-3x more output tokens than the same task answered without tools. Caused by the full conversation history replaying on every turn, plus per-turn output (tool call arguments) and tool results. Worked decomposition is in the 'Why agents cost 10x' section above.

Can I run multi-agent setups cheaply?

Yes — use a mixed-tier deployment. Strong model (Sonnet 4.6 or gpt-5.5) for the orchestrator and critic; cheap model (Haiku 4.5 or gpt-5.4-mini) for tool-execution workers and final-answer formatters. Typical savings: 50-60% vs an all-strong-tier setup at similar end-to-end quality.

How many sub-agents should an orchestrator spawn per turn?

3-6 workers is the sweet spot for most production agents. Below 3, the orchestrator still does most of the reasoning itself and the tier-drop savings are small. Above 8, each worker's ~1,500-token system-prompt-plus-tool-definitions setup cost stacks up faster than the cheap-tier savings can offset, and the bill starts climbing again. Coordination failures (workers redoing the same task, missing the brief) also rise with worker count.

When is the critic-loop pattern worth the extra LLM call?

When mistakes are easy for a strong model to spot but hard for the generator to avoid — code generation, structured extraction, factual claims, schema-bound output. A typical critic-loop runs $0.04-$0.08 per iteration and resolves in 1-3 iterations, comparable to a single Sonnet call but with measurably higher accuracy. Skip the pattern when the critic cannot reliably distinguish good from bad — a flaky critic burns money without improving quality.

How much can an orchestrator-worker pattern save versus a single Sonnet loop?

Typically 50-80% on research-style workloads where work divides cleanly. Worked example: a single Sonnet 4.6 research loop with 12 tool calls costs ~$0.26 per query; the orchestrator-worker version (Sonnet 4.6 planner + 5 parallel Haiku 4.5 search workers + Sonnet 4.6 synthesizer) costs ~$0.13 per query — a 51% cut, plus lower latency from parallel execution. Map-reduce on chunked input (e.g. 30 documents) can hit 75-80% savings.

Get the 2026 agent-cost cheat sheet

One-page PDF with per-loop $ math, the tool-call multiplier formula, and the caching/batching levers — free, no signup gate.

Browse all prompt tools →