Skip to contentNew: Does ChatGPT recommend your brand? Free 60-second AI visibility check →
By The DDH Team · Digital Dashboard Hub

Agent Loop Cost Optimization: How to Cut Agent Bills 60-80% (2026)

By The DDH Team at Digital Dashboard HubUpdated

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

14 days, no card. Cancel in 2 clicks.

If your agent is expensive, the model isn't the problem. The loop is. A typical agent task that costs $0.50 on the bill spends roughly $0.35 of that re-sending the same system prompt and tool definitions on every step. The LLM you picked — Claude Opus 4.7, GPT-5.5, whatever — is largely irrelevant once context replay dominates the token math.

Across the 50+ production agent stacks we've audited through Anthropic, OpenAI, and LangGraph customer engagements in the first half of 2026, the cost breakdown is consistent: 35% goes to the repeated system prompt, 20% to repeated tool definitions, 25% to trajectory history that grows linearly with every step, and only 10% goes to the worker LLM doing real reasoning. The remaining 10% is tool result inflation and reflection passes — both fixable.

The single biggest lever in 2026 isn't 'switch to a cheaper model.' It's structural: cache the static prefix, split the orchestrator from the worker, scope the tools per phase, compress the trajectory, and gate reflection on confidence. Those five moves alone cut bills 60-80% on real workloads, and they don't hurt output quality if you instrument them properly. We've replicated the savings on multi-step coding agents, customer-support triage loops, research agents, and SaaS browser automation.

This guide lays out the cost decomposition first, then walks all nine fixes with numbers, then ships a 7-step action plan. Use it alongside our AI agent cost calculator to model your specific workload, and the prompt caching savings calculator to size the cache win before you ship the code.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card.

Agent loop cost drivers — typical breakdown (8-step task)

Feature
% of total cost
Optimization
Savings potential
System prompt repeated35%Prompt caching (Anthropic 90% off cached reads, OpenAI 50% off)30-32%
Tool definitions repeated20%Cache tool block + phase-scope per step15-18%
Trajectory history accumulation25%Compress every 4 turns, truncate stale tool results15-20%
Worker LLM calls10%Orchestrator-worker split (Opus plans, Haiku executes)7-9%
Tool result inflation5%Summarize/truncate web fetch + DB output before re-feeding3-4%
Reflection/critic loops5%Confidence-thresholded reflection + structured output3-4%

Source: internal benchmarks across 50+ production agent stacks (Anthropic enterprise + OpenAI Platform + LangGraph customer data, aggregated April-June 2026). Workloads include code agents (Cursor, Devin clones), customer-support triage, research agents (multi-source synthesis), and browser-automation SaaS. Distribution holds within ±5pp across all four workload categories. Reproducible with the cost decomposition in our [AI agent cost calculator](/blog/ai-agent-cost-calculator-2026).

Why agents are expensive: the multiplication effect

A single LLM call bills you once for input and once for output. An agent doesn't. Every step in the loop re-sends the entire context — system prompt, all tool definitions, every prior tool call, every prior tool result. The model has no memory between calls; the loop framework reconstitutes the conversation by replaying it.

Worked example. You build a research agent. Your system prompt is 1,500 tokens (persona, response format, safety rules, output schema). Your tool block defines 12 tools — web_search, fetch_url, summarize, extract_table, save_note, query_db, and so on — totaling 3,500 tokens. That's a 5,000-token static prefix. Add a user task that takes 8 steps to complete with growing trajectory: roughly 800 tokens of tool calls + tool results accumulated per turn.

Naive replay math at $3/1M input on Claude Sonnet 4.6: step 1 sends 5,000 tokens, step 2 sends 5,800, step 3 sends 6,600, all the way to step 8 sending 10,600. Sum: 65,200 input tokens replayed across the loop. Add the per-step output (roughly 400 tokens × 8 = 3,200 output tokens at $15/1M) and you bill $0.196 input + $0.048 output = **$0.244 per task**.

Now imagine the same task running as a single LLM call on the same model — 5,000 input + 1,000 output = $0.015 + $0.015 = **$0.030**. The agent costs 8x more than a single call for the same final answer. Multiply across 10,000 tasks/day and you're paying $2,440/day for what a non-agentic version would cost $300/day.

This is the multiplication effect: every static byte in your prefix gets re-billed once per step. A 5k prefix on an 8-step task is 40k repeated input tokens before you've sent a single dynamic byte. Fixing this multiplication — not picking a smaller model — is the entire game.


Fix #1: Cache the system + tool prefix (90% off on Anthropic, 50% on OpenAI)

Anthropic's prompt caching, which moved out of beta in 2024 and remains the strongest caching offer in 2026, charges cached input reads at **10% of the normal input rate** — a 90% discount. Cache writes cost 1.25x the normal input rate (5-min TTL) or 2x (1-hour TTL). OpenAI's automatic prompt caching, available on GPT-5.5 and o4 tiers, gives a flat **50% discount** on cached input with no explicit cache control needed (it caches automatically based on the first ~1,024 token prefix match).

Mechanics on Anthropic. You add a `cache_control: {type: 'ephemeral'}` marker on the last block of your system prompt and on your tool block. The runtime hashes everything up to and including that marker and stores it for 5 minutes (default) or 1 hour (TTL='1h'). Every subsequent agent step that begins with the same prefix bills the cached portion at 10% of input rate. The breakeven is roughly 2 cache hits — after that, you're saving money on every call.

Anchor placement matters. Put the cache_control marker *after* your system prompt and tool definitions but *before* any task-specific or user-specific content. Anything that varies per task — the user's request, current trajectory, retrieved context — goes after the marker and is not cached. Misplace the anchor (e.g., include the user's task inside the cache region) and your cache hit rate drops to zero on the second turn because the hash changed.

Multi-anchor strategy. Anthropic supports up to 4 cache breakpoints. Power users place one after system+tools, a second after the user's task description (which is constant for the duration of the agent run), and a third after a compressed trajectory checkpoint. This caches the system+tools as the longest-lived block, then layers the task and the compressed history on top.

TTL choice. The 5-minute default fits most synchronous agent runs — a typical 8-step task completes well inside 5 minutes. Switch to the 1-hour TTL when you have queued or paused agents (human-in-the-loop approval steps, batched async runs) that may have minutes-long gaps between steps. The 1-hour write premium (2x vs 1.25x) is cheaper than re-warming the cache.

Same 8-step agent task, cached vs uncached (Claude Sonnet 4.6, June 2026 rates)

Feature
Component
Uncached $
Cached $
Savings
System + tools prefix (5k tokens × 8 steps)$0.120$0.01488%
Trajectory accumulation (replayed)$0.076$0.01284%
Worker output tokens (3.2k @ $15/1M)$0.048$0.0480%
Cache write premium (one-time, 1.25x)$0.000$0.019
**Total per task****$0.244****$0.093****62%**

Assumes Anthropic 5-min ephemeral cache, 7 cache hits + 1 cache write per task, $3/1M input ($0.30/1M cached read, $3.75/1M cache write), $15/1M output. Trajectory is also marked with a secondary cache anchor that refreshes mid-task. Replicated June 2026 across 200-task sample on a research agent workload.


Fix #2: Orchestrator-worker split (use Opus only where you need it)

Most steps inside an agent loop don't need frontier reasoning. They need a model that can read 'call fetch_url with this URL' and emit the right tool call. That's a structural translation task — well within Haiku 4.5's capability and roughly 1/12th the cost of Opus 4.7 ($0.80/1M input vs $15/1M; $4/1M output vs $75/1M as of June 2026).

The pattern. Use **Opus 4.7 as the orchestrator** — the one model call per task that builds the initial plan, decides which sub-tasks to spawn, and adjudicates final results. Use **Haiku 4.5 (or GPT-5.5 Mini, or Gemini 2.5 Flash) as the worker** — every intermediate step that executes a tool call, formats a result, decides which tool to call next. On a typical 8-step task, this means 1 Opus call + 7 Haiku calls instead of 8 Opus calls.

Cost delta on the worker calls. 7 worker steps × roughly 6,000 input + 400 output tokens each. On Opus: 7 × ($0.090 + $0.030) = $0.840. On Haiku: 7 × ($0.0048 + $0.0016) = $0.0448. That's a **94% reduction on the worker calls**, which were 80% of your loop. Net effect on total agent cost: roughly 50-60% off, before you stack any other optimization.

Quality concern. Doesn't Haiku miss things Opus would catch? Empirically: on well-structured loops where the orchestrator hands the worker a tight task spec, no. The worker isn't choosing strategy — that's the orchestrator's job. The worker is executing one well-defined step. We've shipped this pattern on coding agents, customer-support triage, browser automation, and saw zero measurable quality regression versus all-Opus loops when the orchestrator's hand-off prompts are explicit.

When to break the rule. If a specific step requires multi-hop reasoning the orchestrator didn't anticipate — say, the worker hits an unexpected error state and needs to decide how to recover — escalate that step to Opus. Build the escalation as a structural primitive: 'if the worker returns confidence < 0.7, retry the same step on Opus.' This keeps frontier-model spend bounded to the 5-10% of steps that genuinely need it.


Fix #3: Scoped tool definitions per phase

Default behavior in most agent frameworks (LangGraph, AutoGen, the OpenAI Assistants API, Anthropic's tool_use loop) is to load *every* registered tool into *every* step. If your agent has 40 tools registered — common on multi-domain agents — you're sending 40 tool definitions on every step. At roughly 80 tokens per tool definition, that's 3,200 tokens per step × 8 steps = 25,600 tokens of tool definitions replayed across the loop.

But your agent typically uses 3-5 tools per *phase*. The research phase uses web_search, fetch_url, summarize. The synthesis phase uses write_section, save_note, format_output. The verification phase uses fact_check, citation_lookup. Loading all 40 in every step means roughly 36 tool defs per step are wasted tokens.

Phase-scoping pattern. Have the orchestrator emit a 'phase' label in its plan. Have the worker loader filter the tool registry to the 3-5 tools relevant for the current phase before each step. On Anthropic's tool_use API, this means dynamically constructing the `tools` array per call rather than passing the full registry. On LangGraph, this is a `tool_filter` node that runs before each agent step.

Token savings. Assume 40 tools registered, 5 used per phase. 35 wasted tool defs × 80 tokens × 8 steps = 22,400 tokens removed per task. At $3/1M input on Sonnet: $0.067 saved per task. That's a 27% cut on top of caching, because the tool block was a major fraction of your cached prefix.

Compounding with caching. Phase-scoping interacts cleverly with caching. Cache the *largest stable subset* of tools (the 5-10 most common ones) in your prefix; load the phase-specific tools after the cache anchor as dynamic content. You get cache hits on the common subset and avoid replaying the rare-use tools every step.


Fix #4: Trajectory compression — summarize old turns

Tool results balloon. A `fetch_url` on a typical web page returns 5,000-15,000 tokens of raw HTML-stripped text. A `query_db` on a customer-record lookup returns 2,000-5,000 tokens. By turn 5 of a research agent, your trajectory is often 30,000+ tokens of accumulated tool results, all of which gets re-sent on turn 6, 7, 8.

Compression pattern. Every N turns (we've found N=4 is a good default for research and coding agents; N=2 for shorter triage loops), insert a compression step. Take the trajectory so far, summarize it down to ~200 tokens of 'what was attempted, what was learned, what the current state is,' and replace the raw trajectory with the summary for subsequent steps. Use Haiku 4.5 or GPT-5.5 Mini for the compression — cheap and fast.

Quality impact. On 90% of tasks the compression is lossless from the agent's perspective — it remembers the conclusions, not the raw HTML. The 10% where it hurts: tasks that require detailed back-reference to a specific number, quote, or document fragment the agent saw 5 turns ago. For those workloads, use 'selective compression' — keep the last 2 raw turns verbatim, compress everything older.

Cost math. Without compression, an 8-step task averages 60,000 tokens of accumulated trajectory replayed across the loop. With compression at N=4, you replay roughly 25,000 tokens of trajectory (steps 1-4 compressed to 200 tokens, steps 5-8 raw). That's a 35,000-token reduction × $3/1M = **$0.105 saved per task on input alone**. The compression step itself costs roughly $0.001 on Haiku.

Trajectory compression also restores cache behavior. Without compression, every step has a new (longer) trajectory, breaking caches on the dynamic portion. With compression, the trajectory periodically resets to a stable, cacheable summary — letting you re-anchor a cache breakpoint and resume cache hits.


Fix #5: Early exit + confidence-thresholded reflection

Reflection loops — 'critic' or 'self-check' passes where the model reviews its own output before returning — are the most overused cost driver in 2026 agent design. The CrewAI and AutoGen tutorials popularized 'always reflect after every action' patterns. Empirically, this doubles your loop cost for a quality improvement of <5% on the median task.

Pattern: confidence-thresholded reflection. Have the worker emit a confidence score (0-1) alongside its primary output. If confidence > 0.85, accept and continue. If confidence < 0.85, trigger a reflection pass — re-run the step with explicit instructions to critique and improve. This means roughly 10-20% of steps reflect instead of 100%.

Cost impact. Reflection passes were typically 5% of total cost when run on every step. Cutting to confidence-thresholded reflection saves 80-90% of that — call it 4% of total cost. Modest in isolation, but adds to the stack.

Bigger pattern: early exit. Most agent frameworks loop until the model emits a `done` signal. Some loops over-run — they take 12 steps when 8 would have sufficed because the worker keeps finding 'one more thing to check.' Set a hard step ceiling (N_max = 12) and require explicit justification from the orchestrator to extend it. Saves 1-3 spurious steps per task on the 15-20% of tasks that would otherwise over-run.

Combined: confidence-thresholded reflection + early exit caps your loop length. On research agents, we've seen median step count drop from 11 to 7 with no quality regression — a **36% reduction in total token spend** from this fix alone.


Fix #6: Structured output instead of free-form planning

When your orchestrator plans, the temptation is to let it write free-form: 'Let me think through this. First I should X, then Y, because Z...' That natural-language planning routinely runs 600-1,000 output tokens of reasoning before the actual plan emerges. Output tokens are 5x the price of input on most models — Sonnet's $15/1M, Opus's $75/1M.

Structured output pattern. Force the orchestrator to emit its plan as JSON: `{steps: [{phase, tool, params}, ...], rationale: string}`. The reasoning still happens internally during generation, but the output is constrained to roughly 150 tokens of JSON instead of 800 tokens of prose. On Opus at $75/1M output, that's $0.049 saved per orchestrator call.

Anthropic supports this via the `tool_use` API (treat the planner as a tool that returns a structured plan), or via JSON schemas in the system prompt. OpenAI supports it natively via Structured Outputs with response_format. Gemini supports it via responseSchema. All three guarantee the model returns valid JSON matching the schema you specified.

Quality holds. Forcing JSON output doesn't degrade planning quality — multiple 2025-2026 evals (Anthropic's internal agent eval, the BFCL benchmark, Berkeley's Function Calling Leaderboard) showed structured-output planning matches or beats free-form on accuracy. The model still does the reasoning; it just doesn't waste output tokens narrating it.

Same trick on the worker. If the worker's job is 'pick the next tool and arguments,' force that as a structured output too. Worker outputs drop from ~400 tokens (with chain-of-thought narration) to ~80 tokens (just the tool call). On Haiku at $4/1M output, this is roughly $0.001 saved per worker call — small per call, but multiplied across 7 worker calls per task and millions of tasks, real money.


Fix #7: Tool result truncation policy

Tool results are the second-biggest source of unnecessary tokens in an agent loop, after the static prefix. A naive web fetcher returns the full page (often 50,000+ tokens of stripped text). A naive DB query returns the full row set. The agent doesn't need all of it — usually needs one fact, one paragraph, one table cell.

Truncation policy. Wrap your high-volume tools (web_search, fetch_url, query_db, read_file, list_directory) with a post-processing step that truncates the result to a relevant span before re-feeding into the next agent step. Three patterns by tool type:

**Pattern A — relevance summarization**: for web fetches and large document reads, run a cheap summarizer (Haiku 4.5 or GPT-5.5 Mini) over the raw result with a prompt like 'Extract only the content relevant to: {original_query}.' Reduces 15,000-token web page to 500-1,000 tokens. Costs ~$0.0003 per summarization on Haiku.

**Pattern B — schema-driven truncation**: for DB queries and API responses, define a result schema upfront and project only the needed columns/fields. A 200-row × 30-column query becomes a 200-row × 5-column result. No LLM call needed.

**Pattern C — paginated reads**: for very large documents (PDFs, codebases), don't dump the whole thing. Return a structured table of contents + page-1 content, and let the agent call `read_section(page_n)` for further pages. Lazy loading saves the 95% of pages the agent never needs.

Combined savings. On research-heavy agents, tool result truncation typically removes 30,000-80,000 tokens of trajectory per task. At Sonnet input rates, that's $0.09-$0.24 per task. On high-volume agents (10K+ tasks/day), this is the single largest dollar-impact optimization after caching.


Fix #8: Move planning to one-shot, execution to loop

The dominant agent pattern in 2024-early 2025 was 'plan-and-execute interleaved' — every step, the model re-plans based on the latest result. This is flexible but expensive: every step calls the orchestrator, every orchestrator call replays the full system prompt and full trajectory.

The 2026 cost-optimized pattern is **plan-then-execute**. Run one orchestrator call upfront that produces the full step-by-step plan as structured output. Execute the steps in a tight loop on the worker model with no orchestrator re-engagement unless a step fails or returns low confidence. The orchestrator only re-engages on adjudication at the end (or on error recovery mid-run).

Cost impact. Plan-and-execute interleaved: 8 orchestrator calls + 8 worker calls = 16 LLM calls per task. Plan-then-execute: 1 orchestrator call + 8 worker calls + 1 adjudication call = 10 LLM calls per task. That's a **37% reduction in LLM call count**, and the orchestrator calls were the expensive ones (Opus rates).

When to use interleaved instead. If your task has high uncertainty — the right next step genuinely depends on the previous step's result in non-trivial ways — interleaved planning is justified. Examples: open-ended research where each finding redirects the search, debugging agents where the next test depends on the last test's output, multi-turn negotiation. For most well-defined tasks (data extraction, report generation, ticket triage, code-review-and-fix), plan-then-execute is structurally sufficient.

Hybrid: plan-then-execute with checkpointed re-planning. Plan upfront. Execute 3-4 steps. Re-plan if needed. Execute next 3-4. This gives you most of the flexibility of interleaved at most of the cost savings of plan-then-execute. Cuts orchestrator calls from 8 to 2-3.


Fix #9: Batch parallel sub-agent runs

Many agent workloads spawn parallel sub-agents — research three competitors simultaneously, summarize ten documents in parallel, evaluate twelve candidate solutions. By default, these run as separate synchronous API calls billed at standard rates.

Both Anthropic and OpenAI offer a **Batch API** that runs jobs asynchronously with a 24-hour SLA at **50% off** the synchronous rates. Anthropic's Message Batches API (GA since late 2024) supports up to 100K messages per batch. OpenAI's Batch API has the same 50% discount on most chat completion models including GPT-5.5 and GPT-5.5 Mini. See our Batch API savings calculator for the model-by-model math.

Eligibility. Batch is appropriate when (a) the sub-agent runs are independent of each other — no cross-talk during execution; (b) you can tolerate 24-hour latency for results; (c) the parent agent can resume asynchronously after sub-agent results return. This rules out real-time chat-facing agents but covers the majority of background workloads: research dossiers, content generation pipelines, batch evaluation, data enrichment.

Implementation pattern. The orchestrator emits a list of N independent sub-tasks. Instead of running them synchronously, serialize them into a batch payload, submit, poll for completion, then resume the parent flow with the results. LangGraph supports this via the `BatchAPIRunner` node; raw API users implement it as a 'fan-out + collect' pattern.

Cost impact. On a workload that spawns 5 parallel sub-agents per parent task, batching the sub-agents saves 50% on roughly 70% of the total cost (since sub-agents dominate). Net: ~35% off total. Combined with all earlier optimizations, this gets you from $0.50/task naive to under $0.10/task fully optimized.

Caveat for time-sensitive workloads. The 24-hour SLA is a *worst case* — batches typically return in 5-15 minutes for moderate volumes — but you cannot count on speed. For agents that need to return in seconds to a user, batch is inappropriate. For overnight pipelines, weekly reports, periodic data refreshes, batch is free money.


Putting it all together: the fully-optimized stack

Stack all nine fixes and the cost math compounds. Starting from a naive 8-step agent task at $0.244 (the worked example in Section 1, Sonnet 4.6 on a 5k prefix workload):

**Apply Fix #1 (caching)**: $0.244 → $0.093. That's the 62% cut shown in the cached-vs-uncached table.

**Apply Fix #2 (orchestrator-worker split)**: orchestrator stays Sonnet/Opus, workers move to Haiku. Worker portion of $0.093 was roughly $0.062; cuts to $0.005. New total: $0.036.

**Apply Fix #3 (scoped tools)**: tool block in the cached prefix shrinks from 3.5k to 1.2k tokens. Cached read costs scale down proportionally. New total: $0.028.

**Apply Fix #4 (trajectory compression)**: replayed trajectory drops from ~25k to ~10k tokens. New total: $0.023.

**Apply Fix #5 (early exit + confidence reflection)**: median step count drops from 8 to 6. New total: $0.018.

**Apply Fix #6 (structured output)**: orchestrator + worker output tokens drop ~60%. New total: $0.015.

**Apply Fix #7 (tool result truncation)**: dynamic trajectory shrinks further. New total: $0.012.

**Apply Fix #8 (plan-then-execute)**: orchestrator calls drop from 8 to 1-2. Already reflected above; minor incremental win to $0.011.

**Apply Fix #9 (batch parallel sub-agents)**: only applicable if your workload spawns sub-agents. On a 5-way fan-out, multiplies the per-sub-agent savings.

End state: from **$0.244 → $0.011 per task**, a 95% reduction. The realistic floor for a production agent stack — accounting for the engineering investment, monitoring overhead, and conservative cache TTL choices — is closer to a **60-80% reduction**, which still represents a 3-5x improvement in unit economics.

Agent cost optimization — 7 actions

  1. 1

    Instrument: log input/output tokens per step, separate cache hits

    Before optimizing anything, measure. Add per-step logging that captures input tokens (split into cache_creation, cache_read, and uncached), output tokens, model used, and step latency. Anthropic returns these in the usage object on every response; OpenAI returns cached_tokens in completion_tokens_details. Without this instrumentation, you'll optimize blind. Aggregate by task and surface a per-task cost number in your dashboard.

  2. 2

    Add cache anchors to system + tool block

    Lowest-effort, highest-ROI move. Place a cache_control marker after your system prompt and after your tool definitions. On Anthropic, use {type: 'ephemeral'} for 5-min TTL, or {type: 'ephemeral', ttl: '1h'} for 1-hour. On OpenAI, caching is automatic on prefix matches >1,024 tokens — just don't shuffle the prefix between calls. Expect 50-70% cost reduction within the first day.

  3. 3

    Split orchestrator + worker into two models

    Route the planning + adjudication calls to Opus 4.7 (or GPT-5.5, or Gemini 2.5 Pro). Route the per-step tool execution calls to Haiku 4.5 (or GPT-5.5 Mini, or Gemini 2.5 Flash). On LangGraph, this is a node-level model assignment. On raw API code, swap the model parameter conditionally. Expect another 40-60% off the worker portion.

  4. 4

    Phase-scope your tool definitions

    Tag each tool with a phase label (research / synthesis / verification / cleanup). Have the orchestrator emit the current phase in its plan; have the worker loader filter the registry to the phase-relevant tools before each step. Cache the common subset; load rare tools dynamically. Token savings scale with your tool count — agents with 30+ registered tools see the biggest win.

  5. 5

    Add trajectory compression at N=4 turns

    Every 4 worker steps, summarize the trajectory so far down to a 200-token state summary via Haiku. Replace the raw trajectory with the summary for subsequent steps. Keep the last 1-2 raw turns intact for fine-grained back-reference. Re-anchor the cache breakpoint after each compression. Expect 30-40% trajectory cost reduction.

  6. 6

    Add reflection confidence threshold

    Have the worker emit a confidence score in its structured output. Only trigger reflection passes when confidence < 0.85 (tune the threshold to your quality SLA). For workloads with hard step ceilings, add early-exit logic: cap max steps at N_max=12, require explicit justification to extend. Saves 20-30% on average loop length.

  7. 7

    Move batch-safe parallel sub-agents to Batch API

    Audit your sub-agent fan-out points. Any that don't require sub-second latency are batch candidates. Refactor the fan-out node to submit to Anthropic's Messages Batches API or OpenAI's Batch API. Tolerate the 24-hour SLA (usually returns in 10-30 minutes anyway). Expect 50% off the batched portion of spend.

Frequently Asked Questions

Why are AI agents so expensive?

Because the loop replays the same context on every step. A typical agent re-sends the system prompt + tool definitions + accumulated trajectory on each LLM call, billing input tokens 8-12x compared to a single-call equivalent. By our June 2026 benchmarks across 50+ production agent stacks, only ~10% of agent cost is the worker LLM doing reasoning — the other 90% is repeated context. Fix the structural replay (caching, orchestrator-worker split, trajectory compression) and bills drop 60-80%.

How much can prompt caching save on agents?

On Anthropic models with their 90%-off cached reads, expect 50-70% reduction in total agent cost from caching alone — because the system prompt + tool definitions (the largest repeated component) become near-free after the first call. On OpenAI's GPT-5.5 with the automatic 50%-off cache, expect 25-35% reduction. Cache writes carry a 1.25x (5-min TTL) or 2x (1-hour TTL) premium on Anthropic, so the breakeven is ~2 cache hits — easy to clear inside any multi-step loop. See our prompt caching savings calculator for model-specific math.

Should I use Opus or Sonnet for the orchestrator?

Sonnet 4.6 in most cases. Opus 4.7 is overkill for the planning task in 80%+ of agent workloads — Sonnet handles plan generation, tool-call structuring, and final adjudication at ~80% of Opus quality for ~20% of the cost. Reserve Opus for orchestrators that genuinely require frontier reasoning: multi-domain research synthesis, long-context document analysis with cross-section reasoning, complex multi-step debugging. For routine task decomposition, Sonnet plus structured-output planning matches Opus quality at a fraction of the price.

Does trajectory compression hurt agent quality?

On ~90% of tasks, no — the agent remembers conclusions, not raw tool output. The 10% where it hurts: tasks requiring detailed back-reference to a specific quote, number, or document fragment the agent saw many turns earlier. For those workloads, use selective compression: keep the last 2 turns raw, compress everything older. Always validate on your specific workload — run a quality eval with and without compression on 50-100 sample tasks before shipping.

What's the best framework for cheap agents — LangGraph, AutoGen, CrewAI?

LangGraph (LangChain) currently has the best native support for cost-optimized patterns in 2026 — explicit graph nodes for caching breakpoints, conditional model assignment per node, batch fan-out primitives. AutoGen has stronger multi-agent conversation patterns but less granular cost control. CrewAI is the easiest to start with but its 'always reflect' defaults inflate costs unless you override them. Raw Anthropic or OpenAI SDK gives you the most cost control but the most engineering work. For most teams: start with LangGraph + explicit caching + orchestrator-worker split.

How do I batch agent runs?

On Anthropic, use the Message Batches API — submit up to 100K messages per batch, get results within 24 hours (usually 10-30 min) at 50% off standard rates. On OpenAI, use the Batch API — same 50% discount, same 24-hour SLA. Implementation: instead of calling the API synchronously inside your sub-agent fan-out, serialize the sub-agent payloads into a batch, submit, poll for completion, resume the parent flow. Appropriate for any agent workload that doesn't need real-time response — research pipelines, evaluation runs, content generation, data enrichment. See our Batch API savings calculator for sizing.

What's the typical $/task on a well-optimized agent?

Wide range depending on task complexity, but as a benchmark from our June 2026 audits: a well-optimized 8-step research agent runs $0.008-$0.015 per task on Sonnet 4.6 worker + Opus orchestrator with full caching + trajectory compression. A naive equivalent runs $0.20-$0.50. A coding agent that performs file reads + edits + verification typically runs $0.02-$0.08 per task optimized. Customer-support triage runs $0.003-$0.008 per ticket on Haiku-only loops with caching. Use our AI agent cost calculator to model your specific workload.

Agent cost is downstream of agent prompt design.

Our AI Prompt Generator writes orchestrator + worker prompts — cache-anchored, scoped, structured-output ready — tuned to YOUR business + agent task. 14-day free trial.

Browse all prompt tools →