Single-agent loops hit a ceiling. Past 8-10 tool calls, the context window fills with stale tool results, the orchestrator's reasoning quality degrades, and the per-turn cost climbs quadratically because each new turn replays everything that came before. The fix that has emerged across 2026 production deployments is the orchestrator-worker pattern: one strong agent (Sonnet 4.6, gpt-5.5, or Opus 4.8) decides what work needs doing and delegates discrete tasks to a fleet of cheaper sub-agents (Haiku 4.5, gpt-5.4-mini, Gemini 2.5 Flash), each of which operates in its own fresh context window. The orchestrator never sees the raw tool output — only the worker's compressed summary. Done well, this cuts the bill 60-80% versus a single Sonnet loop at equal or better answer quality. Done badly, it triples the bill because every worker reload pays its own system-prompt tax.
Worked comparison on a research workload (find and synthesize five sources on a technical question). Single Sonnet 4.6 loop: 12 tool calls, ~62,000 cumulative input tokens, ~5,000 output. Bill: $0.261 per query. Orchestrator-worker version: Sonnet 4.6 orchestrator runs a 4-call planning loop (~12,000 input, 1,200 output = $0.054), spawns 5 parallel Haiku 4.5 search workers each with a 1,500-token scoped prompt and 3 tool calls returning a 400-token summary (~8,000 input + 600 output per worker × 5 = $0.032 + $0.006 = $0.038 total), then a final Sonnet 4.6 synthesizer takes the 5 summaries (~4,500 input + 1,500 output = $0.036). Grand total: $0.128 per query — a 51% cut. End-to-end latency drops too because the 5 workers run concurrently rather than sequentially in one loop.
The sub-agent count is a real tradeoff, not a free lever. Too few workers and the orchestrator still does most of the reasoning itself, which means strong-tier tokens get spent on grunt work; the cost barely moves. Too many workers and three problems compound: each worker pays its own ~1,500-token system-prompt-plus-tool-definitions setup cost (which is not amortized across the swarm), the orchestrator burns tokens reading and merging N summaries, and coordination failures (workers redoing the same work, missing the brief) drag down quality. The sweet spot for most production agents is 3-6 workers per orchestrator turn. Above 8 workers, the per-worker setup tax overtakes the tier-drop savings and the bill starts climbing again.
Map-reduce is the workhorse pattern when the input divides cleanly. The orchestrator partitions the work (5 documents, 12 log shards, 30 product reviews), spawns one cheap worker per chunk to extract or score, then merges the structured outputs. Cost profile: linear in chunk count, no history accumulation per worker because each worker sees only its chunk. Real numbers on a 30-document classification task: single Sonnet loop replaying all 30 docs in context = ~$0.84 per run; map-reduce with 30 Haiku workers + Sonnet merger = ~$0.19 per run, a 77% cut. Worth the orchestration code when chunk count exceeds 5 and chunks fit in worker context.
Critic-loop pairs a generator with a verifier. The generator (often cheap — Haiku 4.5 or gpt-5.4-mini) drafts an answer; the critic (strong — Sonnet 4.6 or Opus 4.8) inspects it for errors and either approves or returns specific corrections. Each loop costs the sum of one cheap call and one strong call, typically $0.04-$0.08 per iteration, and 1-3 iterations resolves most tasks. Net cost is comparable to a single Sonnet call but with measurably higher accuracy on tasks where mistakes are easy to spot but hard to avoid (code generation, structured extraction, factual claims). Skip this pattern when the critic cannot reliably distinguish good answers from bad — debugging a broken critic burns money without improving quality.
Planner-executor splits the strong-model reasoning from the bulk execution. A Sonnet 4.6 or Opus 4.8 planner produces a structured 5-15 step plan in one call ($0.02-$0.06), then a Haiku 4.5 or gpt-5.4-mini executor runs each step with tight scope and no need to re-plan. The executor never sees the full problem — only the current step plus relevant tool results — which keeps its context window small. Useful when steps are independent or only loosely coupled. Debate (N independent models propose answers, a judge picks the best) is the most expensive pattern in this family and worth the cost only when answer correctness has high downstream stakes (legal review, medical triage, financial decisions). Three-model debate at Sonnet 4.6 + Sonnet 4.6 + Opus 4.8 with an Opus 4.8 judge runs roughly $0.85 per decision — reserve for cases where a wrong answer costs much more than $0.85.
Decision rule: stay with a single-agent loop until you measure a concrete problem — context bloat past 40,000 tokens per loop, quality degradation past 8 tool calls, or per-loop cost above $0.20 on a high-volume workload. Then pick the pattern that matches the failure: map-reduce for cleanly chunked input, critic-loop for accuracy issues, planner-executor for long deterministic workflows, debate only when stakes justify it. The cost discipline that matters most is keeping every worker's prompt scoped tight enough that the per-worker setup tax stays under 25% of that worker's total token spend.