Skip to content
LLM economics · Cost engineering · Production

LLM Cost Engineering 2026: Token Economics + The 7 Levers That Cut Production Spend 60-90%

LLM bills hit $50K/month at scale before teams notice. The 7 cost levers: model right-sizing, prompt caching, structured output, retrieval-not-context, batching, semantic cache, model cascade. Math for each + tool comparison.

By Andy Gaber, Founder, Digital Dashboard HubUpdated

Per OpenAI's pricing at platform.openai.com, Anthropic's pricing at anthropic.com, Google's Gemini pricing at ai.google.dev, Helicone at helicone.ai, Langfuse at langfuse.com, and Vercel's AI Gateway at vercel.com, production LLM spend is the fastest-growing line item in most AI-powered software companies. Untouched, monthly LLM cost typically grows 30-100% per quarter as usage scales.

Per arxiv research on LLM cost optimization at arxiv.org, most production LLM systems run at 5-10× higher cost than necessary. The optimization opportunity is real + the levers are well-documented. The reason most teams don't apply them: cost engineering is invisible until the bill is already large.

Below: the 7 cost levers, math for each, and the order of application (highest-leverage first). Sources include OpenAI pricing at openai.com, Anthropic pricing at anthropic.com, Google Gemini at ai.google.dev, Helicone at helicone.ai, Langfuse at langfuse.com, Vercel AI Gateway at vercel.com, OpenRouter at openrouter.ai, and arxiv at arxiv.org.

7 LLM cost levers — savings + when each applies

Feature
Typical savings
When applies
Cost-engineering tool
1. Model right-sizing40-70%Workloads using frontier where smaller would sufficeHelicone, Langfuse for usage profiling
2. Prompt caching50-80% on cached prefixShared prefixes >1024-4096 tokensProvider APIs (OpenAI, Anthropic, Google)
3. Retrieval depth (top-5 not top-50)60-90% on context tokensRAG workflowsCohere re-ranker, Pinecone
4. Structured output (constrained decoding)Eliminates retry cost ~15-40%JSON output workflowsOpenAI / Anthropic / Google structured APIs
5. Batch API50%Async-tolerant workloadsOpenAI Batch, Anthropic Batch
6. Semantic cache30-70%High-repetition customer queriesRedis, Pinecone, pgvector
7. Model cascade10-30%Mixed-complexity queriesVercel AI Gateway, OpenRouter

Pricing references per [OpenAI at openai.com](https://openai.com/api/pricing/), [Anthropic at anthropic.com](https://www.anthropic.com/pricing), [Google Gemini at ai.google.dev](https://ai.google.dev/pricing). Observability + cost-engineering platforms: [Helicone at helicone.ai](https://helicone.ai/), [Langfuse at langfuse.com](https://langfuse.com/), [Vercel AI Gateway at vercel.com](https://vercel.com/), [OpenRouter at openrouter.ai](https://openrouter.ai/). Research per [arxiv at arxiv.org](https://arxiv.org/).

Lever 1 — Model right-sizing (40-70% potential savings)

**The mechanic:** Frontier models (GPT-4o, Claude Opus, Gemini Pro) cost 10-50× more per token than capable smaller models (GPT-4o-mini, Claude Haiku, Gemini Flash). Per OpenAI pricing at openai.com and Anthropic pricing at anthropic.com, most production workloads use frontier models for tasks that smaller models handle equivalently.

**The audit:** Sample 100 production LLM calls. For each, ask: does this genuinely need the frontier model? Per Helicone at helicone.ai and Langfuse at langfuse.com, most teams find 50-80% of calls could use the cheaper tier with equal quality.

**The math:** A workload at $20K/month on frontier where 60% moves to small-model tier = $20K × 60% × (1 - 1/15) = ~$11K/month savings. Single largest cost lever for most LLM products.

**The implementation:** Per Vercel AI Gateway at vercel.com and OpenRouter at openrouter.ai, provider abstraction layers let you swap models per-request based on task complexity. Eval-set verification on the cheaper tier before promoting.


Lever 2 — Prompt caching (50-80% savings on cached prefix tokens)

**The mechanic:** Per Anthropic prompt caching at docs.anthropic.com and OpenAI prompt caching at platform.openai.com, providers cache prompt prefixes. Cached tokens cost 10-50% of normal input tokens.

**When it applies:** Workloads with shared prompt prefixes (long system prompts, RAG with same retrievals, few-shot examples). Per Google Gemini caching at ai.google.dev, minimum 1024-4096 tokens prefix for cache eligibility.

**The math:** Workload with 5000-token system prompt + 500-token user message at $3/M input = $16.50/1000 calls. With prompt caching: cached prefix $1.50 + uncached user $1.50 = $3.00/1000 calls. 82% savings on the cached portion.

**The break-even:** 2-3 cache hits per write.


Lever 3 — Retrieve, don't include (60-90% savings on context-heavy workloads)

**The mechanic:** Workloads with long retrieved contexts often include too much. Per Anthropic at docs.anthropic.com, a 50-document retrieval where 5 documents would suffice wastes 90% of input tokens.

**The audit:** For RAG workflows, measure: how often does the LLM cite documents beyond the top 5 retrieved? Per Pinecone at pinecone.io, typically <10% of citations come from documents 6-50. Retrieving fewer documents saves cost without quality loss.

**The implementation:** Better re-ranking before injection. Per Cohere at docs.cohere.com, LLM-based re-rankers + retrieving top-5 outperform vanilla top-50 retrieval at lower cost.

**The math:** Workload retrieving 50 × 500-token documents = 25,000 input tokens per query. Re-rank to top-5 = 2,500 input tokens. 90% input-token reduction.


Lever 4 — Structured output (eliminates retry cost)

**The mechanic:** Per OpenAI structured outputs at platform.openai.com, Anthropic tool use at docs.anthropic.com, and Google Gemini structured at ai.google.dev, provider-enforced JSON schema eliminates parse failures + retry loops.

**The hidden cost:** Workloads using prompt-only 'respond in JSON' fail 15-40% of the time + retry. Each retry doubles the cost of that interaction. Constrained decoding solves this at the API level — no retries needed for format failures.

**The math:** 20% retry rate at $0.01 per call = effective cost $0.012 per call (20% extra). Switch to constrained decoding = $0.01 per call + zero format-failure retries. 17% cost reduction on retry-heavy workloads.


Lever 5 — Batching (40-80% savings on async-tolerant workloads)

**The mechanic:** Per OpenAI Batch API at platform.openai.com and Anthropic's batch processing at docs.anthropic.com, batch APIs charge 50% of normal pricing for requests with 24-hour completion windows.

**When it applies:** Background workloads tolerant of asynchronous processing. Content generation, summarization, classification of historical data, eval runs.

**The math:** A workload spending $10K/month on real-time inference where 60% could shift to batch = $10K × 60% × 50% reduction = $3K/month savings.

**The implementation:** Per OpenAI Batch documentation at platform.openai.com, upload JSONL of requests, retrieve results within 24 hours. Per Anthropic batch at docs.anthropic.com, similar pattern.


Lever 6 — Semantic cache (30-70% query elimination)

**The mechanic:** Per Redis semantic caching at redis.io and Pinecone at pinecone.io, embed incoming user query, compare to cached query embeddings, serve cached response on >0.95 similarity.

**When it applies:** High-volume customer-facing workloads with significant query repetition (FAQ-style chatbots, product Q&A, support routing).

**The math:** 50% cache hit rate eliminates 50% of LLM calls entirely. Per Helicone at helicone.ai, this is the highest-leverage savings for FAQ-style workloads — cache hits cost ~$0 vs. full LLM call cost.

**The implementation cost:** Vector DB for semantic cache (Pinecone, Redis with vector module, pgvector) + embedding computation per query. Small relative to savings.


Lever 7 — Model cascade (10-30% savings on routing tasks)

**The mechanic:** Per arxiv research on model cascades at arxiv.org, route simple queries to cheap models + complex queries to expensive models. Confidence threshold determines escalation.

**Implementation:** First-pass small model. If confidence < threshold, retry on larger model. Per Vercel AI Gateway at vercel.com, this pattern can be implemented at the gateway level.

**The math:** Workload with 70% of queries handleable by cheap tier + 30% needing escalation = 70% × cheap cost + 30% × (cheap + expensive cost) ≈ 60% of pure-expensive cost.

**The trade-off:** Adds latency (escalations are 2× LLM calls). Worth it for cost-sensitive workloads where latency is forgiving.

No LLM cost engineering applied: Frontier models for everything. No prompt caching. 50-doc retrievals where 5 would suffice. Prompt-only JSON with 20% retry rate. Real-time everything. No semantic cache. No model cascading. LLM spend grows 30-100% per quarter as usage scales.
7 levers applied (highest-leverage first): Model right-sized to task. Prompt caching on shared prefixes. Top-5 retrieval not top-50. Constrained-decoding structured output. Async batch for non-real-time. Semantic cache catches duplicates. Model cascade for cost-sensitive paths. 60-90% spend reduction at no quality loss.

Apply the 7 cost levers (highest-leverage first)

  1. 1

    Sample 100 calls + identify model right-sizing opportunities

    Per Helicone at helicone.ai and Langfuse at langfuse.com, most teams find 50-80% of frontier-model calls could use cheaper tier. Eval-set verification on cheaper tier before promote. Largest single cost lever.

    → Open the Code Prompt Builder
  2. 2

    Enable prompt caching on workloads with shared prefixes

    Per Anthropic prompt caching at docs.anthropic.com, OpenAI at platform.openai.com, and Google Gemini at ai.google.dev, 50-80% savings on cached prefix tokens. Zero quality impact.

  3. 3

    Audit retrieval depth + structured-output usage

    Per Pinecone at pinecone.io and OpenAI structured outputs at platform.openai.com, retrieve top-5 not top-50 (use re-ranker). Use constrained decoding not 'respond in JSON' prompts.

  4. 4

    Add batching, semantic cache, model cascade as workload-fit

    Per OpenAI Batch at platform.openai.com, Anthropic batch at docs.anthropic.com, and Vercel AI Gateway at vercel.com, apply remaining levers based on workload signatures. Batch for async-tolerant. Semantic cache for high-repetition queries. Model cascade for cost-sensitive paths.

Where to start the LLM cost work

If you don't know your monthly LLM spend: Install Helicone at helicone.ai or Langfuse at langfuse.com for usage visibility. Per arxiv research at arxiv.org, can't optimize what isn't measured.

If you're spending $5K+ /month on frontier models: Model right-sizing audit. Per Helicone at helicone.ai, 50-80% of calls typically move to smaller tier with equal quality. Largest single lever.

If you have long shared system prompts: Per Anthropic prompt caching at docs.anthropic.com and OpenAI at platform.openai.com, enable prompt caching. 50-80% savings on prefix tokens. Zero quality impact.

If you have high-volume FAQ-style customer queries: Per Redis semantic caching at redis.io and Pinecone at pinecone.io, add semantic cache. 30-70% cache hit rate eliminates that fraction of LLM calls entirely. The Code Prompt Builder helps design prompts that cache cleanly across this stack.

Frequently Asked Questions

What's the largest LLM cost lever?

Model right-sizing. Per Helicone at helicone.ai and Langfuse at langfuse.com, 50-80% of production LLM calls typically use frontier models where smaller models (10-50× cheaper) would handle the task equivalently. The audit + tier-shift typically delivers 40-70% spend reduction with no quality loss.

Does prompt caching actually save money?

Yes substantially. Per Anthropic prompt caching at docs.anthropic.com, OpenAI at platform.openai.com, and Google Gemini at ai.google.dev, cached prompt prefix tokens cost 10-50% of normal input rates. Break-even is 2-3 cache hits per write. Workloads with shared system prompts, RAG retrievals, or few-shot examples typically see 50-80% savings on cached prefix portion.

When should I use batch API?

Per OpenAI Batch at platform.openai.com and Anthropic batch at docs.anthropic.com, use batch for async-tolerant workloads (24-hour processing acceptable). 50% pricing reduction. Fits background workloads: content generation, summarization, historical-data classification, eval runs. Real-time customer-facing calls aren't batch-tolerant.

How much can semantic caching actually save?

Per Redis at redis.io and Helicone at helicone.ai, high-volume FAQ-style workloads typically see 30-70% LLM call elimination via semantic cache. Cache hits skip the LLM entirely — closest to free LLM responses you can get. Quality risk is real; threshold tuning at ~0.95 similarity is the production pattern.

What's a model cascade?

Per arxiv research at arxiv.org and Vercel AI Gateway at vercel.com, route simple queries to cheap models first; escalate to expensive models only when first-pass confidence is low. Typical workload mix (70% simple / 30% complex) sees ~40% cost reduction. Adds some latency from escalations. Worth it for cost-sensitive workloads where latency is forgiving.

How do I observe + measure LLM costs?

Per Helicone at helicone.ai, Langfuse at langfuse.com, and Vercel AI Gateway at vercel.com, observability platforms track per-call cost + token usage + cache hit rate + model breakdown. OpenRouter at openrouter.ai provides cross-provider visibility for multi-provider stacks. Can't optimize what isn't measured.

Cut LLM bills 60-90% with the 7 cost levers applied in production order.

The Code Prompt Builder structures prompts that compress + cache + cascade cleanly across providers + cost layers. Free, no signup. Part of 40+ free prompt tools.

Browse all prompt tools →