By The DDH Team · Digital Dashboard Hub

AI API Cost Trends 2026: Quarterly Price History + H2 Projections



H1 2026 has been the most aggressive price-cut window in commercial-LLM history. Between January and June, every major frontier provider repriced at least one tier: Anthropic shipped Claude Sonnet 4.6 holding the $3/$15 per-million-token line while quietly deepening cache-hit reads to 90% off (vs 70% in 2024); OpenAI launched **gpt-5-mini at $0.25/$2** in April 2026, materially undercutting its prior gpt-4o-mini line; Google held Gemini 2.5 Flash at $0.30/$2.50 across the full 2M context window and pushed cache discounts to 75%. The compounding effect: the average mid-tier per-1M cost dropped roughly 38% in six months, per Artificial Analysis aggregator data.

The cause is structural, not promotional. DeepSeek V3.1 (released January 2026) hit Together and Fireworks at **$0.27 input / $1.10 output per 1M** with GPT-4-class benchmark scores — instantly setting a global price floor that closed-source providers had to react to within weeks. Meta's Llama 4 Maverick followed in February at $0.50/$0.85 via Groq's LPU inference, and Cerebras posted sub-second time-to-first-token at comparable prices. Closed-API providers no longer compete only with each other; they compete with a $0.27 open-weight floor running on dedicated-inference clouds. That's the dynamic that broke the 2024-2025 pricing equilibrium.

Q2 2026 layered a second discount channel on top: caching and batch processing. Anthropic's prompt-cache reads dropped from 0.30x of base input in 2024 to **0.10x of base input in March 2026** — a 90% discount on repeat-context tokens. OpenAI's batch API stayed at 50% off list, but cache discounts deepened to 50%. Google's implicit caching on Gemini 2.5 (introduced May 2026) hits 75% off without any explicit cache-write call. Teams that haven't restructured prompts for cache anchoring are paying 4-10x more than necessary on repeat-context workloads. We covered the mechanics in prompt-caching savings 2026; this article is the cross-provider price history.

Below: the canonical Q1 vs Q2 2026 price table covering 11 frontier and mid-tier models, ten sourced sections on what moved and why (the gpt-5-mini shock, the cache discount escalation, where prices held, the widening output-token premium, multimodal inflation, open-source price pressure, the dropping long-context premium, the fine-print discounts most teams miss, H2 projections, and what did NOT get cheaper), a 6-step action checklist for the rest of 2026, and a sourced FAQ. Sibling pricing detail pages: Anthropic Claude pricing 2026 · OpenAI API pricing 2026 · 3-way GPT/Claude/Gemini cost calculator. This article is dated 2026-06-20 and canonical for the H1 2026 window.


Frontier model per-1M-token prices — Q1 2026 vs Q2 2026

| Model | Q1 2026 input / output | Q2 2026 input / output | Δ % (input / output) | Cache hit price (input) |
|---|---|---|---|---|
| OpenAI gpt-5.5 (flagship) | $1.50 / $12 | $1.25 / $10 | -17% / -17% | $0.625 (50% off) |
| OpenAI gpt-5.4 | $3.00 / $18 | $2.50 / $15 | -17% / -17% | $1.25 (50% off) |
| OpenAI gpt-5-mini (launched Apr 2026) | n/a (not yet released) | $0.25 / $2.00 | new tier | $0.125 (50% off) |
| Anthropic Claude Opus 4.7 | $15 / $75 | $15 / $75 | 0% / 0% | $1.50 (90% off) |
| Anthropic Claude Sonnet 4.6 | $3 / $15 | $3 / $15 | 0% / 0% | $0.30 (90% off) |
| Anthropic Claude Haiku 4.5 | $1.00 / $5.00 | $0.80 / $4.00 | -20% / -20% | $0.08 (90% off) |
| Google Gemini 2.5 Pro | $1.25 / $10 | $1.25 / $10 | 0% / 0% | $0.3125 (75% off) |
| Google Gemini 2.5 Flash | $0.30 / $2.50 | $0.30 / $2.50 | 0% / 0% | $0.075 (75% off) |
| xAI Grok-4 | $5 / $25 | $3 / $15 | -40% / -40% | $1.50 (50% off, June 2026) |
| DeepSeek V3.1 (via Together) | $0.27 / $1.10 | $0.27 / $1.10 | 0% / 0% | $0.068 (75% off) |
| Llama 4 Maverick (via Groq) | $0.55 / $0.95 | $0.50 / $0.85 | -9% / -11% | n/a (no cache yet) |

Sources, as of 2026-06-20: Anthropic pricing page (anthropic.com/pricing), OpenAI API pricing (openai.com/api/pricing), Google AI pricing (ai.google.dev/pricing), xAI pricing (x.ai/api), DeepSeek API docs (api-docs.deepseek.com), Together AI pricing (together.ai/pricing), Groq pricing (groq.com/pricing). Cross-checked against Artificial Analysis aggregator (artificialanalysis.ai) for Q1 historical snapshots and The Information's April 2026 reporting on the gpt-5-mini launch. All prices are per 1 million tokens (USD). Q1 2026 = January-March, Q2 2026 = April-June. Re-verify on provider pages before committing to volume contracts — pricing has been moving every 4-8 weeks.
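The Δ % column can be reproduced from the two price columns. The snippet below is a minimal sketch using four rows copied from the table; the function and variable names are ours, and percentages are rounded to whole numbers as in the table.

```python
def pct_change(old: float, new: float) -> int:
    """Percentage change from old to new, rounded to the nearest whole percent."""
    return round((new - old) / old * 100)

# (Q1 input, Q1 output, Q2 input, Q2 output) in USD per 1M tokens, copied from the table.
PRICES = {
    "OpenAI gpt-5.5": (1.50, 12.00, 1.25, 10.00),
    "Claude Haiku 4.5": (1.00, 5.00, 0.80, 4.00),
    "xAI Grok-4": (5.00, 25.00, 3.00, 15.00),
    "Llama 4 Maverick (Groq)": (0.55, 0.95, 0.50, 0.85),
}

def deltas(prices: dict) -> dict:
    """Map each model to its (input delta %, output delta %) between Q1 and Q2."""
    return {
        model: (pct_change(q1_in, q2_in), pct_change(q1_out, q2_out))
        for model, (q1_in, q1_out, q2_in, q2_out) in prices.items()
    }

for model, (d_in, d_out) in deltas(PRICES).items():
    print(f"{model}: input {d_in:+d}%, output {d_out:+d}%")
```

Running the same function over your own price snapshots gives a quick check on whether a provider's change is a real cut or a rounding artifact.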

Q2 2026: the gpt-5-mini shock

The defining moment of H1 2026 was OpenAI's April 14, 2026 release of **gpt-5-mini at $0.25 input / $2.00 output per 1M tokens** — roughly 50% below the gpt-4o-mini price it replaced and within a hair's breadth of DeepSeek V3.1's $0.27/$1.10 floor. The model itself benchmarks within 5-8% of gpt-5.4 on standard reasoning evals (MMLU, GPQA, HumanEval) per Artificial Analysis's April 22 report, while shipping at one-tenth the cost. OpenAI's own announcement framed it as 'the new default for production workloads' — a clear concession that gpt-4o-mini pricing was no longer competitive.

The pressure was external. DeepSeek V3.1 had been live on Together and Fireworks since January 2026 at $0.27/$1.10, and Gemini 2.5 Flash had held $0.30/$2.50 since November 2025 with a 2M-token context window. Both routinely cleared 70-75% on MMLU and produced output that was indistinguishable from GPT-4-class on most production tasks. OpenAI's choice was either reprice mid-tier or watch developers migrate; per The Information's April 2026 reporting, internal data showed gpt-4o-mini API call volume had grown only 8% year-over-year while Gemini Flash API volume had grown 280%. The pricing decision was a defensive move.

Anthropic responded indirectly. Rather than cutting Sonnet 4.6 (which held $3/$15), Anthropic cut Haiku 4.5 from $1/$5 to $0.80/$4 in May 2026 and deepened the cache-read discount from 70% to 90% on all models, turning Sonnet's effective input cost on cache hits from $3 down to $0.30 per 1M. The net result is that mid-tier API costs in late Q2 2026 are roughly 35-45% below where they sat in December 2025, depending on workload shape and cache hit rate.

What this means for buyers: if you have not re-tested gpt-5-mini, Sonnet 4.6 (with cache), and Gemini 2.5 Flash against your production workload in the last 90 days, you are almost certainly overpaying. The price-quality curve has shifted enough that defaults from 2025 are no longer defaults in 2026. We document the test framework in the 3-way GPT/Claude/Gemini cost calculator — run your typical prompt against all three and compare blended cost per 1k requests.
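As a rough illustration of "blended cost per 1k requests", the sketch below prices one workload shape against the three list prices quoted above. The 2,000-input / 500-output token shape is a hypothetical placeholder; substitute your own averages. Cache and batch discounts are deliberately ignored here.

```python
# List prices in USD per 1M tokens (input, output), as quoted in this article.
LIST_PRICES = {
    "gpt-5-mini": (0.25, 2.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
}

def cost_per_1k_requests(input_tokens: int, output_tokens: int,
                         input_price: float, output_price: float) -> float:
    """USD cost of 1,000 identical requests at list price (no cache, no batch)."""
    # tokens * (USD per 1M tokens) / 1M gives USD per request; times 1,000 requests.
    return (input_tokens * input_price + output_tokens * output_price) / 1_000

def compare(input_tokens: int, output_tokens: int) -> dict:
    """Cost per 1k requests for every model in LIST_PRICES."""
    return {
        model: cost_per_1k_requests(input_tokens, output_tokens, p_in, p_out)
        for model, (p_in, p_out) in LIST_PRICES.items()
    }

# Hypothetical workload shape: 2,000 input tokens and 500 output tokens per request.
for model, usd in sorted(compare(2_000, 500).items(), key=lambda kv: kv[1]):
    print(f"{model}: ${usd:.2f} per 1k requests")
```

A real comparison should also count only successful requests and apply whatever cache hit rate you actually observe.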

Mid-tier model price war — Apr-May 2026

| Model | Input ($/1M) | Output ($/1M) | Tok/s (avg) | Context |
|---|---|---|---|---|
| OpenAI gpt-5-mini | $0.25 | $2.00 | 180 t/s | 400k |
| Anthropic Sonnet 4.6 | $3.00 | $15.00 | 85 t/s | 1M (200k standard) |
| Anthropic Haiku 4.5 | $0.80 | $4.00 | 165 t/s | 200k |
| Google Gemini 2.5 Flash | $0.30 | $2.50 | 240 t/s | 2M |
| DeepSeek V3.1 (Together) | $0.27 | $1.10 | 95 t/s | 128k |
| Llama 4 Maverick (Groq) | $0.50 | $0.85 | 750 t/s (LPU) | 256k |

Throughput numbers are median tokens-per-second measured by Artificial Analysis (artificialanalysis.ai) between April 15 and May 30, 2026. Groq's 750 t/s figure reflects LPU inference, not standard GPU. Context windows are the maximum supported; effective working context for many models degrades past 50-60% utilization (see Anthropic's 2026 long-context degradation report).


Q2 2026: cache + batch deepened the discount

The headline list-price cuts were only half the story. Over H1 2026, all three major closed providers also deepened their cache and batch discounts to levels that materially change cost-per-1k-requests math for any team with repeat-context workloads.

**Anthropic** moved first. On March 6, 2026, prompt-cache reads dropped from 0.30x of base input price (the 2024 default) to **0.10x of base input**, a 90% discount on cached tokens. For Sonnet 4.6 ($3/M input), that's $0.30/M on cache hits. The discount applies under both the 5-minute and 1-hour cache TTLs; the 1-hour TTL costs 2x base input on write but reads at 0.10x. A worked example: an agent that re-sends a 50k-token system prompt + tool definitions on every turn and adds 5k tokens of fresh user input (so roughly 90% of the input is cacheable) pays about $0.030 in input per turn on Sonnet 4.6 with cache hits ($0.015 for the 50k cached tokens at $0.30/M plus $0.015 for the 5k fresh tokens at $3/M) vs $0.165 per turn without caching, a 5.5x reduction.
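Because the 1-hour cache write costs more than an ordinary read of the same tokens, caching only pays off after a few turns. The sketch below works that out for the 50k-prefix / 5k-fresh agent described above, counting input cost only. It assumes the first turn pays the 2x write, every later turn hits the cache within the TTL, and the multipliers are the ones quoted in this article; all names are ours.

```python
BASE_INPUT = 3.00      # USD per 1M input tokens (Sonnet 4.6 list price)
WRITE_MULT_1H = 2.0    # 1-hour TTL cache write, as a multiple of base input
READ_MULT = 0.10       # cache read, as a multiple of base input

def usd(tokens: int, price_per_million: float) -> float:
    return tokens * price_per_million / 1_000_000

def uncached_cost(turns: int, prefix: int = 50_000, fresh: int = 5_000) -> float:
    """Input cost when the full prompt is billed at base price on every turn."""
    return turns * usd(prefix + fresh, BASE_INPUT)

def cached_cost(turns: int, prefix: int = 50_000, fresh: int = 5_000) -> float:
    """Input cost when turn 1 writes the prefix to cache and later turns read it."""
    if turns == 0:
        return 0.0
    write = usd(prefix, BASE_INPUT * WRITE_MULT_1H)
    reads = (turns - 1) * usd(prefix, BASE_INPUT * READ_MULT)
    fresh_total = turns * usd(fresh, BASE_INPUT)
    return write + reads + fresh_total

def break_even_turn(max_turns: int = 100):
    """First turn count at which the cached conversation is cheaper, or None."""
    for n in range(1, max_turns + 1):
        if cached_cost(n) < uncached_cost(n):
            return n
    return None

print("break-even turn:", break_even_turn())
print(f"10 turns: cached ${cached_cost(10):.3f} vs uncached ${uncached_cost(10):.3f}")
```

Under these assumptions a single-turn call is more expensive with caching than without, so short one-shot requests are not a good fit for the 1-hour TTL.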

**OpenAI** kept the 50% cache + batch discounts but extended them to gpt-5-mini, gpt-5.4, and gpt-5.5 in February 2026 (previously gpt-5-mini was excluded). The combined cache + batch discount stacks to 75% off list on cached input for batch-eligible workloads; output tokens get only the 50% batch discount, since caching applies to input. That makes gpt-5.5 batch + cache effectively about $0.31 input / $5.00 output per 1M: in line with Gemini 2.5 Flash's list price on input, though still twice Flash's $2.50 on output.

**Google** introduced implicit caching on Gemini 2.5 in May 2026: no explicit cache-write call is required, and the platform detects repeat prefixes automatically and applies a 75% discount on hit. This is the most developer-friendly cache implementation in the market; teams don't need to restructure prompts or manage cache keys. Combined with Gemini's flat $0.30/$2.50 Flash price, this puts the effective cost of cache-heavy Gemini Flash workloads at roughly **$0.075 per 1M on cached input**, with output unchanged at $2.50 per 1M since caching discounts input only. That keeps Flash the cheapest closed-API option for cache-heavy workloads as of June 2026.


Where prices held steady — and why

Not everything moved. **Claude Opus 4.7 held $15 input / $75 output** through all of H1 2026 — flat from its November 2025 launch. Anthropic's position, articulated in the May 2026 Investor Day deck, is that Opus is 'frontier worth paying for' — the model where the gap between Opus and the next-best alternative is widest, particularly on long-horizon agentic tasks, code-base-level reasoning, and nuanced writing. Anthropic has explicitly declined to compete with mid-tier on price; the strategy is to preserve Opus as a premium tier and let Sonnet/Haiku absorb the price pressure.

**Gemini 2.5 Pro held $1.25 / $10** across H1, also flat. Google's positioning is different: Gemini Pro is pitched as 'Sonnet-class quality at half the price' (Sonnet 4.6 is $3/$15, Gemini Pro is $1.25/$10, about 58% cheaper on input and 33% cheaper on output), so it is already deeply discounted relative to comparable Anthropic and OpenAI tiers. The fact that Gemini Pro didn't need to cut tells you something about Google's confidence in the price-quality curve.

**OpenAI's gpt-5.5 reasoning mode** (the variant with extended chain-of-thought) held at $1.50 input + $12 output for non-reasoning calls and $1.50 input + $60 output when reasoning tokens are enabled. Reasoning tokens are billed at output rates — for hard math, code, or planning queries, the effective cost per response can be 4-10x the non-reasoning rate even though list price didn't move. This is the most common 'hidden price increase' of 2026: workloads that opt into reasoning are paying materially more per response while list price appears flat.

The pattern: tiers that hold are tiers where the provider has either pricing power (Opus, gpt-5.5 reasoning) or already-aggressive pricing (Gemini 2.5 Pro). Tiers that cut are tiers facing direct DeepSeek/Llama-class competition (mid-tier, mini, Haiku). If you are deciding between flagship tiers, expect minimal price movement in H2; if you are deciding between mid-tier, expect at least one more cut before year-end.


The output-token premium widened

A subtle but important 2026 trend: output tokens are now priced 4-10x input tokens across all major providers, up from 3-5x in 2024. Examples — Sonnet 4.6 is 5x ($3 → $15), gpt-5-mini is 8x ($0.25 → $2), Opus 4.7 is 5x ($15 → $75), Gemini Flash is 8.3x ($0.30 → $2.50), DeepSeek V3.1 is 4.1x ($0.27 → $1.10). The widening gap reflects the actual cost structure: output token generation is sequential and GPU-bound, while input prefill is parallelizable. As providers optimize their inference stacks for input throughput, the output side becomes the bottleneck — and the price.
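The multiples quoted above are simply output price divided by input price. A minimal sketch, using the list prices from this article (names are ours):

```python
# (input, output) list prices in USD per 1M tokens, as quoted in this article.
LIST_PRICES = {
    "Sonnet 4.6": (3.00, 15.00),
    "gpt-5-mini": (0.25, 2.00),
    "Opus 4.7": (15.00, 75.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "DeepSeek V3.1": (0.27, 1.10),
}

def output_premium(input_price: float, output_price: float) -> float:
    """How many times more an output token costs than an input token."""
    return round(output_price / input_price, 1)

PREMIUMS = {model: output_premium(p_in, p_out) for model, (p_in, p_out) in LIST_PRICES.items()}

for model, multiple in PREMIUMS.items():
    print(f"{model}: output is {multiple}x input")
```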

What this means for prompt design: minimizing output length is now the highest-leverage cost optimization available. A prompt that elicits 500 tokens of output vs 2000 tokens of output on the same input is a 4x cost difference on the output side — typically a 70-80% reduction in total per-request cost since output dominates billed tokens for most workloads. Tactics: explicit length constraints ('respond in 3 sentences', 'maximum 200 words'), structured output schemas that force concision, summary-then-expand patterns where the model outputs a 100-token summary first and only generates full output on demand.

The corollary: any workload where you can move work from output generation to input retrieval pays back. Caching reference material in the input (at 90% discount on Anthropic, 75% on Google, 50% on OpenAI for cache hits) is cheaper than asking the model to regenerate it on output. RAG architectures where the retrieved context comes in as cached input and the model outputs only a 200-token answer beat agentic architectures where the model generates 2000 tokens of reasoning + answer.

**Worked example**: a customer-support agent that takes 5k tokens of input and responds with an average 3000-token output costs $0.060/response on Sonnet 4.6 (5k input × $3/M = $0.015, plus 3k output × $15/M = $0.045). Restructured to output 500 tokens of structured response only (with reasoning kept internal via Anthropic's extended-thinking budget), cost drops to $0.0225/response, about 63% lower. With a 90% cache hit rate on the input, it drops to roughly $0.010/response. The same model, the same task, nearly 6x cheaper.
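To make the arithmetic easy to rerun with your own numbers, here is a small sketch of a per-response cost function. It counts input and output at Sonnet 4.6 list prices and bills the cached share of input at the 0.10x read rate; cache-write cost and any extended-thinking tokens are left out, which is an assumption of this sketch rather than a statement about how a real bill is itemized.

```python
def response_cost(input_tokens: int, output_tokens: int,
                  input_price: float = 3.00, output_price: float = 15.00,
                  cached_fraction: float = 0.0, cache_read_mult: float = 0.10) -> float:
    """USD cost of one response; prices are USD per 1M tokens."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh * input_price + cached * input_price * cache_read_mult) / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    return input_cost + output_cost

baseline = response_cost(5_000, 3_000)                      # long free-form answer
concise = response_cost(5_000, 500)                         # 500-token structured answer
concise_cached = response_cost(5_000, 500, cached_fraction=0.9)

print(f"baseline:         ${baseline:.4f}")
print(f"concise:          ${concise:.4f}")
print(f"concise + cache:  ${concise_cached:.4f}")
print(f"overall reduction: {baseline / concise_cached:.1f}x")
```

Note how most of the saving comes from the shorter output; the cache discount only touches the input side.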


Multimodal inflation: image + audio tokens

Text token prices dropped. Multimodal tokens did not. **Gemini 2.5 Pro** charges $0.0025 per image input at 1024x1024 (roughly 258 tokens equivalent) and $2.10 per minute of audio input — pricing that is unchanged since the 2.5 launch in October 2025. Image and audio inputs cost the same per-equivalent-token as text inputs, but the token-equivalents are high: a single full-page screenshot can consume 1,500-3,000 token-equivalents, and a 5-minute audio clip runs ~$10.50 just on input.

**OpenAI's gpt-image-2** (the dedicated image generation model, launched March 2026) prices per image rather than per token: $0.04 per standard 1024x1024 generation, $0.08 for HD. This is roughly flat vs DALL·E 3 pricing from 2023-2025. The Realtime API (voice) held at $0.06 per minute of audio input + $0.24 per minute of audio output — unchanged through all of 2026 so far. The voice-AI tier has not seen the price compression that text-API has.

**Anthropic's vision pricing** charges Sonnet 4.6 vision inputs at standard token rates — a 1100x1100 image consumes roughly 1,600 tokens at $3/M input = $0.0048 per image. Cached vision tokens benefit from the same 90% discount as text. This is the cheapest enterprise-grade vision API by a margin; OpenAI's gpt-5.5 vision charges identical token rates ($1.25/M input) but its per-image token count is roughly 30% higher for the same resolution image.

What this means for buyers: if your workload is heavily multimodal, the H1 2026 cost cuts mostly did not apply to you. Vision is still expensive relative to text, voice is still expensive relative to vision, and image generation is priced per-image rather than per-token (which means the dropping per-token rates don't pass through). Plan multimodal workload budgets at 2024-2025 unit economics, not 2026.


Open-source pulled commercial prices down

The reason H1 2026 was a price-cut year is that closed-API providers no longer set the price floor. **DeepSeek V3.1** at $0.27/$1.10 (via Together, Fireworks, Hyperbolic) hits GPT-4-class benchmarks at 1/10th the cost; **Llama 4 Maverick** via Groq at $0.50/$0.85 with 750 t/s throughput beats every closed-API on latency-per-dollar; **Qwen 3 235B** on Hyperbolic hit $0.35/$0.70 in April 2026. These three models alone account for an estimated 18-22% of total LLM API call volume in Q2 2026, per Artificial Analysis's June aggregator report.

The economic dynamic is straightforward. Open-weight models are commodity inputs to inference-as-a-service providers (Together, Groq, Fireworks, Cerebras, Hyperbolic, Lambda). Those providers compete on inference efficiency rather than model differentiation, which drives their margins toward break-even on GPU cost. A 70B-class model running on optimized inference stacks costs the provider roughly $0.15-0.40 per million output tokens to serve at scale; charging $1.10 leaves enough margin to operate but no more. Closed-API providers have to price within ~3-5x of that floor to retain non-frontier workloads, which is the dynamic that delivered gpt-5-mini at $0.25/$2.

The model-quality gap that justified higher closed-API prices has narrowed. On MMLU, GPQA, HumanEval, and most production benchmarks, the gap between best open-source (DeepSeek V3.1, Llama 4 Maverick, Qwen 3 235B) and best closed-source mid-tier (gpt-5-mini, Haiku 4.5, Gemini Flash) is within 3-7 percentage points. For most production workloads (classification, extraction, summarization, conversational response, structured output), that gap is not visible in user-facing quality.

**For buyers**: route by workload sensitivity, not by brand loyalty. Pilot DeepSeek V3.1 on Together for high-volume non-sensitive workloads. Pilot Llama 4 Maverick on Groq when latency matters more than absolute quality. Keep Claude Sonnet 4.6 or gpt-5.5 for workloads where the 5-7% quality gap shows up materially (legal, medical, customer-facing high-stakes). Mixed routing typically cuts blended API costs 40-60% vs single-provider strategies, per case studies we've reviewed.
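A back-of-the-envelope way to size the mixed-routing saving is to weight each model's per-request cost by its share of traffic. The sketch below uses list prices from this article; the 50/50 traffic split and the 2,000-input / 500-output request shape are hypothetical.

```python
# (input, output) list prices in USD per 1M tokens, as quoted in this article.
LIST_PRICES = {
    "DeepSeek V3.1 (Together)": (0.27, 1.10),
    "Claude Sonnet 4.6": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = LIST_PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

def blended_cost(traffic_share: dict, input_tokens: int, output_tokens: int) -> float:
    """Average USD cost per request given each model's share of traffic (shares sum to 1)."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * request_cost(model, input_tokens, output_tokens)
               for model, share in traffic_share.items())

# Hypothetical split: half of requests are non-sensitive and can go to the cheaper model.
split = {"DeepSeek V3.1 (Together)": 0.5, "Claude Sonnet 4.6": 0.5}
single = request_cost("Claude Sonnet 4.6", 2_000, 500)
mixed = blended_cost(split, 2_000, 500)
savings = 1 - mixed / single

print(f"single-provider: ${single:.5f}/request, mixed: ${mixed:.5f}/request")
print(f"savings: {savings:.0%}")
```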


Long-context premium dropped

In 2024, accessing long context windows carried a premium price. Gemini 1.5 Pro charged 2x the standard rate for tokens past the 128k mark. Anthropic charged Sonnet a 2x premium for the (then-experimental) 1M context window. OpenAI didn't offer >128k context at the time. The 'long context tax' was real and discouraged use.

In 2026, that has largely collapsed. **Gemini 2.5 Pro charges flat $1.25/$10 across the entire 2M context window**, with no tier pricing above any threshold. **Gemini 2.5 Flash** is also flat across 2M context at $0.30/$2.50. This makes Gemini the cheapest long-context option in market by a wide margin: feeding 500k tokens of context (a moderately large codebase or document corpus) costs $0.15 in input per call on Flash and $0.625 on Pro, vs $3.00 on Sonnet 4.6 at its long-context rate of $6/M, and is unsupported at that length on gpt-5.5 (which caps at 400k).

**Anthropic** still charges Sonnet 4.6 a 2x input/output premium above 200k tokens for context up to 1M — so a 500k-token Sonnet call costs $6 input / $30 output per 1M (vs the standard $3/$15 below 200k). **Opus 4.7 holds flat $15/$75 across the full 200k context**; the 1M window is not available on Opus as of June 2026 per Anthropic's June 8 update. **OpenAI's gpt-5.5** holds $1.25/$10 across the full 400k window, no tiered pricing.
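The long-context comparison can be sketched as a small pricing model. It assumes, as the Sonnet figures above imply, that once a prompt crosses the premium threshold the whole prompt is billed at the higher rate; the class and function names are ours, and only input cost is modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LongContextPricing:
    input_price: float                        # USD per 1M input tokens, standard rate
    max_context: int                          # largest supported prompt, in tokens
    premium_threshold: Optional[int] = None   # prompt size above which the premium applies
    premium_multiplier: float = 1.0

# Figures as quoted in this article.
MODELS = {
    "Gemini 2.5 Flash": LongContextPricing(0.30, 2_000_000),
    "Gemini 2.5 Pro": LongContextPricing(1.25, 2_000_000),
    "Claude Sonnet 4.6": LongContextPricing(3.00, 1_000_000,
                                            premium_threshold=200_000,
                                            premium_multiplier=2.0),
    "gpt-5.5": LongContextPricing(1.25, 400_000),
}

def input_cost(model: str, prompt_tokens: int) -> Optional[float]:
    """USD input cost of one call, or None if the prompt exceeds the context window."""
    pricing = MODELS[model]
    if prompt_tokens > pricing.max_context:
        return None
    rate = pricing.input_price
    if pricing.premium_threshold is not None and prompt_tokens > pricing.premium_threshold:
        rate *= pricing.premium_multiplier
    return prompt_tokens * rate / 1_000_000

for name in MODELS:
    cost = input_cost(name, 500_000)
    print(f"{name}: " + ("unsupported at 500k" if cost is None else f"${cost:.3f} per call"))
```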

The buyer takeaway: if your workload involves long context (>100k tokens regularly), Gemini 2.5 Flash or Pro is now meaningfully cheaper than the equivalent Anthropic or OpenAI calls. For agent workloads where the full conversation history grows past 50k tokens, the savings compound — and combined with Gemini's 75% implicit caching, the effective rate on long-context Gemini calls is the lowest in the frontier market.


The fine-print discounts most teams miss

List prices are the starting point, not the ending point. Most large API buyers in 2026 are not paying list. The discount channels available — and underused — include:

**Volume tier discounts**: Anthropic offers automatic volume discounts at $30k/month spend (5%), $100k/month (10%), and negotiable rates above $500k/month. OpenAI offers similar automatic tiers starting at $25k/month and direct enterprise contracts above $250k/month with typical discounts of 12-20%. These are not advertised on pricing pages — you have to ask sales or trigger them via spend. If your monthly API bill is above $25k and you have not contacted sales, you are leaving 5-15% on the table.

**Annual commit pricing**: Google offers committed-use discounts on Vertex AI of 20% for 1-year commits and 35% for 3-year commits on Gemini API spend. Anthropic offers Enterprise contracts with annual commits at typical 10-20% discounts. OpenAI's enterprise tier offers similar. For predictable production workloads (>$50k/year), the annual commit math almost always pays — Google's 20% off is roughly equivalent to swapping Sonnet 4.6 for Gemini 2.5 Pro on price.

**Marketplace and reseller discounts**: AWS Bedrock offers Anthropic Claude at the same list price as direct API, but Bedrock spend counts toward AWS EDP (Enterprise Discount Program) commits — if you have an AWS EDP, you effectively get your AWS-wide discount applied to Claude API spend. Vertex AI similarly counts Gemini spend toward GCP committed-use discounts. Azure offers OpenAI models at parity pricing with the same Microsoft enterprise agreement discount logic. For enterprises with existing hyperscaler commits, routing API calls through the marketplace can save 8-25% with zero code change beyond endpoint URL.

**Batch + cache stacking**: covered above, but worth restating — combining batch API (50% off, OpenAI) with cache discounts (50-90% depending on provider) is the deepest discount stack available. For overnight workloads (data processing, embedding generation, content batch-summarization), batch + cache typically delivers 75-95% effective discount vs list. Most production teams use one or the other; few use both.
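Stacked discounts multiply rather than add. The sketch below shows where the 75-95% range comes from, under the assumption that the batch and cache discounts both apply to the same cached input tokens; output tokens and uncached input would see the batch discount only.

```python
def stacked_discount(*discounts: float) -> float:
    """Combined discount when each discount applies to what is left after the previous one."""
    remaining = 1.0
    for d in discounts:
        if not 0.0 <= d <= 1.0:
            raise ValueError("each discount must be between 0 and 1")
        remaining *= (1.0 - d)
    return 1.0 - remaining

# 50% batch stacked with a 50% cache discount, then with a 90% cache discount.
print(f"batch + 50% cache: {stacked_discount(0.5, 0.5):.0%} off cached input")
print(f"batch + 90% cache: {stacked_discount(0.5, 0.9):.0%} off cached input")
```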


Projections for H2 2026

What's likely to move in the back half of the year, with a confidence level on each call:

**(High confidence) Another mid-tier price cut**: The price war that produced gpt-5-mini at $0.25/$2 is not over. Per The Information's June 2026 reporting, OpenAI is expected to release **gpt-5.6** in September 2026 with reasoning improvements, and gpt-5-mini-v2 is expected to follow at potentially $0.20/$1.50 to maintain the gap with DeepSeek. Anthropic's Haiku 4.6 is rumored for late Q3 at potentially $0.60/$3.50. Conservative projection: expect another 15-25% cut on mid-tier list prices by December 2026.

**(Medium confidence) Flagship hold or small cut**: Claude Opus 4.8 is in private beta as of June 2026 per Anthropic's developer Slack — expected GA in Q3. Opus has held $15/$75 through 2026; if Opus 4.8 launches at the same price (likely), Anthropic preserves frontier pricing power. OpenAI's gpt-5.6 may launch at gpt-5.5 prices ($1.25/$10) if positioned as 'better at same price' rather than a tier replacement. Gemini 3 is expected late Q4 2026 or early 2027 per Google I/O signaling — Gemini 2.5 Pro is unlikely to move materially before then.

**(Medium confidence) Reasoning model price compression**: The o-series-style 'reasoning tokens are output tokens' billing model has been criticized as opaque and expensive. Expect at least one provider to ship a reasoning model with reasoning tokens billed at a discount to standard output (e.g., 0.5x output rate) by year-end as a competitive differentiator. Anthropic's extended-thinking is already billed at standard output rates, which positions Anthropic well if competitors compress here.

**(High confidence) Multimodal price cuts**: Vision and voice have not seen the H1 2026 price compression that text saw. Open-source vision models (Llama 4 Vision, Qwen 3 VL) are catching up in benchmark quality, and Groq + Cerebras are eyeing voice inference as a 2026 growth area. Expect at least one major closed-API multimodal price cut by Q4 — most likely Gemini's vision pricing or OpenAI Realtime audio.


What did NOT get cheaper

Context for the rest of this article: not everything in the AI stack is following text-API deflation. Several adjacent product categories have held flat or risen through H1 2026.

**Reasoning model inference**: o3, gpt-5.5 reasoning mode, and DeepSeek R1 reasoning all bill reasoning tokens at output rates. A query that produces 200 tokens of visible output + 4000 tokens of internal reasoning is billed for 4200 output tokens — at $12/M for gpt-5.5, that's $0.05 per response on a query that looks like a 200-token answer. List prices on reasoning models did not move in H1 2026; effective costs went up as adoption increased and developers leaned harder on reasoning for quality.
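The gap between what a response looks like and what it is billed as can be sketched in a few lines, using the 200-visible / 4,000-reasoning token example and the $12/M output rate quoted above (function name is ours):

```python
def billed_output_cost(visible_tokens: int, reasoning_tokens: int,
                       output_price_per_million: float) -> float:
    """USD cost when hidden reasoning tokens are billed at the output rate."""
    return (visible_tokens + reasoning_tokens) * output_price_per_million / 1_000_000

apparent = billed_output_cost(200, 0, 12.00)       # what a 200-token answer would cost alone
actual = billed_output_cost(200, 4_000, 12.00)     # with 4,000 reasoning tokens billed as output

print(f"apparent: ${apparent:.4f}, actual: ${actual:.4f}, multiple: {actual / apparent:.0f}x")
```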

**Real-time voice**: OpenAI Realtime stayed $0.06/min audio input, $0.24/min audio output. Anthropic does not offer a Realtime equivalent; Google's voice is bundled in Gemini Live at flat AI Pro subscription prices (consumer, not API). The voice-AI tier has held flat through 2026, despite text APIs falling 30-50%. Cost per minute of voice agent conversation has not materially declined.
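For budgeting, the per-minute rates translate directly into a per-conversation cost. The sketch below uses the Realtime rates quoted above; the split of a 10-minute call into 5 minutes of user audio and 5 minutes of model audio is a hypothetical example.

```python
def voice_call_cost(input_minutes: float, output_minutes: float,
                    input_rate: float = 0.06, output_rate: float = 0.24) -> float:
    """USD cost of one voice conversation; rates are USD per minute of audio."""
    return input_minutes * input_rate + output_minutes * output_rate

# Hypothetical 10-minute call: 5 minutes of user speech, 5 minutes of model speech.
per_call = voice_call_cost(5, 5)
print(f"per call: ${per_call:.2f}, per 1,000 calls: ${per_call * 1_000:,.0f}")
```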

**Image generation**: gpt-image-2 at $0.04-0.08/image. Midjourney v7 launched May 2026 at flat $10/month Basic tier (250 GPU-minutes), no per-image API. FLUX.1 Pro on fal.ai at $0.05/image. Stable Diffusion 4 via Replicate at $0.0035-0.012/image depending on resolution. Per-image pricing has been flat-to-slightly-down (5-10%) but nothing like the text compression. Image gen pricing is bottlenecked by per-generation GPU cost, which is not falling at the rate of text inference.

**Self-hosted GPU rentals**: H200 rentals on Lambda Labs, Coreweave, and Voltage Park have held $3.50-4.50/hour through all of H1 2026 (vs $4.00-5.50 in 2025, a small drop). B200 chips are still rare and command $8-12/hour. The price compression on inference APIs reflects provider efficiency gains and competitive pressure, not falling underlying GPU costs. For teams considering self-hosting vs API, the math has tilted further toward API in 2026: every mid-tier text API is cheaper per token than a self-hosted equivalent on rented GPUs unless utilization exceeds roughly 70%.
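A rough way to test the self-hosting math for your own setup is to convert a GPU's hourly rate into a cost per million tokens at a given utilization. In the sketch below the $4/hour rate sits inside the H200 range quoted above, but the 1,000 tokens/second aggregate throughput is purely a hypothetical placeholder; measure your own stack before drawing conclusions.

```python
def self_hosted_cost_per_million(gpu_hourly_usd: float, tokens_per_second: float,
                                 utilization: float) -> float:
    """USD per 1M tokens served on a rented GPU at the given utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

def break_even_utilization(gpu_hourly_usd: float, tokens_per_second: float,
                           api_price_per_million: float) -> float:
    """Utilization at which self-hosting matches the API price; above 1.0 means never."""
    return gpu_hourly_usd * 1_000_000 / (tokens_per_second * 3600 * api_price_per_million)

# Hypothetical: one $4/hour GPU sustaining 1,000 tokens/second when fully busy.
print(f"at 70% utilization: ${self_hosted_cost_per_million(4.0, 1_000, 0.7):.2f} per 1M tokens")
print(f"break-even vs a $2.00/M API: {break_even_utilization(4.0, 1_000, 2.00):.0%} utilization")
```

The sketch ignores engineering time, idle capacity during traffic troughs, and multi-GPU serving, all of which push the real break-even point higher.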

What to do — 6 actions for H2 2026

1. Re-audit your provider mix on the Q2 2026 price list

   Pull your last 30 days of API usage by provider, multiply against current Q2 2026 list prices (the table above), and identify the top 3 line items. If you have not re-tested gpt-5-mini, Sonnet 4.6 with cache, and Gemini 2.5 Flash against your largest workloads in the last 60 days, you are likely overpaying 30-50%. Run a 1-week A/B with each on your top use case and compare blended cost per 1k successful requests.

2. Move batch-eligible workloads to Batch API

   OpenAI Batch is 50% off list with up to 24-hour turnaround. Anthropic Message Batches are 50% off list with up to 24-hour turnaround. Google Vertex Batch is similar. Any non-interactive workload (overnight summarization, embedding generation, document processing, classification at scale, evaluation runs) should be on Batch. The 50% discount is the single biggest no-effort cost cut available for the right workload shape.

3. Restructure prompts for cache anchoring

   Move stable content (system prompts, tool definitions, reference documents, few-shot examples) to the front of your prompt and mark for caching. On Anthropic, the 90% cache discount means a 50k-token system prompt costs $0.015 per cached call instead of $0.15, a 10x reduction. On Google, implicit caching at 75% off applies automatically. On OpenAI, 50% off cache hits. For any workload with repeated context (agents, chatbots, RAG), this is the second biggest no-effort cost cut after batch. See our prompt-caching savings guide for implementation.

4. Test gpt-5-mini, Sonnet 4.6, and Gemini Flash as drop-in for mid-tier

   If your current default is gpt-4o, gpt-4o-mini, Claude Sonnet 3.7, or Gemini 1.5 Pro, your current default is now overpriced and outperformed. Run your standard quality eval against gpt-5-mini ($0.25/$2), Sonnet 4.6 with cache ($0.30/$15 effective), and Gemini 2.5 Flash ($0.30/$2.50). The winner depends on workload, but the loser is almost always the 2024-era default you forgot to revisit.

5. Lock annual commit pricing if volume warrants

   Above $50k/year on a single provider, ask for annual commit pricing. Google offers 20% off on 1-year Vertex commits, 35% off on 3-year. Anthropic and OpenAI both negotiate annual commits at typically 10-20% off, with terms varying by spend tier. If you have an existing AWS EDP, Azure EA, or GCP committed-use agreement, route API spend through Bedrock / Azure OpenAI / Vertex respectively and apply your hyperscaler discount on top of the API list price.

6. Set up monthly price-monitoring

   Pricing has been moving every 4-8 weeks in 2026 and the H2 cadence is likely to be similar. Set a recurring calendar event the first Monday of each month to check: (a) Artificial Analysis aggregator (artificialanalysis.ai) for cross-provider trend data, (b) the three providers you use most for list-price changes, (c) The Information AI newsletter for forward-looking model launch reporting. Update your internal cost models within 7 days of any list price change; the team that adjusts fastest captures the largest delta.
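For the monthly check in step 6, one lightweight approach is to keep a dated snapshot of the list prices you depend on and diff it against the current numbers. The sketch below compares two in-memory snapshots built from figures in this article; how you collect the current prices (manually or otherwise) is left open, and all names are ours.

```python
def pct_change(old: float, new: float) -> int:
    return round((new - old) / old * 100)

def price_changes(previous: dict, current: dict) -> dict:
    """Report models that are new or whose (input, output) list price moved."""
    changes = {}
    for model, (new_in, new_out) in current.items():
        if model not in previous:
            changes[model] = "new"
            continue
        old_in, old_out = previous[model]
        if (old_in, old_out) != (new_in, new_out):
            changes[model] = (pct_change(old_in, new_in), pct_change(old_out, new_out))
    return changes

# (input, output) in USD per 1M tokens, taken from the Q1 and Q2 columns of the price table.
Q1_SNAPSHOT = {"Claude Sonnet 4.6": (3.00, 15.00), "Claude Haiku 4.5": (1.00, 5.00)}
Q2_SNAPSHOT = {"Claude Sonnet 4.6": (3.00, 15.00), "Claude Haiku 4.5": (0.80, 4.00),
               "gpt-5-mini": (0.25, 2.00)}

for model, change in price_changes(Q1_SNAPSHOT, Q2_SNAPSHOT).items():
    print(model, "->", change)
```

Anything the diff flags is a prompt to rerun your cost model, not a decision on its own.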

Frequently Asked Questions

Did AI API prices actually drop in 2026?

Yes — significantly. Average mid-tier per-1M-token costs across major providers dropped roughly 38% between January and June 2026 per Artificial Analysis aggregator data. The biggest single move was OpenAI's April 2026 launch of gpt-5-mini at $0.25/$2 (replacing gpt-4o-mini at higher prices). Anthropic deepened cache discounts to 90% off, dropping effective Sonnet 4.6 cost on cache-heavy workloads from $3/M to $0.30/M input. Flagship tiers (Claude Opus 4.7, Gemini 2.5 Pro, OpenAI gpt-5.5) held closer to flat — the deflation was concentrated in mid-tier and mini classes.

What was the biggest price cut of H1 2026?

OpenAI's gpt-5-mini launch on April 14, 2026 at $0.25 input / $2.00 output per 1M tokens: roughly 50% below the gpt-4o-mini price it replaced and competitive with DeepSeek V3.1's $0.27/$1.10. The broader cause was DeepSeek V3.1's January 2026 release and Llama 4 Maverick on Groq, which set a $0.27-0.50 floor that closed-API providers had to react to. Anthropic's deepening of cache discounts from 70% off (2024) to 90% off (March 2026) is arguably a larger effective cut for cache-heavy workloads, even though the list price didn't move.

Are output tokens really 10x input now?

Up to 10x on some models, yes — 4-10x is the H1 2026 range across major providers. Examples: gpt-5-mini is 8x ($0.25 input / $2.00 output), Gemini 2.5 Flash is 8.3x ($0.30 / $2.50), Sonnet 4.6 is 5x ($3 / $15), Opus 4.7 is 5x ($15 / $75), DeepSeek V3.1 is 4.1x ($0.27 / $1.10). The widening gap reflects inference economics — output generation is sequential and GPU-bound while input prefill parallelizes. Practical implication: optimizing for shorter outputs (with explicit length constraints, structured schemas, summary-first patterns) is now the highest-leverage cost optimization available.

Will prices keep dropping in H2 2026?

Yes — likely another 15-25% on mid-tier list prices by December 2026. The drivers are (a) gpt-5.6 expected September 2026 per The Information reporting, which may trigger gpt-5-mini-v2 repricing, (b) Anthropic Haiku 4.6 rumored late Q3, (c) continued DeepSeek and Llama updates resetting the open-source floor, (d) potential multimodal price cuts (vision and voice have not yet seen the H1 compression). Flagship tier (Opus, Gemini Pro, gpt-5.5) is likely to hold close to current prices through year-end.

Which model has the best price-per-quality in mid-2026?

For text-only mid-tier workloads, three contenders: (1) Google Gemini 2.5 Flash at $0.30/$2.50 list, with input falling to about $0.075 on implicit cache hits (output stays $2.50): the cheapest frontier-grade option, with 2M context. (2) OpenAI gpt-5-mini at $0.25/$2.00 list: slightly cheaper input, faster than Flash on some benchmarks. (3) Anthropic Sonnet 4.6 at $3/$15 list but $0.30/$15 with 90% cache hit: best on long-horizon agentic tasks and nuanced writing. For non-sensitive high-volume workloads, DeepSeek V3.1 at $0.27/$1.10 via Together is hard to beat. Run a 1-week A/B on your actual workload; winners shift by use case.

What's the cheapest frontier-grade API right now?

Gemini 2.5 Flash with implicit caching: an effective input rate of $0.075 per 1M on cache hits, $2.50 per 1M output, and a 2M context window. For workloads without cache-friendly structure, DeepSeek V3.1 at $0.27/$1.10 via Together AI (or other inference providers) is typically cheapest. For latency-sensitive workloads, Llama 4 Maverick via Groq at $0.50/$0.85 with 750 tokens/second throughput beats every closed-API option on latency-per-dollar.

How do I get the deepest discounts on AI API pricing?

Stack four mechanisms: (1) Use Batch API for non-interactive workloads — 50% off list at OpenAI, Anthropic, and Google. (2) Restructure prompts for cache anchoring — 90% off cached tokens at Anthropic, 75% at Google, 50% at OpenAI. (3) Lock annual commit pricing if you spend >$50k/year on a single provider — typically 10-20% off, up to 35% on Google's 3-year Vertex commits. (4) Route API spend through your hyperscaler (Bedrock for Claude, Azure for OpenAI, Vertex for Gemini) to layer your existing enterprise discount on top. Combined, these typically deliver 60-85% effective discount vs list pricing for the right workload shape.
