By The DDH Team · Digital Dashboard Hub

Prompt Caching Savings Across Providers (2026): Anthropic 90%, OpenAI 50%, Google 75%

By The DDH Team at Digital Dashboard Hub·Updated June 20, 2026

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

Prompt caching is the single biggest cost lever in 2026, and most teams are leaving 40-70% of their bill on the table because their prompt structure silently misses the cache. Frontier-model unit prices have crept up through Q1 and Q2 2026 (covered in our AI cost trends quarterly), but the headline number that matters for production workloads is the cache hit rate — not the per-token sticker price. A Sonnet 4.6 call that costs $0.063 uncached drops to $0.0095 when the prefix hits cache. That is the same model, the same answer, an 85% bill reduction, and the entire delta is structural.

The three frontier providers each implement caching differently in ways that are not obvious from the docs. **Anthropic** runs an explicit, opt-in cache: you mark up to 4 breakpoints with `cache_control` and get 90% off input tokens on hits. **OpenAI** runs an automatic, exact-prefix cache: zero code changes, 50% off, but a single dynamic character at the top of your prompt drops you to 0% hit rate. **Google** runs both an implicit cache (free, automatic, opportunistic) and an explicit `contextCache` (paid storage, guaranteed retention, 75% off, up to 24-hour TTL) — and uniquely supports multimodal cache (images, video, audio) at full discount.

Below: side-by-side caching mechanics across all three providers, the 4 anti-patterns that silently destroy cache hit rate, worked savings examples for agents and RAG workloads, the Anthropic breakpoint strategy that lets you layer caches with different TTLs, the cache-write cost math (yes, the first call is *more* expensive — payback requires 5+ reuses), and the multi-provider strategy real production teams use. Sourced to provider docs as of June 20, 2026.

Related reading: Claude pricing 2026 covers the underlying Anthropic per-token rates, agent loop cost optimization walks through caching as one of seven levers in long-running agent workloads, and AI cost trends 2026 quarterly tracks how cache discounts have shifted over the year. Want cache-anchored prompts written for you? Our ChatGPT prompt generator produces stable-prefix, dynamic-suffix prompts with provider-specific breakpoints already in place.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card. →

Prompt caching mechanics — Anthropic vs OpenAI vs Google, June 2026

Feature	Anthropic	OpenAI	Google
Discount on cache hit	90% off input tokens	50% off input tokens	75% off input tokens
Minimum cacheable tokens	1,024 tokens	1,024 tokens	4,096 tokens
Default TTL	5 minutes	5-10 minutes (variable)	1 hour (explicit cache)
Extended TTL (paid)	1 hour (2x storage cost)	N/A — fixed 5-10 min	24 hours (extended cache)
Cache write cost (vs base input)	1.25x base input	No surcharge	$1/M tokens/hour storage
Activation	Explicit (cache_control)	Automatic (exact prefix)	Explicit OR implicit
Per-request breakpoint count	Up to 4 breakpoints	N/A (single prefix)	1 contextCache per request
Multimodal cache support	Yes (vision + text)	Yes (gpt-image-2, vision)	Yes (image, video, audio)
Tool definitions cacheable	Yes	Yes (in prefix)	Yes
System prompt cacheable	Yes	Yes (in prefix)	Yes
Trajectory / message history cacheable	Yes (incremental)	Partial (prefix only)	Yes (full cached context)

Sources, as of June 20, 2026: Anthropic docs (docs.anthropic.com/en/docs/build-with-claude/prompt-caching), OpenAI prompt caching guide (platform.openai.com/docs/guides/prompt-caching), Google AI context caching docs (ai.google.dev/gemini-api/docs/caching). Anthropic's 1-hour extended TTL is generally available on Sonnet 4.6 and Opus 4.8; Haiku 4.5 supports 5-min only. OpenAI's automatic prefix cache requires prompts ≥1,024 tokens and caches in increments of 128 tokens above that threshold. Google's implicit cache hits opportunistically with no storage fee; explicit contextCache requires the 4,096-token minimum and bills storage by the hour. Re-verify before committing to architecture changes — caching terms shifted twice in 2025.

How each provider's cache actually works

**Anthropic — explicit, opt-in, you choose what gets cached.** You insert a `cache_control: {type: 'ephemeral'}` block at one of up to four breakpoints in your messages array. Everything *before* the breakpoint is hashed, written to cache on first request, and reused on subsequent requests within the TTL window. The 5-minute default TTL is automatic; the 1-hour extended TTL is opt-in with `ttl: '1h'` and doubles the storage cost (which is baked into the 1.25x cache-write surcharge applied at write time, not separately billed). Cached input reads at 0.1x base input rate — the headline 90% discount.

**OpenAI — automatic, no opt-in, exact prefix match.** Any prompt ≥1,024 tokens routed to a cache-eligible model (gpt-5.4, gpt-5.4-mini, gpt-5.4-turbo, o-series reasoning models, gpt-image-2, gpt-4.1) is automatically hashed in 128-token increments. If a subsequent request shares an exact prefix, the matched portion bills at 0.5x — the 50% discount. There is no `cache_control` parameter, no breakpoint marking, no API surface to control it. You can only verify it happened after the fact via `usage.prompt_tokens_details.cached_tokens` in the response.

**Google — implicit + explicit, two modes.** Implicit caching is automatic and free: Gemini opportunistically caches portions of prompts that look reusable. When implicit cache hits, you pay nothing extra and get the 75% discount on the cached portion. Implicit cache has no guarantees — it might hit, it might not. Explicit `contextCache` is the paid mode: you POST a chunk of content (system prompt + tools + documents + few-shots), get back a cache name, and reference that name in subsequent requests. Explicit cache guarantees the discount, supports up to 24-hour TTL, and bills storage at $1 per million tokens per hour. Minimum cacheable content is 4,096 tokens (significantly higher than Anthropic/OpenAI's 1,024).

The headline implication: Anthropic gives you the most control and the deepest discount but requires deliberate prompt architecture. OpenAI is the lowest-friction option but you have zero levers when it misses. Google is the most flexible (free implicit + paid explicit) and is the only realistic choice for multimodal cache (cached video frames, audio chunks, image arrays).

The Anthropic 90%: when it pays back biggest

The Anthropic discount is the largest in the market and pays back hardest on workloads with long, stable prefixes that get hit repeatedly within the TTL window. The canonical shape: a long system prompt + tool definitions + few-shot examples + retrieved-document context that is identical across calls, with only the user turn changing.

**Worked example — Claude Sonnet 4.6, 20k stable prefix + 500 dynamic input + 200 output:**

Uncached call: 20,500 input tokens × $3/M = $0.0615 input + 200 output tokens × $15/M = $0.003 output = **$0.0645 per call**.

Cached call (prefix hit, 20k cached + 500 dynamic): 20,000 cached input × $0.30/M = $0.006 + 500 dynamic input × $3/M = $0.0015 + 200 output × $15/M = $0.003 = **$0.0105 per call**.

**Savings: 83.7% reduction per call, $0.054 saved per call.** At 100,000 calls per month, that is $5,400 saved monthly versus the uncached path. The first call of each 5-minute TTL window costs 25% extra ($0.077 instead of $0.0645) due to the cache-write surcharge, but the second through Nth calls each save $0.054.

**Breakeven math:** the cache-write surcharge costs you ~$0.015 extra on the first call. You need just 0.28 follow-up cache hits to break even — i.e., if any second request lands within the TTL window, you are already net positive. The economics are absurd in favor of caching for any prefix you reuse more than once.

**Real Anthropic workloads where this dominates:** Claude Code-style agents (20-50k of tool definitions + project context reused on every tool turn), customer-support assistants (10-30k system prompt + brand voice + escalation policies reused per conversation turn), document Q&A over a fixed corpus (50-150k tokens of source documents reused across all user questions), code review agents on a stable repo (15-40k of repo context reused per review).

The OpenAI 50%: smaller discount, zero setup cost

OpenAI's 50% is half the Anthropic discount, but it requires no code changes whatsoever. Any qualifying request — ≥1,024 tokens, on a cache-eligible model — is automatically eligible. The discount applies in 128-token increments above the 1,024 floor: a 5,000-token prefix that matches a previous request will see roughly 5,000 / 128 = 39 cache-eligible chunks, all billed at 0.5x.

The sweet spot: long-running production deployments where the same system prompt is sent within ~5-10 minutes of itself by many concurrent users. Chatbots in active business hours, agentic workflows that fire many requests per minute, batch jobs that process many records with the same instruction prefix. OpenAI's cache also routes per-server (the request must hit the same back-end node for the cached prefix to be available), so high-throughput workloads with traffic spread across many nodes hit cache more often than low-volume sporadic traffic.

**Cache miss happens silently on prefix drift** — and this is the most common failure mode. A single dynamic byte at the top of the prompt (a timestamp injected by the framework, a UUID in the system message, a randomly shuffled few-shot order) drops the cache hit rate to 0%. The OpenAI dashboard surfaces a 'cached tokens' field in usage reports; if your cached_tokens / prompt_tokens ratio is <30%, you have prefix drift. Fix it before scaling.

**Verify cache hits programmatically:** every OpenAI response includes `usage.prompt_tokens_details.cached_tokens`. Log this field. If you're not logging it, you have no way to know whether your cache strategy is working.

OpenAI cache hit vs miss — gpt-5.4, 5k stable + 500 dynamic input

Feature	Scenario	Per-call $
Cache hit (5k prefix matched)	$0.00513	$513
Cache miss (no prefix match)	$0.00825	$825
Weighted average @ 70% hit rate	$0.00607	$607

Math assumes gpt-5.4 at $1.50/M input (cached at $0.75/M = 50% off) and $7.50/M output, with 200 output tokens per call. 70% cache hit rate is realistic for active production prompts with stable prefix architecture; new or low-volume workloads typically see 30-50%. Source: OpenAI pricing page as of June 20, 2026.

The Google 75%: extended TTL + multimodal

Google sits between Anthropic and OpenAI on discount (75% vs 90% / 50%) but offers two structural advantages that make it the best choice for specific workload shapes: extended TTL up to 24 hours, and full multimodal cache support.

**24-hour TTL means daily-stable contexts cache cleanly.** A knowledge-base assistant that loads the company wiki at the start of the day can pay storage for 24 hours and serve every query against it at the 75% discount. Compare against Anthropic's 1-hour max — you'd need to re-write the cache 24 times per day on Anthropic versus once on Google. For very large stable contexts (>500k tokens), the storage savings from longer TTL frequently outweigh the per-token discount difference.

**Multimodal cache is unique to Google.** You can cache video clips, image arrays, audio chunks, and PDFs alongside text. For video-QA applications (caching a movie or a multi-hour security feed once and querying it many times), document analysis applications (caching a 1,000-page legal corpus with embedded figures), or audio transcription workflows (caching long podcasts or meetings), Google's multimodal context cache is structurally the only economical option. Anthropic and OpenAI both cache vision inputs in principle but the workflow ergonomics around caching long video or audio assets are weak.

**Storage cost math:** Google's $1/M tokens/hour storage means a 100k-token cached context costs $0.10/hour to keep warm. Over 24 hours that's $2.40 per cached context per day. If you serve >50 queries/day against that context at typical request shapes, the storage cost is a rounding error vs the per-query discount; if you serve <10 queries/day, explicit caching loses to implicit-only mode.

**Implicit cache is the underrated free lunch.** You get it whether you ask for it or not — Gemini automatically caches portions of incoming prompts it recognizes as reusable patterns, and applies the 75% discount on hits without billing storage. For workloads where you can't justify explicit cache management, implicit cache still typically saves 15-35% on input bills with zero code changes.

The 4 anti-patterns that silently disable cache

**Anti-pattern 1 — timestamp or UUID in system prompt.** The single most common failure. A framework injects `Current time: 2026-06-20T15:42:17.483Z` or `Session ID: 7f3e8a-...` at the top of the system message. The hash differs per request. Cache hit rate collapses to 0%. Audit your framework's default system-message template before deploying. If you need timestamps in the prompt, put them BELOW the cache breakpoint or in the user turn.

**Anti-pattern 2 — dynamic few-shot example shuffling.** Some prompt frameworks randomize the order of few-shot examples per request to reduce position bias. This makes the prompt prefix different on every call. The fix: pick a fixed canonical ordering at deploy time, freeze it, and live with whatever position bias remains. The cache savings dwarf the position-bias penalty for any production workload.

**Anti-pattern 3 — tool definition reordering or dynamic injection.** Agentic frameworks sometimes filter the tool list based on user role, recent usage, or RAG-retrieved relevance. Every variation produces a different cache key. The fix: send the full canonical tool list to everyone (the model can ignore irrelevant tools), or partition users into a small number of fixed cohorts each with its own stable tool list.

**Anti-pattern 4 — user content prepended to system prompt.** Some teams inject user-specific data ('User name: Alice. Account tier: Pro. Last seen: 2 hours ago.') at the top of the system message for personalization. This creates per-user cache fragmentation: every user gets their own cache entry, and unless individual users repeatedly hit the API within the TTL window, none of them benefit. The fix: put personalization in the user turn, or at the END of the system prompt below the cache breakpoint.

**How to detect:** instrument every response with the provider's cache-hit field (`cache_read_input_tokens` on Anthropic, `cached_tokens` on OpenAI, `cachedContentTokenCount` on Google). Plot the ratio of cached / total input tokens over time. If your prefix architecture is right, that ratio should stabilize above 60% for production workloads within ~7 days of deployment. If it's stuck below 30%, you have one of the four anti-patterns above.

Cache breakpoint strategy on Anthropic (up to 4 marks)

Anthropic's 4-breakpoint model is the most powerful caching primitive in the industry — and the most under-used. The strategy: layer your prompt from most-stable to least-stable, mark each layer's end with a breakpoint, and let each layer cache independently with its own implicit TTL.

**Canonical 4-layer architecture:**

Layer 1 (system prompt, ~5-15k tokens): brand voice, role, behavioral rules, refusal policies. Changes monthly at most. Mark with cache_control + 1h TTL.

Layer 2 (tool definitions, ~5-30k tokens): the full JSON schema for every tool the agent can call. Changes only when you ship a new tool. Mark with cache_control + 1h TTL.

Layer 3 (few-shot examples + retrieved corpus, ~10-100k tokens): worked examples, RAG-retrieved documents, KB snippets. Changes per session or per topic cluster. Mark with cache_control + 5min TTL (default).

Layer 4 (conversation trajectory, variable): the message history of the current conversation. Grows turn-by-turn. Mark the end of each completed turn with cache_control so the trajectory caches incrementally as the conversation progresses.

**Why this matters:** when a request comes in, Anthropic walks the cache from longest matching prefix backwards. A new conversation hits Layer 1 + 2 + 3 cached and only the user turn (Layer 4) is fresh — 95%+ of tokens at 0.1x. A continuing conversation hits Layer 1 + 2 + 3 + most of Layer 4 cached — 98%+ of tokens at 0.1x. The architecture compounds savings across the entire trajectory.

**Common mistake:** marking only the system prompt and letting tool definitions + corpus re-bill at full price every call. If you have 30k of tool definitions and you're not caching them, you're leaving more than half your potential cache savings on the table.

Cost of cache writes — usually overlooked

Cache writes are not free. Anthropic charges 1.25x base input rate the first time a cache key is written (this covers both compute and storage for the TTL window). On Sonnet 4.6 at $3/M base input, cache writes bill at $3.75/M. On Opus 4.8 at $15/M base input, cache writes bill at $18.75/M.

**The first call is therefore MORE expensive than the uncached baseline.** Reuse from call 2 onward is where the savings come from. Breakeven math for the 1.25x write surcharge: you pay an extra 0.25x on call 1, and save 0.9x on each subsequent call. Breakeven = 0.25 / 0.9 = 0.28 — i.e., if your cache hits even once after the write, you are already net positive. The economics demand >1 reuse to win, but the bar is genuinely low.

**Where this bites:** workloads with cache writes that never get reused. Examples include: per-user system prompts that fragment cache across thousands of users (most users never make a second call within 5 minutes), randomized few-shot orders that change the cache key on every request, A/B testing system prompts where each variant gets traffic too sparse to amortize the write.

**Google's storage model:** explicit contextCache bills $1 per million tokens per hour of storage. A 100k-token explicit cache costs $0.10/hour, or $2.40 for a 24-hour TTL. To make that economical you need enough queries against the cache to amortize the storage. At Gemini 2.5 Pro's ~$1.25/M input rate, 75% off saves you ~$0.094 per 100k cached tokens read. You need roughly 26 reads per hour (or ~625 reads per 24 hours) to break even on storage. High-traffic shared contexts (a customer-support bot serving thousands of users) easily clear that bar; low-traffic per-user contexts do not.

**OpenAI has no write cost.** This is one of OpenAI's structural advantages — there's no first-call penalty, no break-even math, no storage line item. The 50% discount is just smaller than what Anthropic and Google offer.

Workloads where caching is mandatory (not optional)

**Agents with long tool definitions.** Claude Code, Cursor Agent, autonomous research agents, anything that loops through tool calls. The system prompt + tool schema + conversation trajectory often runs 30-80k tokens that are nearly identical between consecutive turns. Without caching, agent workloads frequently spend $10-50 per agentic task; with caching, the same tasks land at $1-5. This is the single largest cost differential in 2026 LLM use. Walk-through in our agent loop cost optimization guide.

**RAG over stable corpora.** Loading a 100k-token knowledge base once and answering 500 questions against it for 12 hours. Without caching: 500 × 100k × $3/M = $150 in input alone. With Anthropic cache + 1h TTL (12 cache writes through the day): ~$15. With Google 24h cache: ~$8.

**Customer-support assistants with brand prompts.** 10-30k tokens of brand voice, escalation policies, product knowledge reused across every conversation. Without caching, you'd pay $0.05-0.15 per turn just on the system prompt; with caching, $0.005-0.015. Across millions of conversations per month, this is the difference between a $300k/month bill and a $30k/month bill.

**Code review agents on stable repos.** Repo context (file tree, dependency graph, README, conventions) often hits 20-50k tokens that don't change between reviews on the same repo. Caching this layer brings code-review costs in line with developer-tool budgets ($0.05-0.20 per review vs $1-3 uncached).

**Document Q&A.** Legal review on a 500-page contract. Medical research over a set of clinical trial papers. Academic Q&A over a textbook. Caching the document once and querying it 50 times costs less than uncached querying twice. The 90%/75% discounts make these workloads economically viable that wouldn't be otherwise.

Workloads where caching doesn't help

**Single-shot generations with novel prompts.** One-off content creation, ad-hoc 'write me a blog post about X' calls, brainstorming sessions where every prompt is different. There's no second call to amortize the cache write against. Don't bother — the cache-write surcharge actively costs you money on workloads with no reuse.

**Dynamic per-user personalization that breaks prefix.** If your product injects user-specific data at the top of every prompt and most users don't make repeat calls within the TTL window, caching fragments across users and never reuses. Either restructure to put personalization at the bottom (so the upper layers can cache across users), or accept that caching won't help this workload.

**Image generation prompts.** Midjourney, DALL·E 3, FLUX, gpt-image-2, and similar text-to-image services don't expose prompt caching the way text LLMs do. Image generation is dominated by GPU inference cost on the image side, not input token cost on the prompt side. The economics are different.

**Very short prompts (<1,024 tokens).** Both Anthropic and OpenAI require ≥1,024 tokens to be cacheable; Google requires ≥4,096. If your typical prompt is shorter than the minimum, caching is mechanically unavailable. Note that 'short' here means including the full system prompt + tools + history, so most production prompts clear the bar even if the user message is brief.

**Embedding generation.** Embedding APIs (Anthropic Voyage, OpenAI text-embedding-3-large, Google text-embedding-005) don't have prompt caching. Each embedding request is a fresh forward pass. Optimize embedding costs by deduplicating inputs and caching outputs in your own vector store, not by trying to use provider prompt cache.

Multi-provider caching strategy

Production teams in 2026 increasingly route different workloads to different providers based on caching mechanics, not just per-token price. The decision tree we see most often:

**Anthropic for agentic workloads.** The 90% discount + 4-breakpoint architecture + 1-hour TTL + per-turn incremental caching is uniquely suited to the agent loop pattern. Claude Sonnet 4.6 + aggressive caching is the cost floor for autonomous agents in 2026. Anthropic also publishes the most generous trial-and-experimentation terms for caching (see our Claude pricing 2026 breakdown).

**Google for long-document and multimodal.** Gemini 2.5 Pro with 24-hour explicit contextCache + multimodal cache is the only economical path for video QA, long-document analysis (>200k token corpora reused across many questions), and audio transcription with reused context. The 24-hour TTL means once-per-day cache rewrites instead of once-per-hour.

**OpenAI for catch-all and zero-setup workloads.** The automatic prefix cache + 50% discount + no surcharge + no breakpoint management makes OpenAI the default for teams that don't want to manage cache architecture explicitly. Smaller discount but lower operational complexity. Also the right choice for OpenAI-only model dependencies (Sora 2 video, gpt-image-2 generation, o-series reasoning) where caching ergonomics aren't the deciding factor.

**Cross-provider routing in practice:** real production stacks now route by workload shape. An e-commerce assistant might use Anthropic Sonnet for agentic order-status investigations (long stable prefix per session), Google for product-catalog-QA over a 500k-token cached catalog (24-hour TTL), and OpenAI for one-off marketing copy (no caching benefit anyway). Total bill drops 60-80% vs single-provider all-uncached, with zero quality compromise.

**The meta-skill:** structuring prompts so they're portable across providers. Stable prefix on top, tool definitions in a fixed canonical order, dynamic content at the bottom, no timestamps or UUIDs anywhere in the upper layers. A prompt structured this way caches well on all three providers — you keep optionality without lock-in.

Cache savings checklist — 7 actions

1
Audit your prompts for stable-prefix shape
Pull a representative sample of 100 production prompts. Diff their first 5k tokens. If >10% of the prefix bytes differ across the sample, you have prefix drift and caching will not work until you fix it. Identify the moving parts — timestamps, UUIDs, randomized examples, per-user personalization — and flag each for restructuring.
2
Move stable content to the TOP of the prompt
System prompt, tool definitions, few-shot examples, retrieved corpus all go above the user turn. Personalization, dynamic context, and the user's actual request go below. This is the universal precondition for caching across all three providers — both OpenAI's automatic prefix cache and Anthropic's explicit breakpoints rely on it.
3
Strip timestamps and UUIDs from system prompt
Audit your framework's default system message template. If LangChain, LlamaIndex, your own wrapper, or a third-party agent framework is injecting a current-time or session-id field at the top, override it. If you genuinely need timestamps in the prompt, put them in the user turn instead — they won't break cache there.
4
Add Anthropic cache_control breakpoints (system + tools + examples)
On Anthropic, mark the end of your system prompt, the end of your tool definitions block, and the end of your few-shot examples with cache_control: {type: 'ephemeral'} (use ttl: '1h' for the system + tools layers if your traffic warrants extended TTL). Up to 4 breakpoints total. Each layer caches independently and you get layered savings.
5
Verify cache hits programmatically
Log usage.cache_read_input_tokens (Anthropic), usage.prompt_tokens_details.cached_tokens (OpenAI), or usageMetadata.cachedContentTokenCount (Google) on every response. Track the cached-tokens / total-input-tokens ratio in your observability stack. If the ratio stays below 30% after 7 days of production traffic, you have prefix drift — return to step 1.
6
Set extended TTL for daily-stable contexts
Anthropic: opt into 1-hour TTL for stable system prompts and tool definitions (ttl: '1h'). Google: use explicit contextCache with up to 24-hour TTL for knowledge bases and corpora that change daily at most. The longer TTL means fewer cache writes per day and dramatically better economics for low-frequency-but-recurring traffic.
7
Re-measure monthly cost after 7 days of cache hit data
Pull pre-caching and post-caching billing data side by side. Calculate effective discount = (cached_tokens × discount_rate) / total_tokens. Realistic targets after a clean implementation: Anthropic 60-80% reduction on input costs, Google 50-70%, OpenAI 25-40%. If you're falling short, the most likely culprits are anti-pattern 1 (timestamps) or 4 (per-user prefix fragmentation).

Digital Dashboard Hub

The prompt patterns above work 10x better when they live in a library you actually own — tunable to your niche, exportable to GPT-5, Claude, Gemini, Perplexity, Midjourney, Llama. Stop pasting across 6 tools.

Try DDH's AI Prompt Builder — free 14 days, no card. →

Related calculators

OpenAI Pricing Calculator →GPT-5.5, 5.4, mini, nano — full per-call cost in one input.Claude Pricing Calculator →Opus 4.8, Sonnet 4.6, Haiku 4.5, Fable 5 — input + output combined.Context Window Comparison →Max input length and price per 1M for every current model.

Related prompt tools

Claude pricing 2026 (full breakdown)→OpenAI API pricing 2026→Agent loop cost optimization guide→AI cost trends — 2026 quarterly→Cache-anchored prompt generator→

Frequently Asked Questions

What is prompt caching and why does it save money?

Prompt caching is a provider-side mechanism that stores a hash of your prompt prefix on first request and reuses the cached compute when subsequent requests share that prefix. On a cache hit, you pay a fraction of the normal input-token rate — 10% on Anthropic (90% off), 50% on OpenAI, 25% on Google (75% off). Because most production LLM workloads send the same system prompt + tools + context with only small variations in the user turn, caching often eliminates 60-80% of input billing without changing model behavior or output quality.

Which provider has the best caching discount?

By raw discount, Anthropic at 90% off cached input tokens. But the right answer depends on workload shape. Anthropic's 90% requires explicit breakpoint marking and has a 1.25x first-call surcharge; it pays back hardest on agentic workloads with many reuses. Google's 75% with 24-hour TTL wins for long-document and multimodal workloads. OpenAI's 50% is the lowest discount but requires zero setup and has no write surcharge — it's the right answer for teams that don't want to manage caching architecture explicitly.

Does OpenAI prompt caching require code changes?

No. OpenAI's prompt cache is fully automatic for any request ≥1,024 tokens on a cache-eligible model (gpt-5.4, gpt-5.4-mini, gpt-5.4-turbo, o-series reasoning models, gpt-image-2, gpt-4.1). The 50% discount applies on cache hits without any cache_control parameter or breakpoint marking. The only code change recommended is to log usage.prompt_tokens_details.cached_tokens on each response so you can verify hits and detect prefix drift early.

What's the minimum prompt size to cache?

Anthropic and OpenAI both require ≥1,024 tokens to be cache-eligible. Google requires ≥4,096 tokens for explicit contextCache (implicit cache has no published minimum). Most production prompts clear these bars easily once you count the full system prompt + tool definitions + history. If your typical request is shorter than the minimum, caching is mechanically unavailable on that provider for that workload.

How do I know if my prompts are hitting the cache?

Every provider exposes cache-hit data in the response. Anthropic returns usage.cache_creation_input_tokens (tokens written to cache) and usage.cache_read_input_tokens (tokens served from cache). OpenAI returns usage.prompt_tokens_details.cached_tokens. Google returns usageMetadata.cachedContentTokenCount. Track these fields in your observability stack. Calculate the cached / total input token ratio over time. A healthy production workload with stable prefix architecture should show 60%+ cache hit rate within 7 days of deployment.

Does caching work for multimodal (image/video) inputs?

Yes, on all three providers, but Google has the most comprehensive support. Google caches images, video frames, and audio chunks via explicit contextCache — unique for video-QA and audio analysis workloads. Anthropic caches vision inputs as part of the prompt prefix. OpenAI caches vision inputs and gpt-image-2 prompt prefixes. For pure multimodal cache workloads (cache a 2-hour video once, query it 100 times), Google is the only realistic choice today.

Can I cache across providers — e.g. one cached prompt on Anthropic + OpenAI?

No. Each provider's cache is internal to that provider's infrastructure — caches are not shared across vendors, and a prompt cached on Anthropic produces no benefit when sent to OpenAI or Google (and vice versa). What you CAN do is structure your prompts so the same canonical prefix shape caches well on whichever provider you route to. A prompt with stable system + tools + corpus on top and dynamic content at the bottom hits cache well on all three providers individually — useful when you route different workloads to different providers based on shape.

Caching only works when prompts are structured for it.

Our AI Prompt Generator writes cache-anchored prompts (stable prefix, dynamic suffix, provider-tuned breakpoints) based on YOUR business + task. 14-day free trial.

Browse all prompt tools →