How each provider's cache actually works
**Anthropic — explicit, opt-in, you choose what gets cached.** You insert a `cache_control: {type: 'ephemeral'}` block at one of up to four breakpoints in your messages array. Everything *before* the breakpoint is hashed, written to cache on first request, and reused on subsequent requests within the TTL window. The 5-minute default TTL is automatic; the 1-hour extended TTL is opt-in with `ttl: '1h'` and doubles the storage cost (which is baked into the 1.25x cache-write surcharge applied at write time, not separately billed). Cached input reads at 0.1x base input rate — the headline 90% discount.
**OpenAI — automatic, no opt-in, exact prefix match.** Any prompt ≥1,024 tokens routed to a cache-eligible model (gpt-5.4, gpt-5.4-mini, gpt-5.4-turbo, o-series reasoning models, gpt-image-2, gpt-4.1) is automatically hashed in 128-token increments. If a subsequent request shares an exact prefix, the matched portion bills at 0.5x — the 50% discount. There is no `cache_control` parameter, no breakpoint marking, no API surface to control it. You can only verify it happened after the fact via `usage.prompt_tokens_details.cached_tokens` in the response.
**Google — implicit + explicit, two modes.** Implicit caching is automatic and free: Gemini opportunistically caches portions of prompts that look reusable. When implicit cache hits, you pay nothing extra and get the 75% discount on the cached portion. Implicit cache has no guarantees — it might hit, it might not. Explicit `contextCache` is the paid mode: you POST a chunk of content (system prompt + tools + documents + few-shots), get back a cache name, and reference that name in subsequent requests. Explicit cache guarantees the discount, supports up to 24-hour TTL, and bills storage at $1 per million tokens per hour. Minimum cacheable content is 4,096 tokens (significantly higher than Anthropic/OpenAI's 1,024).
The headline implication: Anthropic gives you the most control and the deepest discount but requires deliberate prompt architecture. OpenAI is the lowest-friction option but you have zero levers when it misses. Google is the most flexible (free implicit + paid explicit) and is the only realistic choice for multimodal cache (cached video frames, audio chunks, image arrays).