Layer 1 — Provider-side prompt caching
**Mechanic:** Mark a prompt prefix (system prompt + few-shot examples + long context) as cacheable. The provider stores the computed state on their infrastructure. Subsequent requests with the same prefix hit the cache and skip the redundant compute.
**Anthropic's implementation:** Per Anthropic's prompt caching docs at docs.anthropic.com, cache writes cost 25% more than base input tokens; cache reads cost 10% of base input tokens. Cache TTL: 5 minutes (refreshes on each hit) or 1 hour (separate price tier). Break-even: 2-3 cache hits.
**OpenAI's implementation:** Per OpenAI's prompt caching at platform.openai.com, automatic caching for prompts ≥1024 tokens. Cached tokens cost ~50% of normal input. No explicit cache mark needed; the API caches identical prefixes automatically.
**Google's Gemini context caching:** Per Google's caching docs at ai.google.dev, explicit cache creation via API with minimum 4,096 tokens. Cached tokens at ~25% of normal input. Configurable TTL.
**Best for:** Workloads with long static system prompts + few-shot examples + repetitive large contexts (RAG retrievals that appear in many queries). Typical savings: 60-80% on input cost for cache-eligible workloads.