What is prompt caching and how does it save money?
When you send a prompt, the model processes every input token before it generates a single output token. If your prompt begins with the same 4,000-token system instruction and reference document on every call, you pay to reprocess those same tokens every single time. Prompt caching breaks that cycle: the provider stores the processed prefix and, on later calls that start with the identical prefix, reuses it instead of recomputing it.
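To make the "identical prefix" requirement concrete, here is a minimal Python sketch of laying out a prompt so that the unchanging material comes first and the per-call material comes last. It calls no provider API. The Acme text and the helper names (`stable_prefix`, `build_prompt`, `prefix_fingerprint`) are hypothetical and exist only for illustration. Real providers match prefixes on tokens and have their own rules, so treat the string hash here as a stand-in for that comparison.

```python
import hashlib

# Hypothetical stand-ins for a long system instruction and reference document.
SYSTEM_INSTRUCTION = "You are a support assistant for Acme. Answer from the manual only."
REFERENCE_DOC = "Acme manual, section 1: (long, unchanging reference text goes here)"


def stable_prefix() -> str:
    # Everything that is identical on every call goes first.
    return SYSTEM_INSTRUCTION + "\n\n" + REFERENCE_DOC + "\n\n"


def build_prompt(user_question: str) -> str:
    # The per-call part goes last so it never disturbs the prefix.
    return stable_prefix() + "Question: " + user_question


def prefix_fingerprint(prompt: str, prefix_len: int) -> str:
    # Hash the first prefix_len characters to compare prefixes across calls.
    return hashlib.sha256(prompt[:prefix_len].encode("utf-8")).hexdigest()


prefix_len = len(stable_prefix())
a = build_prompt("How do I reset my password?")
b = build_prompt("What is the refund window?")

# Two different questions still share the same leading text.
same_prefix = prefix_fingerprint(a, prefix_len) == prefix_fingerprint(b, prefix_len)

# Counter-example: a per-call value placed at the front changes the prefix.
c = "Request 1\n" + a
d = "Request 2\n" + b
broken_prefix = prefix_fingerprint(c, prefix_len) == prefix_fingerprint(d, prefix_len)

print("stable layout shares a prefix:", same_prefix)
print("per-call value up front shares a prefix:", broken_prefix)
```

The design point is ordering: anything that varies per request (IDs, timestamps, the user's question) belongs after the stable block, because a change early in the prompt means the text that follows no longer matches the stored prefix.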
The savings come from two places. First, cached input tokens bill at a lower rate than standard input tokens. The exact discount varies by provider, so check the live cached-read rate on the Anthropic pricing page or the OpenAI pricing page. Some providers also charge a premium on the call that writes the cache entry, so the savings only start once the prefix is actually reused. Second, you cut latency, because the model skips the work of reprocessing the prefix. For high-volume apps with long, repeated prompts, input tokens are often the largest part of the bill, so that is where caching saves the most.
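A rough worked example helps show the size of the effect. The sketch below uses the 4,000-token prefix from above, but every rate and volume in it is a made-up placeholder rather than a real price: a base rate of $3 per million input tokens, cached reads at 10% of that, 200 variable tokens per call, and 10,000 calls. It also assumes every call after the first hits the cache. Substitute the live numbers from your provider's pricing page.

```python
# All rates and volumes below are hypothetical placeholders, not real prices.
BASE_RATE_PER_MTOK = 3.00       # dollars per million standard input tokens (assumed)
CACHED_READ_MULTIPLIER = 0.10   # cached reads billed at 10% of base (assumed)
CACHE_WRITE_MULTIPLIER = 1.00   # raise above 1.0 if your provider charges a write premium

PREFIX_TOKENS = 4_000           # stable system instruction + reference document
VARIABLE_TOKENS = 200           # per-call part of the prompt (assumed)
CALLS = 10_000                  # assumed volume, all within the cache lifetime


def dollars(tokens: float, multiplier: float = 1.0) -> float:
    """Convert a token count to dollars at the base rate times a multiplier."""
    return tokens / 1_000_000 * BASE_RATE_PER_MTOK * multiplier


# Without caching, every call pays full price for prefix and variable tokens.
cost_without = dollars((PREFIX_TOKENS + VARIABLE_TOKENS) * CALLS)

# With caching: the first call writes the prefix, later calls read it at the
# cached rate, and the variable tokens always bill at the standard rate.
cost_with = (
    dollars(PREFIX_TOKENS, CACHE_WRITE_MULTIPLIER)
    + dollars(PREFIX_TOKENS * (CALLS - 1), CACHED_READ_MULTIPLIER)
    + dollars(VARIABLE_TOKENS * CALLS)
)

print(f"input cost without caching: ${cost_without:.2f}")
print(f"input cost with caching:    ${cost_with:.2f}")
print(f"reduction: {1 - cost_with / cost_without:.0%}")
```

Under these placeholder numbers the input bill drops from about $126 to about $18. The real figure depends on your provider's rates, your hit rate, and how much of each prompt is actually stable.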
The catch: caching only helps when the prefix is genuinely stable and reused within the cache's lifetime. A cache entry has a limited time-to-live, often a matter of minutes, though the exact value depends on the provider. If calls are too far apart, the entry expires and you pay full price again. That makes prompt caching ideal for bursty, repeated workloads and less useful for one-off requests.
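The effect of call spacing can be seen in a small simulation. This is a toy model, not any provider's actual eviction logic. It assumes a single cached prefix, a hypothetical 300-second time-to-live, and a lifetime that restarts on every hit (the `refresh_on_hit` flag, which you can switch off if your provider behaves differently).

```python
from typing import Iterable, Tuple


def simulate(call_times: Iterable[float], ttl_seconds: float,
             refresh_on_hit: bool = True) -> Tuple[int, int]:
    """Return (hits, misses) for one cached prefix under a simple TTL model."""
    hits = 0
    misses = 0
    expires_at = None  # no cache entry exists before the first call
    for t in call_times:
        if expires_at is not None and t < expires_at:
            hits += 1
            if refresh_on_hit:
                # Assumption: each hit restarts the entry's lifetime.
                expires_at = t + ttl_seconds
        else:
            # Entry missing or expired: pay full price and write a new entry.
            misses += 1
            expires_at = t + ttl_seconds
    return hits, misses


TTL = 300  # hypothetical time-to-live in seconds

bursty = [0, 10, 20, 30, 40, 50]            # six calls within a minute
sparse = [0, 600, 1200, 1800, 2400, 3000]   # six calls, ten minutes apart

bursty_result = simulate(bursty, TTL)
sparse_result = simulate(sparse, TTL)

print("bursty (hits, misses):", bursty_result)
print("sparse (hits, misses):", sparse_result)
```

With these inputs the bursty workload gets five hits out of six calls, while the sparse workload misses every time, because each call arrives after the previous entry has already expired.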