How Anthropic Prompt Caching Works Under the Hood
When you mark a content block with cache_control, Anthropic's infrastructure stores a key-value snapshot of the KV (key-value) cache state at that position in the prompt. On the next API call, if the prefix up to that breakpoint is byte-for-byte identical, the model skips recomputing those tokens and reads from the stored snapshot instead. The result is the same as if you had sent those tokens normally — the model behavior is unchanged — but you pay the cache read rate instead of the standard input rate.
The cache is keyed on the exact token sequence up to the breakpoint. A single character difference — a trailing space, a changed timestamp, a reordered tool definition — causes a cache miss and you pay the standard write rate again. This is the most common source of unexpected cache misses in production: dynamic content mixed into an otherwise-static prefix.
Anthropic supports up to four cache_control breakpoints per request. Each breakpoint checkpoints the prefix at that position. You can nest them: for example, checkpoint after the system prompt, again after the tool list, and again after a retrieved document. On a cache hit, you pay the read rate for everything up to the last matching breakpoint, plus standard rates for any tokens after it.