By The DDH Team · Digital Dashboard Hub

Prompt Caching Tutorial Anthropic

A complete, technical walkthrough of Anthropic's prompt caching system — cache_control breakpoints, TTL options, minimum token counts, pricing math, and production-ready Python code. This is the runbook, not the marketing overview.


Prompt caching is the single highest-ROI optimization available on the Anthropic API today. Cache reads cost 0.1x the standard input token price — a flat 90% discount — on every token inside a cached prefix. If your application sends a system prompt, tool definitions, a long document, or a set of few-shot examples on every call, you are paying full price for tokens that could cost one-tenth as much.

This tutorial covers every detail you need to ship caching in production: the cache_control syntax, the two available TTL tiers, per-model minimum token requirements, the cache write surcharge, the order-of-prefix rule, a worked dollar-savings example for an agent loop, and the five pitfalls that cause cache misses in real code. Before you read further, open our AI Prompt Cost Calculator in a new tab — it lets you paste your current monthly token volume and see the before/after cost across every Claude model.

If you are working through a broader cost reduction program, start with the AI cost optimization checklist for 2026, which ranks prompt caching as item 1 out of 17 by savings-to-effort ratio. For per-model price lookup, see Anthropic Claude pricing 2026.


Prompt caching quick-reference: limits, pricing, and TTL by model

| Model | Min cacheable tokens | Cache write cost | Cache read cost | Default TTL | Extended TTL |
|---|---|---|---|---|---|
| Claude Opus 4.8 | 1,024 tokens | 1.25x standard input | 0.1x standard input (90% off) | 5 minutes | 1 hour (ephemeral extended) |
| Claude Sonnet 4.6 | 1,024 tokens | 1.25x standard input | 0.1x standard input (90% off) | 5 minutes | 1 hour (ephemeral extended) |
| Claude Haiku 4.5 | 2,048 tokens | 1.25x standard input | 0.1x standard input (90% off) | 5 minutes | 1 hour (ephemeral extended) |

Pricing sourced from docs.anthropic.com/en/docs/build-with-claude/prompt-caching and platform.anthropic.com as of June 2026. Cache write cost is charged once per TTL window. Cache reads are charged at 0.1x for every subsequent call that hits the same cached prefix.

How Anthropic Prompt Caching Works Under the Hood

When you mark a content block with cache_control, Anthropic's infrastructure stores a snapshot of the model's KV (key-value) cache state at that position in the prompt. On the next API call, if the prefix up to that breakpoint is byte-for-byte identical, the model skips recomputing those tokens and reads from the stored snapshot instead. The output is the same as if you had sent those tokens normally, so model behavior is unchanged, but you pay the cache read rate instead of the standard input rate.

The cache is keyed on the exact token sequence up to the breakpoint. A single character difference (a trailing space, a changed timestamp, a reordered tool definition) causes a cache miss, and you pay the 1.25x cache write rate again. This is the most common source of unexpected cache misses in production: dynamic content mixed into an otherwise-static prefix.

Anthropic supports up to four cache_control breakpoints per request. Each breakpoint checkpoints the prefix at that position. You can nest them: for example, checkpoint after the system prompt, again after the tool list, and again after a retrieved document. On a cache hit, you pay the read rate for everything up to the last matching breakpoint, plus standard rates for any tokens after it.
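One way to guard against exceeding the four-breakpoint limit is to count breakpoints before sending a request. The sketch below is illustrative only: count_breakpoints is a hypothetical helper, not part of the anthropic SDK, and it assumes the request pieces are held as plain dicts in the shape used elsewhere in this tutorial.

```python
from typing import Any

MAX_BREAKPOINTS = 4  # limit stated in this tutorial


def count_breakpoints(tools: list[dict[str, Any]],
                      system: list[dict[str, Any]],
                      messages: list[dict[str, Any]]) -> int:
    """Count blocks carrying a cache_control field across tools, system, and messages."""
    count = sum(1 for tool in tools if "cache_control" in tool)
    count += sum(1 for block in system if "cache_control" in block)
    for message in messages:
        content = message.get("content")
        # Plain-string content cannot carry cache_control; only block lists can.
        if isinstance(content, list):
            count += sum(1 for block in content if "cache_control" in block)
    return count


example_system = [{"type": "text", "text": "stable instructions",
                   "cache_control": {"type": "ephemeral"}}]
example_messages = [{"role": "user", "content": [
    {"type": "text", "text": "pinned document", "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "What changed in section 2?"},
]}]

used = count_breakpoints([], example_system, example_messages)
assert used <= MAX_BREAKPOINTS, f"{used} breakpoints exceeds the limit"
print(used)
```

Running a check like this in the request-building path catches an accidental fifth breakpoint before it reaches the API.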


The cache_control Syntax: Ephemeral Type and Breakpoint Placement

The cache_control field is a JSON object whose type key is currently always "ephemeral". It is attached to a content block: a text block in the system field, a content block inside the messages array, or a tool definition in the tools list. The ephemeral type tells Anthropic to cache the prefix ending at that block with the default 5-minute TTL, unless you also request the extended 1-hour TTL (covered in the TTL section below).

Here is the minimal Python example using the anthropic SDK. It caches a large system prompt and a retrieved document, then sends a short user question:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a helpful assistant... (imagine 2,000 tokens of stable instructions here)"""
DOCUMENT = """(imagine a 5,000-token PDF extract here)"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": DOCUMENT,
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "type": "text",
                    "text": "Summarize the key findings in three bullet points."
                }
            ]
        }
    ]
)

print(response.usage)  # shows cache_creation_input_tokens vs cache_read_input_tokens
```

The response.usage object contains three keys you care about: input_tokens (tokens not cached), cache_creation_input_tokens (tokens written to cache this call), and cache_read_input_tokens (tokens read from cache, billed at 0.1x). On the first call you see cache_creation_input_tokens > 0 and pay 1.25x for those. On every subsequent call within the TTL window, cache_read_input_tokens is populated and you pay 0.1x. Log these values in production to confirm your cache hit rate.
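To turn those three fields into a dollar figure, the multipliers can be applied directly. This is a minimal sketch: the Usage dataclass is a stand-in for response.usage so the example runs without an API key, input_cost_usd is a hypothetical helper, and the $3.00 per million price is the Sonnet figure assumed elsewhere in this tutorial.

```python
from dataclasses import dataclass

WRITE_MULTIPLIER = 1.25  # cache write surcharge from this tutorial
READ_MULTIPLIER = 0.10   # cache read discount from this tutorial


@dataclass
class Usage:
    """Stand-in for response.usage with the three input-side fields."""
    input_tokens: int
    cache_creation_input_tokens: int
    cache_read_input_tokens: int


def input_cost_usd(usage: Usage, price_per_mtok: float) -> float:
    """Input-side cost of one call, given the base input price per million tokens."""
    weighted_tokens = (
        usage.input_tokens
        + usage.cache_creation_input_tokens * WRITE_MULTIPLIER
        + usage.cache_read_input_tokens * READ_MULTIPLIER
    )
    return weighted_tokens * price_per_mtok / 1_000_000


# First call writes 7,000 tokens to the cache; a later call reads them back.
first_call = Usage(input_tokens=20, cache_creation_input_tokens=7000, cache_read_input_tokens=0)
later_call = Usage(input_tokens=20, cache_creation_input_tokens=0, cache_read_input_tokens=7000)

print(f"first call: ${input_cost_usd(first_call, 3.00):.5f}")
print(f"later call: ${input_cost_usd(later_call, 3.00):.5f}")
```

Output tokens are left out on purpose, since caching does not change how they are billed.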


Minimum Cacheable Tokens: Why Your Small Prompts Miss

Anthropic enforces a minimum prefix length before the cache_control breakpoint: 1,024 tokens for Claude Opus 4.8 and Claude Sonnet 4.6, and 2,048 tokens for Claude Haiku 4.5. If the content before your breakpoint is shorter than this threshold, Anthropic silently ignores the cache_control field — no error is returned, and cache_creation_input_tokens stays at zero.

This catches teams off guard when they test caching with a short system prompt and see no cache savings. The fix is to check your usage object after the first call. If cache_creation_input_tokens is 0 and you set cache_control, your prefix is below the minimum. Options: expand your system prompt to meet the threshold, combine your system prompt with your tool definitions into a single cached block, or pre-load a few-shot example block to push the total over the limit.

The minimum is measured in tokens, not characters. A 1,024-token English prefix is roughly 800-900 words. If your system prompt is shorter, adding even a few stable paragraphs of context — your product description, a style guide, a personas section — gets you over the line and starts generating cache savings immediately.
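The silent-ignore behavior can be turned into an explicit check. The sketch below uses the thresholds listed in this tutorial; the helper names are hypothetical, and the model ID strings follow the pattern of the IDs used in this tutorial's examples ("claude-haiku-4-5" is an assumption by analogy).

```python
MIN_CACHEABLE_TOKENS = {  # thresholds as listed in this tutorial
    "claude-opus-4-8": 1024,
    "claude-sonnet-4-6": 1024,
    "claude-haiku-4-5": 2048,  # assumed ID string
}


def cache_was_ignored(cache_creation_tokens: int, cache_read_tokens: int) -> bool:
    """True when a call that set cache_control neither wrote nor read any cache."""
    return cache_creation_tokens == 0 and cache_read_tokens == 0


def tokens_short_of_minimum(model: str, prefix_tokens: int) -> int:
    """How many more prefix tokens are needed before caching can apply (0 if none)."""
    return max(0, MIN_CACHEABLE_TOKENS[model] - prefix_tokens)


# A 1,500-token prefix clears the Sonnet threshold but not the Haiku one.
print(tokens_short_of_minimum("claude-sonnet-4-6", 1500))
print(tokens_short_of_minimum("claude-haiku-4-5", 1500))
```

Checking both usage fields matters: a zero in cache_creation_input_tokens alone is also what a normal cache hit looks like.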


TTL Options: 5-Minute Default vs 1-Hour Extended Cache

By default, cached prefixes expire after 5 minutes of inactivity. The TTL resets on every cache read, so if you are making calls at least once every 5 minutes the cache stays warm indefinitely. For high-frequency applications — chatbots, interactive agents, live dashboards — the 5-minute TTL is sufficient and costs nothing extra.
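The refresh-on-read behavior can be modeled with timestamps to estimate how many calls in a session would land inside the TTL window. This is a simplified sketch, not a description of Anthropic's internals: the helper names are hypothetical, and treating a gap of exactly the TTL as a miss is an assumption.

```python
from datetime import datetime, timedelta

DEFAULT_TTL = timedelta(minutes=5)


def cache_still_warm(last_use: datetime, now: datetime,
                     ttl: timedelta = DEFAULT_TTL) -> bool:
    """True if less than ttl has elapsed since the prefix was last written or read."""
    return now - last_use < ttl


def count_cache_hits(call_times: list[datetime], ttl: timedelta = DEFAULT_TTL) -> int:
    """Count calls that land inside the TTL window opened by the previous call.

    Every call either reads the cache (refreshing it) or rewrites it, so the
    previous call is always the most recent use of the prefix.
    """
    hits = 0
    for previous, current in zip(call_times, call_times[1:]):
        if cache_still_warm(previous, current, ttl):
            hits += 1
    return hits


start = datetime(2026, 1, 1, 9, 0)
# Calls at 0, 3, 7, 14, and 16 minutes: the 7-minute gap forces one rewrite.
session_calls = [start + timedelta(minutes=m) for m in (0, 3, 7, 14, 16)]

print(count_cache_hits(session_calls))
print(count_cache_hits(session_calls, ttl=timedelta(hours=1)))
```

Feeding real request timestamps through a model like this shows whether the default TTL is enough or whether the gaps in your traffic call for the longer tier.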

For lower-frequency workflows such as document review queues, overnight batch processing, or scheduled jobs, 5 minutes is not enough. Anthropic offers an extended 1-hour TTL, requested through an anthropic-beta header (the extended-cache-ttl header shown in the example below, not the older prompt-caching-2024-07-31 header that gated the original caching beta) together with a cache_control type of "ephemeral". When using extended TTL, the cache stays warm for up to 60 minutes of inactivity, which is sufficient for most async processing pipelines.

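To request extended TTL in Python:

```python
response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    extra_headers={
        "anthropic-beta": "extended-cache-ttl-2025-02-19"
    },
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Anthropic's docs describe a ttl field ("5m" or "1h") for selecting the tier
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }
    ],
    messages=[...]
)
```

The extended TTL is available on all models that support prompt caching. Check docs.anthropic.com for the current beta header string, because the feature may be promoted to stable, at which point the header would change or no longer be needed. Also confirm the write price for 1-hour entries there: Anthropic's pricing documentation lists a higher write multiplier for the 1-hour tier than the 1.25x that applies to 5-minute entries.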


Cache Write vs Cache Read Pricing: The Break-Even Math

Cache writes cost 1.25x the standard input token price. Cache reads cost 0.1x. The write surcharge is 0.25x and each read saves 0.9x, so the break-even point, where the extra write cost is paid back by read savings, arrives on the first read of the same prefix: two cached calls cost 1.35x in total versus 2.0x uncached. From then on, every read saves 90% vs the standard rate.

Worked example using Claude Sonnet 4.6 pricing (assume $3.00 per million input tokens as of June 2026). A system prompt of 4,000 tokens:

- Without caching: 10 calls × 4,000 tokens × $3.00/1M = $0.12
- With caching: 1 write at 1.25x ($3.75/1M) + 9 reads at 0.1x ($0.30/1M) = (4,000 × $3.75/1M) + 9 × (4,000 × $0.30/1M) = $0.015 + $0.0108 = $0.026
- Savings: $0.12 → $0.026 = **78% reduction on 10 calls**

At 100 calls per session, the savings approach the theoretical 90% maximum because the one-time write cost becomes negligible.

The write cost matters most for single-call workflows: if you only call the API once with a given prefix, caching costs 25% more than not caching. Always confirm your application will make at least 2 calls with the same prefix inside the TTL window before enabling cache_control on a block. For agent loops, RAG pipelines, and multi-turn chat, this threshold is almost always met.
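The multiplier arithmetic in this section can be checked in units of one uncached pass over the prefix. This is a sketch under the tutorial's stated 1.25x write and 0.1x read multipliers; the function names are hypothetical, and it assumes one write followed by reads that all land inside the TTL window.

```python
WRITE_MULTIPLIER = 1.25
READ_MULTIPLIER = 0.10


def relative_cost(calls: int, cached: bool) -> float:
    """Total prefix cost for `calls` calls, in units of one uncached pass."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    if not cached:
        return float(calls)
    # The first call writes the cache; every later call reads it.
    return WRITE_MULTIPLIER + (calls - 1) * READ_MULTIPLIER


def savings_fraction(calls: int) -> float:
    """Fraction saved by caching versus not caching over `calls` calls."""
    return 1 - relative_cost(calls, cached=True) / relative_cost(calls, cached=False)


for n in (1, 2, 10, 100):
    print(n, f"{savings_fraction(n):+.1%}")
```

A negative value for a single call reflects the write surcharge; the sign flips as soon as one read follows the write.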


What to Cache: System Prompts, Tool Definitions, Documents, Few-Shot Examples

**System prompts** are the canonical first thing to cache. A 2,000-token system prompt that accompanies every API call in your application is the easiest win: stable content, maximum reuse, zero user-facing change. Place a cache_control breakpoint at the end of your system prompt block and every subsequent call in the same TTL window pays 0.1x on those tokens.

**Tool definitions** are the second-highest-value target. If your agent uses a fixed set of tools — a web search tool, a code execution tool, a database query tool — those definitions can be 500-2,000 tokens each and are resent on every call in an agent loop. Cache them by placing the cache_control breakpoint after the last tool definition. Combined with a cached system prompt, caching tools typically covers 60-80% of total input tokens in an agentic workflow. See the agent loop cost optimization guide for a full breakdown of where tokens go in multi-step agents.

**Retrieved documents** from RAG pipelines are the third target, but with a caveat: they are only cacheable if the same document is retrieved on multiple calls within the TTL window. If each user query retrieves a different document, there is no cache benefit. The pattern that works: cache the document on the first query about it, then keep the same document pinned across the conversation for follow-up questions. For static knowledge bases — a product manual, a legal agreement, a technical specification — cache the entire document at the start of the session.

**Few-shot examples** are often overlooked. If you include 5-10 examples in every prompt to steer output format, those examples may be 1,000-3,000 tokens. They are completely static and cache perfectly. Place your cache breakpoint after the examples and before the user's actual input.


Order-of-Prefix Rules: Why Sequence Matters

Anthropic's cache is prefix-keyed: a cache hit only occurs if the token sequence from position 0 up to the breakpoint is byte-for-byte identical to a previously cached sequence. This means the order of your content blocks determines whether caching works.

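Anthropic assembles the cacheable prefix in a fixed order regardless of how you pass the arguments: tools first, then system, then messages. Within that order, keep static content ahead of dynamic content: (1) static tool definitions, (2) static system prompt, (3) static documents or few-shot examples at the start of the messages, (4) dynamic user message. Each cacheable block should have its own cache_control breakpoint. The dynamic user message must come last, because any dynamic content before a cache breakpoint invalidates the cache for that breakpoint and every breakpoint that follows it.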

A common mistake: inserting a timestamp, request ID, or user-specific variable inside the system prompt. Example:

```python
from datetime import datetime

query = "Where is my order?"  # placeholder user input

# WRONG: the timestamp changes on every call and breaks the cache
system = f"You are a helpful assistant. Current time: {datetime.now()}"

# RIGHT: move dynamic content to the user message
system = "You are a helpful assistant."
user_message = f"Current time: {datetime.now()}. User query: {query}"
```

If you need to pass session-specific data to the model, always place it in the user message (or assistant turn) after the last breakpoint, never in the system prompt or in any block at or before a cache_control breakpoint.


Worked Dollar-Savings Example: An Agent Loop

Consider a production agent that answers customer support questions. Each session consists of 15 API calls. The stable context per call: 3,000-token system prompt + 1,500 tokens of tool definitions + 4,000 tokens of retrieved knowledge base articles = 8,500 stable input tokens per call. Variable context per call: ~200 tokens of conversation history + ~100 tokens of user message = ~300 variable tokens per call.

Without caching, using Claude Sonnet 4.6 at $3.00/1M input:

- 15 calls × 8,800 tokens × $3.00/1M = $0.396 per session
- At 1,000 sessions/month: $396/month in input costs alone

With caching (1 write + 14 reads per session, assuming consecutive calls are never more than 5 minutes apart, so the default TTL keeps refreshing across a typical 10-minute session):

- Cache write: 8,500 tokens × $3.75/1M = $0.032
- 14 cache reads: 8,500 × 14 × $0.30/1M = $0.036
- Variable tokens (no cache): 300 × 15 × $3.00/1M = $0.014
- Total per session: $0.082
- At 1,000 sessions/month: $82/month

**Result: $396 → $82/month, a 79% reduction. $3,768 saved annually.**

For larger agent deployments running tens of thousands of sessions per month, the dollar savings scale proportionally. Check the AI Prompt Cost Calculator to run this math against your own session volume.
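The per-session arithmetic above can be reproduced with a few lines of Python. This is a sketch using the section's own figures (15 calls, 8,500 stable tokens, 300 variable tokens, $3.00 per million input tokens); session_input_cost is a hypothetical helper, and it assumes one cache write followed by reads on every remaining call.

```python
def session_input_cost(calls: int, stable_tokens: int, variable_tokens: int,
                       price_per_mtok: float, cached: bool) -> float:
    """Input cost in USD for one session, with or without a cached stable prefix."""
    per_token = price_per_mtok / 1_000_000
    variable_cost = calls * variable_tokens * per_token  # never cached
    if not cached:
        return calls * stable_tokens * per_token + variable_cost
    write_cost = stable_tokens * per_token * 1.25                # first call
    read_cost = (calls - 1) * stable_tokens * per_token * 0.10   # remaining calls
    return write_cost + read_cost + variable_cost


uncached = session_input_cost(15, 8_500, 300, 3.00, cached=False)
cached = session_input_cost(15, 8_500, 300, 3.00, cached=True)

print(f"uncached ${uncached:.3f}  cached ${cached:.3f}  saved {1 - cached / uncached:.1%}")
```

Summing the unrounded terms gives about $0.081 per session; the $0.082 figure above comes from adding line items that were each rounded first.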

If you also use the Batch API for asynchronous sessions (see batch API savings calculator), you can stack the Batch API's 50% discount on top of the caching savings, bringing the combined cost reduction to over 85% vs the naive synchronous + uncached baseline.


Caching Tool Definitions in Agent Loops: Python Code

Tool definitions are one of the highest-value caching targets in agentic code because they are resent on every turn of the loop, they are large (a fully-typed tool schema is typically 300-800 tokens per tool), and they never change mid-session. Here is a minimal example caching both the system prompt and the tool list:

```python
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "web_search",
        "description": "Search the web for current information.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query."}
            },
            "required": ["query"]
        }
    },
    {
        "name": "run_code",
        "description": "Execute Python code and return stdout.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to run."}
            },
            "required": ["code"]
        }
    }
]


def call_agent(user_message: str, conversation_history: list) -> anthropic.types.Message:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=[
            {
                "type": "text",
                # Illustrative only: a real prefix must meet the minimum cacheable size
                "text": "You are a helpful agent with web search and code execution.",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        # Tools sit ahead of system in the cached prefix, so the breakpoint above covers them
        tools=tools,
        messages=[
            *conversation_history,
            {"role": "user", "content": user_message}
        ]
    )
    return response
```

To cache tool definitions specifically, Anthropic supports cache_control on individual tool objects via the tools parameter. Add the breakpoint to the last tool in your list to cache all preceding tools as a unit:

```python
tools_with_cache = [
    {"name": "web_search", ...},  # no cache_control needed
    {
        "name": "run_code",
        ...,
        "cache_control": {"type": "ephemeral"}  # caches all tools up to this point
    }
]
```

This approach is documented in the Anthropic prompt caching reference. Combining system prompt caching with tool definition caching is the standard pattern for agent loops and is what the agent loop cost optimization guide recommends as the baseline configuration before any other optimization.
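Placing the breakpoint on the last tool can be done programmatically, so the marker stays in the right place when the tool list grows. The helper below is a hypothetical sketch, not an SDK function; it works on plain dicts and returns a copy, leaving the original list untouched. The example tool schemas are trimmed placeholders.

```python
import copy
from typing import Any


def with_cache_breakpoint(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of `tools` with cache_control set on the last tool only."""
    if not tools:
        return []
    marked = copy.deepcopy(tools)
    for tool in marked[:-1]:
        tool.pop("cache_control", None)  # keep a single breakpoint for the tool block
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


example_tools = [
    {"name": "web_search", "description": "Search the web.",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "run_code", "description": "Execute Python code.",
     "input_schema": {"type": "object", "properties": {}}},
]

cached_tools = with_cache_breakpoint(example_tools)
print([("cache_control" in tool) for tool in cached_tools])
```

Stripping stray markers from earlier tools keeps the tool block to one breakpoint, which leaves the remaining breakpoints free for the system prompt and documents.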


Common Pitfalls That Cause Cache Misses

**Dynamic content in cached blocks.** The most common production mistake. Any variable that changes between calls — timestamps, request IDs, user names, session tokens, current dates — placed inside a cached block will cause a cache miss on every call. Audit every f-string and string.format() call in your prompt construction code. If the variable changes, it must be in a non-cached block.

**Reordering messages or tools between calls.** The cache key is the full token sequence. If your code builds the tool list from a dictionary and the iteration order is non-deterministic (Python dicts are ordered as of 3.7, but set comprehensions and other patterns are not), a reordered tool list will miss the cache even though the content is identical. Always use a deterministic list for your tool definitions.

**Falling below the minimum token threshold.** As noted in the minimum token section above, prefixes shorter than 1,024 tokens (Opus/Sonnet) or 2,048 tokens (Haiku) will not be cached. The API does not return an error — it silently ignores the cache_control field. Check cache_creation_input_tokens in your first response to confirm the write happened.

**Forgetting to set the beta header for extended TTL.** The TTL is measured from the last cache write or read, not from the start of the session. If the gap between consecutive calls can exceed 5 minutes and you are not seeing cache hits on later turns, you are likely missing the extended-cache-ttl beta header. Add it to every call in the session, not just the first one.

**Using prompt caching with the Batch API without verifying TTL overlap.** The Batch API can process requests over a window of up to 24 hours. If your batch window is longer than the cache TTL, requests near the end of the batch will miss the cache. For batch workloads where caching is important, structure your batches so all requests using the same cached prefix are submitted close together in time, or accept that the cache savings only apply to a subset of the batch. See the batch API savings calculator for combined savings modeling.
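The first two pitfalls (dynamic content and reordering) can be caught before a request is sent by fingerprinting the static prefix and comparing it across calls. This is a sketch under assumptions: prefix_fingerprint is a hypothetical helper, and hashing the local JSON is only a proxy for the token sequence the API actually sees.

```python
import hashlib
import json
from typing import Any


def prefix_fingerprint(tools: list[dict[str, Any]], system: list[dict[str, Any]]) -> str:
    """Hash the static prefix exactly as built, so any change yields a new digest."""
    # sort_keys is deliberately left off: the goal is to detect ordering drift, not hide it.
    serialized = json.dumps({"tools": tools, "system": system}, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


tool_a = {"name": "run_code", "description": "Execute Python code.",
          "input_schema": {"type": "object"}}
tool_b = {"name": "web_search", "description": "Search the web.",
          "input_schema": {"type": "object"}}
system = [{"type": "text", "text": "You are a helpful agent."}]


def by_name(tool: dict[str, Any]) -> str:
    return tool["name"]


# Sorting by name gives the same prefix no matter how the tools were collected.
stable = prefix_fingerprint(sorted([tool_b, tool_a], key=by_name), system)
same = prefix_fingerprint(sorted([tool_a, tool_b], key=by_name), system)
reordered = prefix_fingerprint([tool_b, tool_a], system)

print(stable == same, stable == reordered)
```

Logging the digest next to the usage fields makes it straightforward to tie a drop in cache reads to the deploy or code path that changed the prefix.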


Comparing Caching Strategies: When to Use Each Pattern

**Single-document QA (cache the document):** A user uploads a PDF and asks multiple questions about it. Cache the extracted document text on the first question. All follow-up questions in the same session hit the cache. Savings: 70-90% on input tokens for sessions with 3+ questions. The document must be placed before the user's question in the message array and must not change between turns.

**Multi-turn chatbot (cache the system prompt only):** A customer support bot with a 2,000-token system prompt and short conversation turns. Cache the system prompt; let the conversation history accumulate as standard tokens. Savings: 30-50% on input tokens, depending on system prompt size relative to conversation length. This is the simplest pattern and the right starting point for most teams. Cross-reference function calling vs structured output if you are also deciding how to parse model responses.

**Agent loop (cache system + tools + context):** Multiple cache breakpoints — after system prompt, after tool definitions, and optionally after a pinned context document. This is the highest-complexity pattern but also the highest savings, typically 75-90% on input tokens for loops with 10+ turns. Combine with model tiering (using Claude Haiku 4.5 for tool calls that don't require reasoning) for maximum cost reduction. The agent loop cost optimization guide covers the full pattern.

**Batch classification (limited caching value):** If each item in a batch has a different document or context, caching provides minimal benefit since each request has a unique prefix. The exception: if all batch items share the same system prompt and instructions, cache the shared prefix and let the per-item content be uncached. A 2,000-token shared system prompt across 10,000 requests is 20M prefix tokens: about $60 at Claude Sonnet 4.6's $3.00/1M standard rate versus about $6 at the 0.1x read rate, assuming every request after the first hits the cache and before any batch discount.


Verifying Cache Hits in Production: Monitoring and Logging

Every API response from Anthropic includes a usage object with cache-specific fields. Log these on every call in production:

```python
usage = response.usage
print({
    "input_tokens": usage.input_tokens,
    "cache_creation_input_tokens": usage.cache_creation_input_tokens,
    "cache_read_input_tokens": usage.cache_read_input_tokens,
    "output_tokens": usage.output_tokens
})
```

A healthy cache hit looks like: input_tokens ~300, cache_creation_input_tokens 0, cache_read_input_tokens ~8500. A cache miss (first call in a session) looks like: input_tokens ~300, cache_creation_input_tokens ~8500, cache_read_input_tokens 0.

Build a cache hit rate metric in your observability stack: hit_rate = cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens). A well-configured agent loop should show a hit rate above 90% once the system is warmed. A hit rate below 80% suggests dynamic content is leaking into cached blocks, the TTL is expiring between calls, or the minimum token threshold is not being met on some requests.
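The hit rate formula above translates directly into code. The sketch below aggregates over a list of per-call (read, creation) token pairs; the function name and the tuple representation are assumptions made for illustration, and returning 0.0 when no cache activity was recorded is a design choice rather than a convention from Anthropic.

```python
def cache_hit_rate(calls: list[tuple[int, int]]) -> float:
    """Token-weighted hit rate over (cache_read_tokens, cache_creation_tokens) pairs."""
    total_read = sum(read for read, _ in calls)
    total_created = sum(created for _, created in calls)
    total = total_read + total_created
    if total == 0:
        return 0.0  # no cache activity at all: report 0 rather than divide by zero
    return total_read / total


# One write of 8,500 tokens followed by 14 reads, as in the agent loop example.
session = [(0, 8_500)] + [(8_500, 0)] * 14

print(f"{cache_hit_rate(session):.1%}")
```

Because the metric is weighted by tokens, a miss on a large prefix pulls the rate down more than a miss on a small one, which matches how the misses show up on the bill.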

Use this metric to gate cost alerts: if your monthly AI spend increases while usage is flat, a drop in cache hit rate is often the cause. Compare cache_read_input_tokens vs cache_creation_input_tokens in your billing period totals. For full cost attribution across models and call types, the Anthropic Claude pricing 2026 guide covers how to map these usage fields to dollar amounts.

Frequently Asked Questions

Does prompt caching change what the model outputs?

No. A cache hit returns the same model response as if the tokens had been processed normally. The cached representation is the internal KV state, not a stored output. Temperature, sampling, and all other generation parameters apply normally to the non-cached portion of the prompt.

What is the minimum prefix size to enable caching on Claude Haiku 4.5?

Claude Haiku 4.5 requires a minimum of 2,048 tokens before the cache_control breakpoint. Claude Opus 4.8 and Claude Sonnet 4.6 require 1,024 tokens. If your prefix is shorter, the cache_control field is silently ignored and cache_creation_input_tokens will be 0 in the response usage object.

How much does a cache write cost vs a cache read?

Cache writes cost 1.25x the standard input token price (a 25% surcharge). Cache reads cost 0.1x the standard input token price (90% off). The 0.25x write surcharge is paid back by the 0.9x saved on the first read. On that read and every one after it, the cached tokens are 90% cheaper than in a standard call.

What is the difference between the 5-minute TTL and the 1-hour extended TTL?

The default TTL is 5 minutes of inactivity — the cache stays warm as long as you call the API at least once every 5 minutes. The extended 1-hour TTL is requested via an anthropic-beta header and keeps the cache warm for up to 60 minutes of inactivity. Use the extended TTL for lower-frequency workflows where 5-minute intervals are not guaranteed.

Can I use prompt caching with the Batch API?

Yes, but with a caveat: the cache TTL must overlap the batch processing window. If your batch takes longer than 1 hour to process and requests are spread across that window, later requests may miss the cache. For batch workloads with large shared prefixes, group requests with the same cached prefix into the same batch submission and submit them close in time.

How many cache_control breakpoints can I set per request?

Anthropic supports up to four cache_control breakpoints per request. Each breakpoint checkpoints the prefix at that position. You can use all four: one after the system prompt, one after tool definitions, one after a retrieved document, and one after few-shot examples. Do not rely on extras being dropped: the API rejects a request that carries more than four with a validation error, so count breakpoints before sending.

Will caching work if I change one tool definition but not others?

If you change any content before or at the cache breakpoint, the entire prefix from that point forward misses the cache. If your tools are ordered consistently and you change only the last tool (the one that carries the cache_control breakpoint), the cache miss applies to the full tool block. A workaround: place the cache breakpoint after only the stable tools and leave variable tools after the breakpoint without cache_control.

Does DDH's AI Prompt Cost Calculator account for prompt caching savings?

Yes — the calculator has a caching toggle that applies the 0.1x read rate to a configurable percentage of input tokens, so you can model before/after scenarios. Enter your monthly session count, average tokens per call, and expected cache hit rate to get an accurate projection.

See your exact caching savings before writing a line of code.

Paste your monthly token volume into the [AI Prompt Cost Calculator](/blog/ai-prompt-cost-calculator) — toggle caching on and off to see the before/after cost across every Claude model at your actual usage level. Then grab a cost-optimized prompt from DDH Pro, already tuned for the model tier you select.
