Anthropic Prompt Caching Cost: 90% Off Repeated Context
Anthropic's prompt caching lets you store frequently reused context (system prompts, reference documents, tool definitions) and serve subsequent API calls from that cache at 10% of the standard input token price. For applications with long, stable system prompts or repeated document context, caching is often the largest single cost-reduction lever available in the Claude API.
This page covers the exact cache write and cache read prices for every Claude 4 model, worked dollar-savings examples for common use cases, and a code snippet showing exactly how to enable caching in your API calls.
Claude Prompt Caching Prices: All Models
| Model | Standard Input | Cache Write | Cache Read | Savings vs Standard |
|---|---|---|---|---|
| Claude Haiku 4.5 (cheapest) | $0.80/M | $1.00/M | $0.08/M | 90% on reads |
| Claude Sonnet 4.6 (default) | $3.00/M | $3.75/M | $0.30/M | 90% on reads |
| Claude Opus 4.7 (most capable) | $15.00/M | $18.75/M | $1.50/M | 90% on reads |
Cache write tokens are charged at 1.25× the standard input rate — a one-time fee to populate the cache. Cache read tokens are charged at 0.10× the standard input rate — every subsequent call that hits the cache pays only 10% of the full input price. Output tokens are unaffected by caching.
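The multipliers above imply that caching costs more on a single call and pays for itself from the second call onward. The sketch below works through that arithmetic; the 10,000-token prefix and the function names are illustrative assumptions, and the $3.00/M price is the Sonnet 4.6 figure from the table.

```python
# Multipliers from the pricing table: write = 1.25x input, read = 0.10x input.
WRITE_MULT = 1.25
READ_MULT = 0.10


def cached_cost(prefix_tokens, calls, input_price_per_m):
    """Dollar cost of sending the same prefix `calls` times (calls >= 1):
    one cache write followed by cache reads."""
    per_token = input_price_per_m / 1_000_000
    write = prefix_tokens * per_token * WRITE_MULT
    reads = prefix_tokens * per_token * READ_MULT * (calls - 1)
    return write + reads


def uncached_cost(prefix_tokens, calls, input_price_per_m):
    """Dollar cost of sending the same prefix `calls` times with no caching."""
    return prefix_tokens * (input_price_per_m / 1_000_000) * calls


# Hypothetical 10,000-token prefix at $3.00/M standard input.
for calls in (1, 2, 100):
    c = cached_cost(10_000, calls, 3.00)
    u = uncached_cost(10_000, calls, 3.00)
    print(f"{calls:>3} calls: cached ${c:.4f} vs uncached ${u:.4f}")
```

With one call the cached path is 25% more expensive; with two calls it is already cheaper, because the second call pays 0.10× instead of 1.00×.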
How Claude Prompt Caching Works
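Caching operates at the level of individual content blocks in your API request. You mark a block as cacheable, and Anthropic stores everything up to and including that block in a server-side cache keyed to your API account. The flow has three steps:
1. First call: a request containing a cache_control block triggers a cache write. You pay the cache write rate (1.25× input) for everything up to and including the marked block, and the standard input rate for any tokens after it. Output tokens are billed normally. The cache is populated and held for 5 minutes.
2. Subsequent calls: a request within the TTL that repeats the same prefix is a cache hit. The cached tokens are billed at the cache read rate (0.10× input), and each hit resets the 5-minute TTL.
3. Expiry: if no request hits the cache for 5 minutes, the entry expires and the next request pays the write rate again.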
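The write-then-read behavior and the resetting TTL can be sketched as a small simulation. This is a simplified model rather than Anthropic's implementation: it assumes every call sends an identical prefix, that timestamps are sorted, and that a call landing exactly at expiry counts as a miss. The function name is hypothetical.

```python
TTL_SECONDS = 5 * 60  # default TTL described above


def classify_calls(timestamps):
    """Label each call 'write' or 'read' given sorted call times in seconds.

    Assumption: a call arriving exactly at the expiry instant is a miss.
    """
    labels = []
    expires_at = None  # no cache entry exists before the first call
    for t in timestamps:
        if expires_at is not None and t < expires_at:
            labels.append("read")
        else:
            labels.append("write")
        # Both a fresh write and a hit restart the 5-minute clock.
        expires_at = t + TTL_SECONDS
    return labels


print(classify_calls([0, 60, 300, 700, 720]))
```

The call at 300 seconds is still a read because the hit at 60 seconds pushed expiry to 360; the call at 700 seconds arrives after the entry expired at 600 and pays for a new write.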
Real Savings Calculations
The system-prompt example assumes each conversation makes exactly one call with the full 2,000-token system prompt and that the cache stays warm throughout the day. Actual savings depend on conversation concurrency and on inter-call gaps relative to the 5-minute TTL.
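As a rough illustration of the 2,000-token system prompt case, the sketch below compares daily cost with and without caching for each model in the pricing table. The volume of 10,000 conversations per day is a hypothetical figure chosen for the example, and the calculation assumes a single cache write followed by reads for the rest of the day.

```python
# $/M tokens from the pricing table: (standard input, cache write, cache read)
PRICES = {
    "Claude Haiku 4.5": (0.80, 1.00, 0.08),
    "Claude Sonnet 4.6": (3.00, 3.75, 0.30),
    "Claude Opus 4.7": (15.00, 18.75, 1.50),
}

PROMPT_TOKENS = 2_000
CONVERSATIONS_PER_DAY = 10_000  # hypothetical volume, not from the source


def daily_costs(model):
    """Return (uncached, cached) daily dollars spent on the system prompt."""
    standard, write, read = PRICES[model]
    millions = PROMPT_TOKENS / 1_000_000
    uncached = CONVERSATIONS_PER_DAY * millions * standard
    # One write to populate the cache, then reads while it stays warm.
    cached = millions * write + (CONVERSATIONS_PER_DAY - 1) * millions * read
    return uncached, cached


for model in PRICES:
    uncached, cached = daily_costs(model)
    print(f"{model}: ${uncached:.2f} uncached vs ${cached:.2f} cached per day")
```

If the cache goes cold during quiet periods, each restart adds one more write at 1.25×, which barely moves the total at this volume.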
A 5,000-token document comfortably exceeds the 1,024-token minimum. This pattern is ideal for Q&A over a fixed knowledge base, contract analysis, or any pipeline where the same reference document is queried many times in succession.
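A minimal sketch of the document Q&A arithmetic follows. It assumes the document is cached once and each question is sent uncached after it; the 20 questions of 100 tokens each are hypothetical numbers, and the function name is illustrative.

```python
MIN_CACHEABLE_TOKENS = 1_024  # minimum stated in this article


def document_qa_cost(doc_tokens, question_tokens, questions, input_price_per_m):
    """Return (cached, uncached) input cost in dollars for a Q&A run
    of at least one question over one fixed document."""
    per_token = input_price_per_m / 1_000_000
    uncached = (doc_tokens + question_tokens) * questions * per_token
    if doc_tokens < MIN_CACHEABLE_TOKENS:
        # Too short to cache: the flag is ignored and nothing changes.
        return uncached, uncached
    # Document: one write at 1.25x, then reads at 0.10x.
    doc_cost = doc_tokens * per_token * (1.25 + 0.10 * (questions - 1))
    # Questions sit after the breakpoint and pay the standard rate.
    question_cost = question_tokens * questions * per_token
    return doc_cost + question_cost, uncached


# 5,000-token document on Sonnet 4.6 ($3.00/M), hypothetical 20 questions.
cached, uncached = document_qa_cost(5_000, 100, 20, 3.00)
print(f"cached ${cached:.5f} vs uncached ${uncached:.5f}")
```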
Claude Code automatically caches CLAUDE.md and the accumulated conversation prefix on every turn. An 8,000-token CLAUDE.md is a realistic size for a well-documented project. The savings compound significantly across a full workday of coding sessions.
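The compounding effect of a growing cached prefix can be approximated with a simple model: each turn reads the previous prefix from cache and writes only the newly added tokens. This is a simplification for illustration, not a description of how Claude Code is actually billed. The 1,500 new tokens per turn and the 40-turn session are hypothetical, and the cache is assumed to stay warm between turns.

```python
def session_tokens(base_tokens, new_tokens_per_turn, turns):
    """Return (written, read, uncached_total) token counts for a session."""
    written = read = uncached_total = 0
    prefix = 0
    for turn in range(turns):
        added = base_tokens if turn == 0 else new_tokens_per_turn
        read += prefix            # the earlier prefix is served from cache
        written += added          # only the new tail is written
        prefix += added
        uncached_total += prefix  # without caching, the full prefix is re-sent
    return written, read, uncached_total


def session_cost(base_tokens, new_tokens_per_turn, turns, input_price_per_m):
    """Return (cached, uncached) input cost in dollars for the session."""
    written, read, uncached_total = session_tokens(
        base_tokens, new_tokens_per_turn, turns
    )
    per_token = input_price_per_m / 1_000_000
    cached = (written * 1.25 + read * 0.10) * per_token
    return cached, uncached_total * per_token


# 8,000-token CLAUDE.md, hypothetical 1,500 new tokens per turn, 40 turns.
cached, uncached = session_cost(8_000, 1_500, 40, 3.00)
print(f"cached ${cached:.2f} vs uncached ${uncached:.2f}")
```

Because the uncached total grows roughly quadratically with the number of turns while writes grow linearly, the gap widens the longer a session runs.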
Enabling Prompt Caching in Python
Add a cache_control block with type: "ephemeral" to any content block you want cached. Everything up to and including that block will be stored in the cache. The example below caches a long system prompt:
```python
import anthropic

client = anthropic.Anthropic()

# Placeholders: supply your own prompt text and user message.
your_long_system_prompt = "..."  # must be ≥ 1,024 tokens for caching to activate
user_message = "..."

# Cache the system prompt
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,  # ≥ 1,024 tokens
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": user_message}
    ],
)

# Inspect cache usage in the response
usage = response.usage
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Uncached input: {usage.input_tokens}")
```
On the first call, cache_creation_input_tokens will be non-zero. On subsequent calls within 5 minutes with the same system prompt, cache_read_input_tokens will be non-zero instead. You can set up to 4 cache breakpoints per request, which lets you cache system prompts, document blocks, and tool definitions independently.
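A request with several breakpoints might be assembled as in the sketch below, which marks the last tool definition, the system prompt, and a document block. The helper names, the sample tool, and the placeholder strings are hypothetical; the code only builds the request parameters and does not call the API, so it can be run without credentials.

```python
EPHEMERAL = {"type": "ephemeral"}
MAX_BREAKPOINTS = 4  # per-request limit stated above


def build_request(tools, system_prompt, document_text, question):
    """Build Messages API parameters with three cache breakpoints."""
    # Marking the last tool caches the whole tool list up to that point.
    last_tool = {**tools[-1], "cache_control": EPHEMERAL}
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "tools": tools[:-1] + [last_tool],
        "system": [
            {"type": "text", "text": system_prompt, "cache_control": EPHEMERAL}
        ],
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": document_text,
                     "cache_control": EPHEMERAL},
                    {"type": "text", "text": question},  # varies, not cached
                ],
            }
        ],
    }


def count_breakpoints(node):
    """Count dicts carrying a cache_control key anywhere in the request."""
    if isinstance(node, dict):
        own = 1 if "cache_control" in node else 0
        return own + sum(count_breakpoints(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_breakpoints(v) for v in node)
    return 0


# Hypothetical tool definition and placeholder texts.
tools = [{
    "name": "lookup_policy",
    "description": "Look up a policy section by id.",
    "input_schema": {"type": "object",
                     "properties": {"id": {"type": "string"}},
                     "required": ["id"]},
}]
params = build_request(tools, "<long system prompt>", "<long document>", "What changed?")
assert count_breakpoints(params) <= MAX_BREAKPOINTS
print(count_breakpoints(params))
```

The parameters can then be passed to the client from the example above as client.messages.create(**params).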
When to Use Caching
Good fit: high-ROI use cases
- ✓ Long system prompts repeated across every conversation (>1K tokens)
- ✓ RAG / document Q&A: same document, many questions in sequence
- ✓ Agentic loops: tool definitions and growing context reused turn-over-turn
- ✓ Claude Code sessions: CLAUDE.md and file context cached across the session
- ✓ Few-shot prompts with large example banks sent on every request
- ✓ Legal / compliance review: policy documents queried repeatedly
Poor fit: low-ROI use cases
- ✗ One-shot queries: no cache hit is possible, so you only pay the write premium
- ✗ Short system prompts under 1,024 tokens: caching never activates
- ✗ Highly dynamic context that changes on every call
- ✗ Infrequent calls spaced more than 5 minutes apart (TTL expires, repeated writes)
- ✗ Output-heavy workloads where input cost is already a small fraction of the bill
- ✗ Very short conversations where context doesn't accumulate enough to hit the minimum
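One way to place a workload on this good-fit / poor-fit spectrum is to estimate its cache hit rate. The sketch below computes the average price of a cached-prefix token relative to standard input, using the 1.25× and 0.10× multipliers from the pricing section. It assumes every miss results in a full write; the function name is illustrative.

```python
WRITE_MULT, READ_MULT = 1.25, 0.10


def effective_multiplier(hit_rate):
    """Average price of a prefix token relative to the standard input rate,
    given the fraction of calls that hit the cache (0.0 to 1.0)."""
    return hit_rate * READ_MULT + (1 - hit_rate) * WRITE_MULT


# Hit rate at which caching costs the same as not caching:
# 1.25 - 1.15 * h = 1  =>  h = 0.25 / 1.15
break_even = (WRITE_MULT - 1) / (WRITE_MULT - READ_MULT)
print(f"Break-even hit rate: {break_even:.1%}")

for rate in (0.0, 0.5, 0.95):
    print(f"hit rate {rate:.0%}: {effective_multiplier(rate):.4f}x standard input")
```

Under these assumptions, a workload needs only a bit more than one hit in five calls to break even, which is why the poor-fit cases above are those where hits are essentially impossible.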
Frequently Asked Questions
How much do Claude prompt cache tokens cost?
Cache write tokens cost 25% more than standard input tokens — a one-time fee to populate the cache. Cache read tokens cost approximately 10% of the standard input price, roughly 90% cheaper than a fresh input read. Exact rates: Haiku 4.5 cache write $1.00/M, cache read $0.08/M. Sonnet 4.6 cache write $3.75/M, cache read $0.30/M. Opus 4.7 cache write $18.75/M, cache read $1.50/M.
What is the minimum size for Claude prompt caching to activate?
The cached prefix must be at least 1,024 tokens to be eligible for caching. Shorter prefixes are not cached regardless of the cache_control setting: no error is thrown, and the flag is simply ignored. The 1,024-token figure is not universal, because Anthropic documents higher minimums for some models (Haiku models among them), so check the limit for the model you call. The cache TTL is 5 minutes by default; subsequent API calls within 5 minutes that include the same cached block are charged the cache read rate. Each hit resets the TTL.
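Because a too-short block is ignored silently, the usage fields are the practical way to confirm whether caching activated. The helper below is a hedged sketch: cache_status is a hypothetical name, and SimpleNamespace objects stand in for the response.usage object returned by the SDK.

```python
from types import SimpleNamespace


def cache_status(usage):
    """Classify a response's usage as 'write', 'hit', 'partial hit',
    or 'not cached' from its cache token counters."""
    # Treat missing or None counters as zero.
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if created and read:
        return "partial hit"  # part of the prefix was read, the rest written
    if read:
        return "hit"
    if created:
        return "write"
    return "not cached"  # e.g. the prefix was below the minimum size


# Stand-ins for response.usage from three different calls.
first = SimpleNamespace(cache_creation_input_tokens=2048,
                        cache_read_input_tokens=0, input_tokens=30)
second = SimpleNamespace(cache_creation_input_tokens=0,
                         cache_read_input_tokens=2048, input_tokens=30)
too_short = SimpleNamespace(cache_creation_input_tokens=0,
                            cache_read_input_tokens=0, input_tokens=530)

for usage in (first, second, too_short):
    print(cache_status(usage))
```

If a request with cache_control set keeps reporting "not cached", the prefix is most likely under the minimum for the model in use.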
When does prompt caching save the most money?
Caching saves most when: (1) you have a long system prompt (>1K tokens) that is the same across many conversations, (2) you are doing document or RAG queries where the document stays fixed across multiple questions, (3) you are running agentic loops where tool definitions and context accumulate turn-over-turn. Caching saves least for one-shot queries with no repeated context — in those cases you pay the 1.25× write premium with no offsetting cache hits.
How do I enable prompt caching in the Claude API?
Add a cache_control block with type: "ephemeral" to any content block you want to cache. You can cache system prompts, user messages, tool definitions, and document blocks. The cached block and everything before it in the context window will be cached together. Minimum 1,024 tokens per cacheable block. You can place up to 4 cache breakpoints in a single request, which lets you cache multiple independent sections (e.g. system prompt + document block) in one call.
Does prompt caching work with the Batch API?
Yes. Prompt caching works with the Batch API, and Anthropic's documentation states that the prompt caching and batch discounts can stack. Because batch requests are processed asynchronously and concurrently, cache hits are provided on a best-effort basis, so requests that share a prefix within a batch are not guaranteed to hit the cache. Combining caching and the Batch API can still yield large savings on high-volume async workloads with repeated context.