Open Calculator →

Anthropic Prompt Caching Cost: 90% Off Repeated Context

Updated May 2026 · Claude 4 pricing · 6-min read

Anthropic's prompt caching lets you store frequently reused context — system prompts, reference documents, tool definitions — and serve subsequent API calls from that cache at roughly 10% of the standard input token price. For applications with long, stable system prompts or repeated document context, caching is the single most impactful cost-reduction lever available in the Claude API.

This page covers the exact cache write and cache read prices for every Claude 4 model, worked dollar-savings examples for common use cases, and a code snippet showing exactly how to enable caching in your API calls.

Calculate My API Costs →

Claude Prompt Caching Prices — All Models

Cache Token Pricing by Model (per 1M tokens)
Model Standard Input Cache Write Cache Read Savings vs Standard
Claude Haiku 4.5 Cheapest $0.80/M $1.00/M $0.08/M 90% on reads
Claude Sonnet 4.6 Default $3.00/M $3.75/M $0.30/M 90% on reads
Claude Opus 4.7 Most Capable $15.00/M $18.75/M $1.50/M 90% on reads

Cache write tokens are charged at 1.25× the standard input rate — a one-time fee to populate the cache. Cache read tokens are charged at 0.10× the standard input rate — every subsequent call that hits the cache pays only 10% of the full input price. Output tokens are unaffected by caching.

How Claude Prompt Caching Works

Caching operates at the level of individual content blocks in your API request. You mark a block as cacheable, and Anthropic stores everything up to and including that block in a server-side cache keyed to your API account. Here is the three-step flow:

1
First call — cache write
The first API call that includes a cache_control block triggers a cache write. You pay standard input rates for any tokens before the cached block, plus the cache write rate (1.25× input) for the cached block itself. Output tokens are billed normally. The cache is populated and held for 5 minutes (TTL resets on each hit).
2
Subsequent calls within TTL — cache read
Any call made within 5 minutes that includes the same cached prefix is served from cache. The cached block tokens are billed at the cache read rate (0.10× input) — a 90% reduction. New tokens added after the cache breakpoint are billed at the standard input rate. Output tokens are billed normally.
3
After TTL expires — cache write again
If no call references the cached block for 5 minutes, the cache expires. The next call triggers another cache write, and the cycle repeats. Long-running pipelines that need continuous caching should make sure at least one call per 5 minutes references the cached block to keep the TTL alive.
Minimum token requirement: A cacheable block must contain at least 1,024 tokens. Blocks shorter than this threshold are silently ignored — no error is thrown, but no caching occurs either. Always verify your system prompt reaches this minimum before relying on cache savings in production.

Real Savings Calculations

Chatbot with 2,000-token system prompt 1,000 conversations/day · Sonnet 4.6
Without caching — daily input cost $6.00/day
1,000 calls × 2,000 tokens × $3.00/M
With caching — daily input cost $0.606/day
1 write × 2,000 × $3.75/M = $0.0075 + 999 reads × 2,000 × $0.30/M = $0.5994
Monthly savings ~$161/month

Assumes each conversation has exactly one call using the full 2,000-token system prompt and that the cache stays warm throughout the day. Actual savings depend on conversation concurrency and inter-call gaps relative to the 5-minute TTL.

RAG pipeline with 5,000-token document 500 queries/day · Haiku 4.5
Without caching — daily input cost $2.00/day
500 calls × 5,000 tokens × $0.80/M
With caching — daily input cost $0.205/day
1 write × 5,000 × $1.00/M = $0.005 + 499 reads × 5,000 × $0.08/M = $0.1996
Monthly savings ~$54/month

A 5,000-token document comfortably exceeds the 1,024-token minimum. This pattern is ideal for Q&A over a fixed knowledge base, contract analysis, or any pipeline where the same reference document is queried many times in succession.

Claude Code session with 8,000-token CLAUDE.md 100 turns · Sonnet 4.6
Without caching — CLAUDE.md input cost per session $2.40
100 turns × 8,000 tokens × $3.00/M
With caching — CLAUDE.md input cost per session $0.267
1 write × 8,000 × $3.75/M = $0.030 + 99 reads × 8,000 × $0.30/M = $0.2376
Savings per session $2.13 (89% off)

Claude Code automatically caches CLAUDE.md and the accumulated conversation prefix on every turn. An 8,000-token CLAUDE.md is a realistic size for a well-documented project. The savings compound significantly across a full workday of coding sessions.

Enabling Prompt Caching in Python

Add a cache_control block with type: "ephemeral" to any content block you want cached. Everything up to and including that block will be stored in the cache. The example below caches a long system prompt:

Python · Anthropic SDK
import anthropic

client = anthropic.Anthropic()

# Cache the system prompt (must be ≥ 1,024 tokens to activate)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": your_long_system_prompt,  # ≥ 1,024 tokens
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": user_message}
    ],
)

# Inspect cache usage in the response
usage = response.usage
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens:  {usage.cache_read_input_tokens}")
print(f"Uncached input:     {usage.input_tokens}")

On the first call, cache_creation_input_tokens will be non-zero. On subsequent calls within 5 minutes with the same system prompt, cache_read_input_tokens will be non-zero instead. You can cache up to 4 breakpoints per request, allowing you to cache system prompts, document blocks, and tool definitions independently.

When to Use Caching

Good fit High ROI use cases

  • Long system prompts repeated across every conversation (>1K tokens)
  • RAG / document Q&A — same document, many questions in sequence
  • Agentic loops — tool definitions and growing context reused turn-over-turn
  • Claude Code sessions — CLAUDE.md and file context cached across the session
  • Few-shot prompts with large example banks sent on every request
  • Legal / compliance review — policy documents queried repeatedly

Poor fit Low ROI use cases

  • One-shot queries — no cache hit possible, you only pay the write premium
  • Short system prompts under 1,024 tokens — caching never activates
  • Highly dynamic context that changes on every call
  • Infrequent calls spaced more than 5 minutes apart (TTL expires, repeated writes)
  • Output-heavy workloads where input cost is already a small fraction of the bill
  • Very short conversations where context doesn't accumulate enough to hit the minimum

Frequently Asked Questions

How much do Claude prompt cache tokens cost?

Cache write tokens cost 25% more than standard input tokens — a one-time fee to populate the cache. Cache read tokens cost approximately 10% of the standard input price, roughly 90% cheaper than a fresh input read. Exact rates: Haiku 4.5 cache write $1.00/M, cache read $0.08/M. Sonnet 4.6 cache write $3.75/M, cache read $0.30/M. Opus 4.7 cache write $18.75/M, cache read $1.50/M.

What is the minimum size for Claude prompt caching to activate?

A cacheable block must be at least 1,024 tokens to be eligible for caching. Shorter blocks are not cached regardless of the cache_control setting — no error is thrown, the flag is simply ignored. The cache TTL is 5 minutes by default; subsequent API calls within 5 minutes that include the same cached block will be charged the cache read rate. Each hit resets the TTL.

When does prompt caching save the most money?

Caching saves most when: (1) you have a long system prompt (>1K tokens) that is the same across many conversations, (2) you are doing document or RAG queries where the document stays fixed across multiple questions, (3) you are running agentic loops where tool definitions and context accumulate turn-over-turn. Caching saves least for one-shot queries with no repeated context — in those cases you pay the 1.25× write premium with no offsetting cache hits.

How do I enable prompt caching in the Claude API?

Add a cache_control block with type: "ephemeral" to any content block you want to cache. You can cache system prompts, user messages, tool definitions, and document blocks. The cached block and everything before it in the context window will be cached together. Minimum 1,024 tokens per cacheable block. You can place up to 4 cache breakpoints in a single request, which lets you cache multiple independent sections (e.g. system prompt + document block) in one call.

Does prompt caching work with the Batch API?

Yes. Prompt caching is fully compatible with the Batch API. Cache write tokens are charged at the standard (non-discounted) cache write rate. Cache read tokens are also charged at the standard cache read rate. The 50% Batch API discount applies only to uncached input tokens and output tokens — not to cache write or cache read tokens. Combining caching and the Batch API can yield very large savings on high-volume async workloads with repeated context.