7 ways to reduce your Claude API costs

Updated May 2026 • Claude 4 pricing • 5-min read

Anthropic's Claude API is powerful but the bill can surprise you. Here are seven concrete techniques developers use to cut Claude API costs by 40–90%, with real dollar examples for each.

Jump to:

Prompt caching — up to 90% off repeated context
Batch API — 50% off asynchronous workloads
Model tiering — use the right model per task
Token compression — shrink prompts before they're sent
Output token limits — stop runaway completions
Streaming vs polling — don't wait to save
Measure first — you can't cut what you can't see

1. Prompt caching Easy

Up to 90% savings on repeated context

If your application prepends a large system prompt or document to every call, prompt caching is the single biggest lever you have. Anthropic charges cache-write tokens at 1.25× the normal input rate, then cache-read tokens at just 10% of the normal input rate.

Example: You have a 50,000-token legal document you pass on every analysis call. At Sonnet 4.6 ($3/M input), that's $0.15 per call uncached. With caching, the first call is $0.1875 (cache-write), but every subsequent call costs just $0.015 — a 90% reduction from call 2 onwards.

Enable caching by adding "cache_control": {"type": "ephemeral"} to the last block you want cached in your system array. The cache lives for 5 minutes by default; long-running pipelines should refresh it periodically.

Token type	Rate (vs normal input)	When charged
cache_write	1.25×	First call that fills the cache
cache_read	0.10×	Every subsequent call while cached
Output tokens	1.00× (normal output rate)	Not affected by caching

2. Batch API Easy

50% off — guaranteed

Anthropic's Message Batches API gives a flat 50% discount on every token (input and output) for requests that don't need a real-time response. Results are available within 24 hours, usually much sooner.

Batch API is perfect for:

Document classification pipelines
Overnight data enrichment jobs
Generating product descriptions / embeddings in bulk
Evaluation runs that compare model outputs

Example: Processing 10,000 short documents with Haiku 4.5 at $0.80/M input. Real-time: $0.008 each = $80 total. Batch: $0.004 each = $40 total. If you run this daily, that's $14,600/year saved.

3. Model tiering Easy

60–95% savings vs. Opus

Not every task needs the most capable model. Routing simpler subtasks to Haiku while reserving Opus or Sonnet for reasoning-heavy steps can cut costs dramatically.

Model	Input $/M	Output $/M	Best for
Claude Haiku 4.5	$0.80	$4.00	Classification, extraction, simple Q&A
Claude Sonnet 4.6	$3.00	$15.00	Code generation, reasoning, summarisation
Claude Opus 4.7	$15.00	$75.00	Complex reasoning, multi-step agents

A practical tiering strategy: try Haiku first and escalate to Sonnet only when confidence is low (e.g. model returns "I don't know" or a low-confidence marker you define). Many production teams keep 70–80% of their traffic on Haiku this way.

4. Token compression Medium

10–40% prompt reduction

Every token costs money. Tightening your prompts — removing redundant instructions, whitespace, and verbose explanations — directly lowers every call.

High-ROI compression techniques:

Remove filler phrases: "Please make sure to always…" → "Always…"
Use structured formats: JSON/XML schemas instead of prose descriptions of the desired output.
Chunk large documents: Pass only the relevant section, not the full file.
Trim chat history: Keep only the last N turns rather than the full conversation; use a rolling summary for older context.
Deduplicate system prompt boilerplate: If multiple tools share the same base instructions, cache them once (see Tip 1).

Example: A 2,000-token system prompt trimmed to 1,200 tokens saves 800 tokens per call. At Sonnet pricing ($3/M), that's $0.0024 per call. Across 100,000 calls/month: $240/month saved with no model change.

5. Output token limits Easy

Variable — catches runaway completions

Output tokens are priced 4–5× higher than input tokens. Setting max_tokens explicitly prevents open-ended completions from blowing out your budget on a single call.

Guidelines by use case:

Task type	Suggested max_tokens
Classification / label only	10–50
Short answer / yes-no	50–200
Summarisation (paragraph)	200–500
Code generation (function)	500–1500
Long-form document draft	2000–4000

Also consider asking the model to "be concise" or "answer in one sentence" — this reduces actual output length independently of max_tokens, and reduced output means a lower bill.

6. Streaming — use it for UX, not billing Medium

No direct cost — but prevents ghost tokens

Streaming itself doesn't reduce billed tokens. But it surfaces partial responses earlier, letting you cancel a generation early if the model goes off-track — saving the trailing output tokens.

If you stream and the first 50 tokens show the model is hallucinating or going off-topic, call stream.abort(). You're only billed for tokens generated up to that point, not the full completion. For long code-gen tasks this can save 500–2,000 output tokens per bad completion.

7. Measure first — you can't cut what you can't see Advanced

Prerequisite for everything else

Before optimizing, instrument your usage. Log usage.input_tokens, usage.output_tokens, usage.cache_read_input_tokens, and usage.cache_creation_input_tokens from every API response. Aggregate them by endpoint, user, or feature flag.

Use the Claude Code Cost Calculator to parse Claude Code session logs and see exactly where your tokens are going — by model, by tool call, and by hour. Paste your .jsonl log file and get an instant breakdown.

See where your tokens are going

Paste a Claude Code session log to get a free, instant cost breakdown by model, tool, and hour.

Open the free calculator →

Summary: quick wins vs. deep optimizations

Technique	Effort	Typical saving	Best when
Prompt caching	Low	50–90%	Large repeated context
Batch API	Low	50%	Async workloads
Model tiering	Low	60–95%	Mixed-complexity tasks
Token compression	Medium	10–40%	Verbose system prompts
Output limits	Low	Variable	All production calls
Streaming abort	Medium	Variable	Long code-gen / agents
Usage instrumentation	Medium	Prerequisite	Any serious workload