7 ways to reduce your Claude API costs

Updated May 2026 • Claude 4 pricing • 5-min read

Anthropic's Claude API is powerful but the bill can surprise you. Here are seven concrete techniques developers use to cut Claude API costs by 40–90%, with real dollar examples for each.

Jump to:
  1. Prompt caching — up to 90% off repeated context
  2. Batch API — 50% off asynchronous workloads
  3. Model tiering — use the right model per task
  4. Token compression — shrink prompts before they're sent
  5. Output token limits — stop runaway completions
  6. Streaming vs polling — don't wait to save
  7. Measure first — you can't cut what you can't see

1. Prompt caching Easy

Up to 90% savings on repeated context

If your application prepends a large system prompt or document to every call, prompt caching is the single biggest lever you have. Anthropic charges cache-write tokens at 1.25× the normal input rate, then cache-read tokens at just 10% of the normal input rate.

Example: You have a 50,000-token legal document you pass on every analysis call. At Sonnet 4.6 ($3/M input), that's $0.15 per call uncached. With caching, the first call is $0.1875 (cache-write), but every subsequent call costs just $0.015 — a 90% reduction from call 2 onwards.

Enable caching by adding "cache_control": {"type": "ephemeral"} to the last block you want cached in your system array. The cache lives for 5 minutes by default; long-running pipelines should refresh it periodically.

Token typeRate (vs normal input)When charged
cache_write1.25×First call that fills the cache
cache_read0.10×Every subsequent call while cached
Output tokens1.00× (normal output rate)Not affected by caching

2. Batch API Easy

50% off — guaranteed

Anthropic's Message Batches API gives a flat 50% discount on every token (input and output) for requests that don't need a real-time response. Results are available within 24 hours, usually much sooner.

Batch API is perfect for:

Example: Processing 10,000 short documents with Haiku 4.5 at $0.80/M input. Real-time: $0.008 each = $80 total. Batch: $0.004 each = $40 total. If you run this daily, that's $14,600/year saved.

3. Model tiering Easy

60–95% savings vs. Opus

Not every task needs the most capable model. Routing simpler subtasks to Haiku while reserving Opus or Sonnet for reasoning-heavy steps can cut costs dramatically.

ModelInput $/MOutput $/MBest for
Claude Haiku 4.5$0.80$4.00Classification, extraction, simple Q&A
Claude Sonnet 4.6$3.00$15.00Code generation, reasoning, summarisation
Claude Opus 4.7$15.00$75.00Complex reasoning, multi-step agents

A practical tiering strategy: try Haiku first and escalate to Sonnet only when confidence is low (e.g. model returns "I don't know" or a low-confidence marker you define). Many production teams keep 70–80% of their traffic on Haiku this way.

4. Token compression Medium

10–40% prompt reduction

Every token costs money. Tightening your prompts — removing redundant instructions, whitespace, and verbose explanations — directly lowers every call.

High-ROI compression techniques:

Example: A 2,000-token system prompt trimmed to 1,200 tokens saves 800 tokens per call. At Sonnet pricing ($3/M), that's $0.0024 per call. Across 100,000 calls/month: $240/month saved with no model change.

5. Output token limits Easy

Variable — catches runaway completions

Output tokens are priced 4–5× higher than input tokens. Setting max_tokens explicitly prevents open-ended completions from blowing out your budget on a single call.

Guidelines by use case:

Task typeSuggested max_tokens
Classification / label only10–50
Short answer / yes-no50–200
Summarisation (paragraph)200–500
Code generation (function)500–1500
Long-form document draft2000–4000

Also consider asking the model to "be concise" or "answer in one sentence" — this reduces actual output length independently of max_tokens, and reduced output means a lower bill.

6. Streaming — use it for UX, not billing Medium

No direct cost — but prevents ghost tokens

Streaming itself doesn't reduce billed tokens. But it surfaces partial responses earlier, letting you cancel a generation early if the model goes off-track — saving the trailing output tokens.

If you stream and the first 50 tokens show the model is hallucinating or going off-topic, call stream.abort(). You're only billed for tokens generated up to that point, not the full completion. For long code-gen tasks this can save 500–2,000 output tokens per bad completion.

7. Measure first — you can't cut what you can't see Advanced

Prerequisite for everything else

Before optimizing, instrument your usage. Log usage.input_tokens, usage.output_tokens, usage.cache_read_input_tokens, and usage.cache_creation_input_tokens from every API response. Aggregate them by endpoint, user, or feature flag.

Use the Claude Code Cost Calculator to parse Claude Code session logs and see exactly where your tokens are going — by model, by tool call, and by hour. Paste your .jsonl log file and get an instant breakdown.

See where your tokens are going

Paste a Claude Code session log to get a free, instant cost breakdown by model, tool, and hour.

Open the free calculator →

Summary: quick wins vs. deep optimizations

TechniqueEffortTypical savingBest when
Prompt cachingLow50–90%Large repeated context
Batch APILow50%Async workloads
Model tieringLow60–95%Mixed-complexity tasks
Token compressionMedium10–40%Verbose system prompts
Output limitsLowVariableAll production calls
Streaming abortMediumVariableLong code-gen / agents
Usage instrumentationMediumPrerequisiteAny serious workload

Related guides