Anthropic's Claude API is powerful but the bill can surprise you. Here are seven concrete techniques developers use to cut Claude API costs by 40–90%, with real dollar examples for each.
If your application prepends a large system prompt or document to every call, prompt caching is the single biggest lever you have. Anthropic charges cache-write tokens at 1.25× the normal input rate, then cache-read tokens at just 10% of the normal input rate.
Enable caching by adding "cache_control": {"type": "ephemeral"} to the last block you want cached in your system array. The cache lives for 5 minutes by default; long-running pipelines should refresh it periodically.
| Token type | Rate (vs normal input) | When charged |
|---|---|---|
| cache_write | 1.25× | First call that fills the cache |
| cache_read | 0.10× | Every subsequent call while cached |
| Output tokens | 1.00× (normal output rate) | Not affected by caching |
Anthropic's Message Batches API gives a flat 50% discount on every token (input and output) for requests that don't need a real-time response. Results are available within 24 hours, usually much sooner.
Batch API is perfect for:
Not every task needs the most capable model. Routing simpler subtasks to Haiku while reserving Opus or Sonnet for reasoning-heavy steps can cut costs dramatically.
| Model | Input $/M | Output $/M | Best for |
|---|---|---|---|
| Claude Haiku 4.5 | $0.80 | $4.00 | Classification, extraction, simple Q&A |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Code generation, reasoning, summarisation |
| Claude Opus 4.7 | $15.00 | $75.00 | Complex reasoning, multi-step agents |
A practical tiering strategy: try Haiku first and escalate to Sonnet only when confidence is low (e.g. model returns "I don't know" or a low-confidence marker you define). Many production teams keep 70–80% of their traffic on Haiku this way.
Every token costs money. Tightening your prompts — removing redundant instructions, whitespace, and verbose explanations — directly lowers every call.
High-ROI compression techniques:
Output tokens are priced 4–5× higher than input tokens. Setting max_tokens explicitly prevents open-ended completions from blowing out your budget on a single call.
Guidelines by use case:
| Task type | Suggested max_tokens |
|---|---|
| Classification / label only | 10–50 |
| Short answer / yes-no | 50–200 |
| Summarisation (paragraph) | 200–500 |
| Code generation (function) | 500–1500 |
| Long-form document draft | 2000–4000 |
Also consider asking the model to "be concise" or "answer in one sentence" — this reduces actual output length independently of max_tokens, and reduced output means a lower bill.
Streaming itself doesn't reduce billed tokens. But it surfaces partial responses earlier, letting you cancel a generation early if the model goes off-track — saving the trailing output tokens.
If you stream and the first 50 tokens show the model is hallucinating or going off-topic, call stream.abort(). You're only billed for tokens generated up to that point, not the full completion. For long code-gen tasks this can save 500–2,000 output tokens per bad completion.
Before optimizing, instrument your usage. Log usage.input_tokens, usage.output_tokens, usage.cache_read_input_tokens, and usage.cache_creation_input_tokens from every API response. Aggregate them by endpoint, user, or feature flag.
Use the Claude Code Cost Calculator to parse Claude Code session logs and see exactly where your tokens are going — by model, by tool call, and by hour. Paste your .jsonl log file and get an instant breakdown.
Paste a Claude Code session log to get a free, instant cost breakdown by model, tool, and hour.
Open the free calculator →| Technique | Effort | Typical saving | Best when |
|---|---|---|---|
| Prompt caching | Low | 50–90% | Large repeated context |
| Batch API | Low | 50% | Async workloads |
| Model tiering | Low | 60–95% | Mixed-complexity tasks |
| Token compression | Medium | 10–40% | Verbose system prompts |
| Output limits | Low | Variable | All production calls |
| Streaming abort | Medium | Variable | Long code-gen / agents |
| Usage instrumentation | Medium | Prerequisite | Any serious workload |