Practical strategies to cut Claude API costs: model selection, prompt caching (90% savings on repeated context), Batch API (50% discount), token counting, max_tokens tuning, and request batching.
Reduce your Claude API bill by 50–90% with the right combination of model selection, prompt caching, Batch API, and token hygiene. Each technique is independent — stack them for maximum savings.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
|---|---|---|---|
| claude-haiku-4-5-20251001 | $0.80 | $4.00 | Classification, routing, short drafts, chatbots |
| claude-sonnet-4-6 | $3.00 | $15.00 | Reasoning, coding, summarization, RAG |
| claude-opus-4-7 | $15.00 | $75.00 | Complex analysis, architecture, premium tiers |
Rule of thumb: start every new task on Haiku. Upgrade to Sonnet only if output quality is insufficient. Reserve Opus for tasks where accuracy directly affects revenue.
import anthropic
client = anthropic.Anthropic()
# Tier your models by task
MODELS = {
"classify": "claude-haiku-4-5-20251001", # $0.80/1M input
"summarize": "claude-sonnet-4-6", # $3.00/1M input
"analyze": "claude-opus-4-7", # $15.00/1M input
}
def call(task: str, prompt: str, max_tokens: int = 512) -> str:
r = client.messages.create(
model=MODELS[task],
max_tokens=max_tokens,
messages=[{"role": "user", "content": prompt}],
)
return r.content[0].text
Prompt caching lets you cache large, repeated context blocks (system prompts, documents, tool definitions). Cached input tokens cost $0.08/1M (vs $3.00 uncached on Sonnet) — a 90% reduction.
def cached_rag_query(document: str, question: str) -> str:
"""Cache the document; only the question is billed at full price."""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
system=[
{
"type": "text",
"text": "You are a document Q&A assistant. Answer only from the provided document.",
},
{
"type": "text",
"text": document,
"cache_control": {"type": "ephemeral"}, # cache this block
},
],
messages=[{"role": "user", "content": question}],
)
usage = response.usage
print(f"Cache read: {usage.cache_read_input_tokens} tokens (90% cheaper)")
print(f"Cache created: {usage.cache_creation_input_tokens} tokens")
return response.content[0].text
# First call: full price (cache miss) — subsequent calls are 90% cheaper
doc = open("manual.txt").read()
answer1 = cached_rag_query(doc, "What is the return policy?")
answer2 = cached_rag_query(doc, "How do I reset my password?") # 90% cheaper
def batch_process(items: list[str], task_prompt: str) -> str:
"""Submit a batch job at 50% of standard pricing."""
requests = [
{
"custom_id": f"item-{i}",
"params": {
"model": "claude-haiku-4-5-20251001",
"max_tokens": 256,
"messages": [{"role": "user", "content": f"{task_prompt}\n\n{item}"}],
},
}
for i, item in enumerate(items)
]
batch = client.messages.batches.create(message_batch=requests)
print(f"Batch {batch.id} queued — results ready in <24h")
return batch.id
# Use for: nightly report generation, bulk classification, data enrichment
batch_id = batch_process(
items=["review text 1", "review text 2", "review text 3"],
task_prompt="Classify sentiment as positive/neutral/negative. One word only.",
)
def safe_call(prompt: str, budget_tokens: int = 10_000) -> str | None:
"""Count tokens first; skip if over budget."""
count = client.messages.count_tokens(
model="claude-sonnet-4-6",
messages=[{"role": "user", "content": prompt}],
)
if count.input_tokens > budget_tokens:
print(f"Skipped: {count.input_tokens} tokens exceeds budget {budget_tokens}")
return None
return client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
messages=[{"role": "user", "content": prompt}],
).content[0].text
# Bad: leaving max_tokens at 4096 for a one-sentence task
r = client.messages.create(model="claude-haiku-4-5-20251001", max_tokens=4096, ...)
# Good: set max_tokens to a realistic ceiling for your task
TASK_MAX_TOKENS = {
"classify": 64, # one word / short label
"summarize": 256, # 2-3 sentences
"draft": 512, # one email or paragraph
"analyze": 2048, # detailed report
}
r = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=TASK_MAX_TOKENS["classify"],
messages=[{"role": "user", "content": "Classify: positive/neutral/negative. \n\n{text}"}],
)
def trim_context(text: str, max_chars: int = 4000) -> str:
"""Truncate long documents before sending to Claude."""
if len(text) <= max_chars:
return text
# Keep start (setup/instructions) and end (recent context)
half = max_chars // 2
return text[:half] + "\n\n[... middle truncated ...]\n\n" + text[-half:]
# ~4 chars per token on average — 4000 chars ≈ 1000 tokens ≈ $0.003 on Haiku
prompt = f"Summarize:\n\n{trim_context(long_document)}"
| Technique | Typical savings | Best for |
|---|---|---|
| Use Haiku instead of Sonnet | 73–95% | Any task Haiku handles well |
| Prompt caching | Up to 90% on cached tokens | Repeated large context (docs, tools) |
| Batch API | 50% | Non-real-time bulk processing |
| Right-size max_tokens | 10–50% | Short-output tasks |
| Truncate inputs | 10–40% | Pipelines with long context |
Stack model selection + prompt caching + Batch API for compounding savings: switching a nightly batch job from Sonnet→Haiku (73% off) and adding the Batch API (50% off the already-reduced price) cuts the combined cost by ~87%. Use the Claude API Cost Calculator to model your specific usage before optimizing. For prompt caching with tool definitions, see the tool use guide. For Batch API details, see the Batch API guide.