
Practical strategies to cut Claude API costs: model selection, prompt caching (up to 90% savings on repeated context), the Batch API (50% discount), token counting, max_tokens tuning, and input truncation.

Claude API Cost Optimization

Reduce your Claude API bill by 50–90% with the right combination of model selection, prompt caching, Batch API, and token hygiene. Each technique is independent — stack them for maximum savings.

1. Model selection — the biggest lever

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
|---|---|---|---|
| claude-haiku-4-5-20251001 | $0.80 | $4.00 | Classification, routing, short drafts, chatbots |
| claude-sonnet-4-6 | $3.00 | $15.00 | Reasoning, coding, summarization, RAG |
| claude-opus-4-7 | $15.00 | $75.00 | Complex analysis, architecture, premium tiers |

Rule of thumb: start every new task on Haiku. Upgrade to Sonnet only if output quality is insufficient. Reserve Opus for tasks where accuracy directly affects revenue.

import anthropic

client = anthropic.Anthropic()

# Tier your models by task
MODELS = {
    "classify":  "claude-haiku-4-5-20251001",   # $0.80/1M input
    "summarize": "claude-sonnet-4-6",            # $3.00/1M input
    "analyze":   "claude-opus-4-7",              # $15.00/1M input
}

def call(task: str, prompt: str, max_tokens: int = 512) -> str:
    r = client.messages.create(
        model=MODELS[task],
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.content[0].text
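
To see how much the model tier matters for a given workload, the per-request cost can be estimated from the table above. The following is a minimal sketch: `PRICES` and `estimate_cost` are hypothetical names introduced here, the prices are copied from the table (check them against current pricing before relying on them), and the 2,000-input / 300-output workload is an arbitrary example.

```python
# USD per 1M tokens as (input, output), copied from the pricing table above.
# These figures are assumptions for illustration; verify them before use.
PRICES = {
    "claude-haiku-4-5-20251001": (0.80, 4.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-7": (15.00, 75.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Example workload: 2,000 input tokens and 300 output tokens per request.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 2_000, 300):.4f} per request")
```

Multiplying the per-request figure by expected daily volume gives a quick way to decide whether a task justifies a higher tier.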

2. Prompt caching — up to 90% savings on repeated context

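Prompt caching lets you cache large, repeated context blocks (system prompts, documents, tool definitions). On Sonnet, cached input tokens are read at $0.30/1M instead of $3.00/1M uncached, a 90% reduction. The call that writes the cache is billed at a premium (1.25x the base input price for the default 5-minute cache), so caching pays off from the second call onward.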

def cached_rag_query(document: str, question: str) -> str:
    """Cache the document; only the question is billed at full price."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": "You are a document Q&A assistant. Answer only from the provided document.",
            },
            {
                "type": "text",
                "text": document,
                "cache_control": {"type": "ephemeral"},  # cache this block
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    usage = response.usage
    print(f"Cache read: {usage.cache_read_input_tokens} tokens (90% cheaper)")
    print(f"Cache created: {usage.cache_creation_input_tokens} tokens")
    return response.content[0].text

# First call writes the cache (billed slightly above the normal input rate); later calls read it at ~90% off
with open("manual.txt") as f:
    doc = f.read()
answer1 = cached_rag_query(doc, "What is the return policy?")
answer2 = cached_rag_query(doc, "How do I reset my password?")  # cache hit: document tokens ~90% cheaper
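
Whether caching pays off depends on how often the same block is reused. The sketch below works through that arithmetic under stated assumptions: a cache write costs 1.25x the base input price, a cache read costs 0.1x, every call after the first is a cache hit, and only the cached document's tokens are counted. `caching_cost` is a hypothetical helper, and the $3.00 base price is the Sonnet input price quoted in this guide.

```python
def caching_cost(doc_tokens: int, calls: int, base_price: float = 3.00,
                 write_mult: float = 1.25, read_mult: float = 0.10) -> tuple[float, float]:
    """Return (uncached_usd, cached_usd) for sending the same document `calls` times.

    Assumes calls >= 1 and that every call after the first hits the cache.
    base_price is USD per 1M input tokens; the multipliers are assumptions.
    """
    per_token = base_price / 1_000_000
    uncached = doc_tokens * per_token * calls
    # One cache write, then (calls - 1) cache reads.
    cached = doc_tokens * per_token * (write_mult + read_mult * (calls - 1))
    return uncached, cached


# A 50,000-token document queried 1, 2, and 10 times.
for calls in (1, 2, 10):
    uncached, cached = caching_cost(50_000, calls)
    print(f"{calls:>2} calls: uncached ${uncached:.4f}, cached ${cached:.4f}")
```

Under these assumptions a single call is slightly more expensive with caching, and the saving approaches 90% only as the number of reuses grows.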

3. Batch API — 50% discount for non-real-time workloads

def batch_process(items: list[str], task_prompt: str) -> str:
    """Submit a batch job at 50% of standard pricing."""
    requests = [
        {
            "custom_id": f"item-{i}",
            "params": {
                "model": "claude-haiku-4-5-20251001",
                "max_tokens": 256,
                "messages": [{"role": "user", "content": f"{task_prompt}\n\n{item}"}],
            },
        }
        for i, item in enumerate(items)
    ]
    batch = client.messages.batches.create(requests=requests)
    print(f"Batch {batch.id} queued — results ready in <24h")
    return batch.id

# Use for: nightly report generation, bulk classification, data enrichment
batch_id = batch_process(
    items=["review text 1", "review text 2", "review text 3"],
    task_prompt="Classify sentiment as positive/neutral/negative. One word only.",
)
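
Submitting the batch is only half the workflow; the results have to be collected once processing ends. The following is a minimal sketch of that step, assuming the Python SDK's `messages.batches.retrieve` and `messages.batches.results` methods. `collect_batch_results` is a hypothetical helper that takes the client as a parameter and keeps only successful entries.

```python
from typing import Optional


def collect_batch_results(client, batch_id: str) -> Optional[dict[str, str]]:
    """Return {custom_id: reply text} once the batch has ended, else None."""
    batch = client.messages.batches.retrieve(batch_id)
    if batch.processing_status != "ended":
        return None  # still processing; check again later

    results = {}
    for entry in client.messages.batches.results(batch_id):
        # Entries that errored, expired, or were canceled are skipped in this sketch.
        if entry.result.type == "succeeded":
            results[entry.custom_id] = entry.result.message.content[0].text
    return results
```

A scheduled job could call `collect_batch_results(client, batch_id)` every few minutes and stop once it returns a dictionary. Logging the skipped entries would be a sensible addition for production use.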

4. Count tokens before calling — catch expensive prompts early

def safe_call(prompt: str, budget_tokens: int = 10_000) -> str | None:
    """Count tokens first; skip if over budget."""
    count = client.messages.count_tokens(
        model="claude-sonnet-4-6",
        messages=[{"role": "user", "content": prompt}],
    )
    if count.input_tokens > budget_tokens:
        print(f"Skipped: {count.input_tokens} tokens exceeds budget {budget_tokens}")
        return None
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text

5. Tune max_tokens — output tokens are billed too

# Risky: a 4096-token ceiling on a one-sentence task lets a runaway reply bill thousands of output tokens
# r = client.messages.create(model="claude-haiku-4-5-20251001", max_tokens=4096, ...)

# Good: set max_tokens to a realistic ceiling for your task
TASK_MAX_TOKENS = {
    "classify":  64,    # one word / short label
    "summarize": 256,   # 2-3 sentences
    "draft":     512,   # one email or paragraph
    "analyze":   2048,  # detailed report
}

text = "Great product, arrived on time."  # sample input for illustration
r = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=TASK_MAX_TOKENS["classify"],
    messages=[{"role": "user", "content": f"Classify: positive/neutral/negative.\n\n{text}"}],
)
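
A tight ceiling can cut a reply off mid-sentence, so it helps to detect that case. The sketch below checks the response's `stop_reason` field, which is `"max_tokens"` when generation stopped at the ceiling, and retries once with a larger limit. `call_with_ceiling` and its `fallback_max_tokens` parameter are hypothetical names, and the single-retry policy is just one possible design.

```python
def call_with_ceiling(client, model: str, prompt: str,
                      max_tokens: int, fallback_max_tokens: int) -> str:
    """Call with a tight ceiling; retry once with a larger one if the reply was cut off."""
    messages = [{"role": "user", "content": prompt}]
    response = client.messages.create(
        model=model, max_tokens=max_tokens, messages=messages
    )
    if response.stop_reason == "max_tokens":
        # The reply hit the ceiling. The retry is billed as a second full request,
        # so frequent retries mean the first ceiling is set too low.
        response = client.messages.create(
            model=model, max_tokens=fallback_max_tokens, messages=messages
        )
    return response.content[0].text
```

If retries are common for a task, raising that task's default ceiling is likely cheaper than paying for two requests.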

6. Truncate inputs — don't send tokens you don't need

def trim_context(text: str, max_chars: int = 4000) -> str:
    """Truncate long documents before sending to Claude."""
    if len(text) <= max_chars:
        return text
    # Keep start (setup/instructions) and end (recent context)
    half = max_chars // 2
    return text[:half] + "\n\n[... middle truncated ...]\n\n" + text[-half:]

# ~4 chars per token on average: 4000 chars ≈ 1000 tokens ≈ $0.0008 on Haiku ($0.003 on Sonnet)
long_document = "..."  # placeholder: load your own text here
prompt = f"Summarize:\n\n{trim_context(long_document)}"

Cost reduction summary

| Technique | Typical savings | Best for |
|---|---|---|
| Use Haiku instead of Sonnet or Opus | ~73% vs Sonnet, ~95% vs Opus | Any task Haiku handles well |
| Prompt caching | Up to 90% on cached tokens | Repeated large context (docs, tools) |
| Batch API | 50% | Non-real-time bulk processing |
| Right-size max_tokens | 10–50% | Short-output tasks |
| Truncate inputs | 10–40% | Pipelines with long context |

Stack model selection + prompt caching + Batch API for compounding savings: switching a nightly batch job from Sonnet→Haiku (73% off) and adding the Batch API (50% off the already-reduced price) cuts the combined cost by ~87%. Use the Claude API Cost Calculator to model your specific usage before optimizing. For prompt caching with tool definitions, see the tool use guide. For Batch API details, see the Batch API guide.
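
The ~87% figure follows from multiplying the fractions of cost that remain after each discount. A small sketch of that arithmetic, using the Haiku and Sonnet input prices quoted in this guide (`stacked_savings` is a hypothetical helper):

```python
def stacked_savings(*discounts: float) -> float:
    """Combine independent fractional discounts (0.5 means 50% off)."""
    remaining = 1.0
    for discount in discounts:
        remaining *= 1.0 - discount
    return 1.0 - remaining


haiku_vs_sonnet = 1 - 0.80 / 3.00  # about 0.733, from the pricing table
batch_discount = 0.50
print(f"{stacked_savings(haiku_vs_sonnet, batch_discount):.1%}")  # prints 86.7%
```

Discounts multiply rather than add, which is why 73% plus 50% yields roughly 87% and not more than 100%.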
