Claude API Rate Limits & 429 Error Handling in Python

Fix Anthropic API rate limit errors (429, overloaded_error) in Python. Working retry logic with exponential backoff, token-per-minute tracking, and concurrency controls.


Anthropic enforces requests-per-minute (RPM) and tokens-per-minute (TPM) limits. When a request exceeds either limit, the API returns HTTP 429 with error type rate_limit_error. Here's how to handle rate limits robustly in Python.

Quickstart: built-in retry

import anthropic

# SDK retries 429 automatically up to max_retries times
client = anthropic.Anthropic(max_retries=5)

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=512,
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.content[0].text)

Custom exponential backoff

import time
import random
import anthropic
from anthropic import RateLimitError, APIStatusError

client = anthropic.Anthropic(max_retries=0)  # handle manually

def call_with_backoff(messages, model="claude-haiku-4-5-20251001", max_tokens=512,
                      max_attempts=7):
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return client.messages.create(
                model=model, max_tokens=max_tokens, messages=messages
            )
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            jitter = random.uniform(0, 1)
            wait = min(delay + jitter, 60)
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1})")
            time.sleep(wait)
            delay = min(delay * 2, 60)
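        except APIStatusError as e:
            # 529 = overloaded. Re-raise any other status, and re-raise on the
            # final attempt so the function never falls through and returns None.
            if e.status_code != 529 or attempt == max_attempts - 1:
                raise
            time.sleep(min(delay + random.uniform(0, 1), 60))
            delay = min(delay * 2, 60)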
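
To see how long the loop above can block in the worst case, it helps to list the waits it produces. The following is a minimal sketch, not part of the SDK: backoff_schedule is a hypothetical helper that reproduces the same doubling-with-cap arithmetic, ignoring the 0 to 1 s of jitter added on each wait.

```python
def backoff_schedule(max_attempts=7, base=1.0, cap=60.0):
    """Return the pre-jitter wait (seconds) before each retry.

    There is one wait between consecutive attempts, so max_attempts
    attempts produce max_attempts - 1 waits.
    """
    delays = []
    delay = base
    for _ in range(max_attempts - 1):
        delays.append(min(delay, cap))
        delay = min(delay * 2, cap)
    return delays

schedule = backoff_schedule()
print(schedule)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
print(sum(schedule))  # 63.0 seconds of waiting before jitter
```

With the defaults, a call that is rate limited on every attempt gives up after roughly a minute of sleeping, which is worth knowing before raising max_attempts inside a request handler.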

Token-per-minute tracking for batch workloads

import time
import threading
from collections import deque
import anthropic

class TokenBudgetedClient:
    """Stays under a tokens-per-minute cap."""

    def __init__(self, tpm_limit=80_000):
        self.client = anthropic.Anthropic(max_retries=3)
        self.tpm_limit = tpm_limit
        self.window = deque()  # (timestamp, tokens_used)
        self.lock = threading.Lock()

    def _tokens_in_last_minute(self):
        cutoff = time.time() - 60
        while self.window and self.window[0][0] < cutoff:
            self.window.popleft()
        return sum(t for _, t in self.window)

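    def create(self, messages, model="claude-haiku-4-5-20251001", max_tokens=512):
        while True:
            with self.lock:
                used = self._tokens_in_last_minute()
                # Conservative: assume the full max_tokens of output. Input
                # tokens are only known after the call, so they are added then.
                if used + max_tokens <= self.tpm_limit:
                    # Reserve inside the lock so concurrent threads cannot all
                    # pass the check against the same remaining budget.
                    entry = [time.time(), max_tokens]
                    self.window.append(entry)
                    break
            time.sleep(2)

        try:
            response = self.client.messages.create(
                model=model, max_tokens=max_tokens, messages=messages
            )
        except Exception:
            with self.lock:
                entry[1] = 0  # release the reservation if the call failed
            raise
        total = response.usage.input_tokens + response.usage.output_tokens
        with self.lock:
            entry[1] = total  # replace the estimate with actual usage
        return response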

# Usage
tbc = TokenBudgetedClient(tpm_limit=80_000)
resp = tbc.create([{"role": "user", "content": "Summarize this."}])
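
The sliding-window bookkeeping is the part of this class that is easiest to get wrong, so here it is in isolation. This is a sketch with a hypothetical SlidingWindowCounter class that takes the current time as an argument instead of calling time.time(), which makes the expiry behaviour easy to check by hand.

```python
from collections import deque

class SlidingWindowCounter:
    """Sum of token counts recorded in the last `window_seconds` (hypothetical helper)."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.entries = deque()  # (timestamp, tokens), oldest first

    def add(self, now, tokens):
        self.entries.append((now, tokens))

    def total(self, now):
        cutoff = now - self.window_seconds
        # Drop entries that have aged out of the window
        while self.entries and self.entries[0][0] < cutoff:
            self.entries.popleft()
        return sum(tokens for _, tokens in self.entries)

counter = SlidingWindowCounter()
counter.add(now=0, tokens=50_000)
counter.add(now=30, tokens=25_000)
print(counter.total(now=45))  # 75000: both entries are inside the window
print(counter.total(now=61))  # 25000: the entry from t=0 has expired
print(counter.total(now=91))  # 0: the window is empty again
```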

Parallel requests with concurrency cap

import asyncio
import anthropic

async def process_batch(prompts, concurrency=5, model="claude-haiku-4-5-20251001"):
    client = anthropic.AsyncAnthropic(max_retries=4)
    sem = asyncio.Semaphore(concurrency)  # max concurrent calls

    async def call(prompt):
        async with sem:
            return await client.messages.create(
                model=model,
                max_tokens=256,
                messages=[{"role": "user", "content": prompt}]
            )

    results = await asyncio.gather(*[call(p) for p in prompts], return_exceptions=True)
    await client.close()
    return results

# Run
prompts = ["Translate 'hello' to French.", "Translate 'hello' to German."]
results = asyncio.run(process_batch(prompts))
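
Because gather is called with return_exceptions=True, the returned list mixes responses and exception objects in prompt order, so the caller has to separate them. The sketch below shows one way to do that; fake_call is a hypothetical stand-in for the API call so the example runs without network access, and split_results is an illustrative helper, not an SDK function.

```python
import asyncio

async def fake_call(prompt):
    """Stand-in for an API call; fails on empty prompts (hypothetical)."""
    await asyncio.sleep(0)
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()

async def run_all(prompts, concurrency=2):
    sem = asyncio.Semaphore(concurrency)  # max concurrent calls

    async def guarded(prompt):
        async with sem:
            return await fake_call(prompt)

    return await asyncio.gather(*(guarded(p) for p in prompts), return_exceptions=True)

def split_results(prompts, results):
    """Pair each prompt with its outcome and separate successes from failures."""
    ok, failed = [], []
    for prompt, result in zip(prompts, results):
        if isinstance(result, BaseException):
            failed.append((prompt, result))
        else:
            ok.append((prompt, result))
    return ok, failed

prompts = ["hello", "", "world"]
ok, failed = split_results(prompts, asyncio.run(run_all(prompts)))
print(ok)           # [('hello', 'HELLO'), ('world', 'WORLD')]
print(len(failed))  # 1
```

The failed list keeps the original prompt next to each exception, so the failures can be retried later without recomputing the successful ones.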

Rate limit tiers (2026)

Plan    RPM      TPM (Haiku)   TPM (Sonnet)   TPM (Opus)
Free    5        25K           25K            10K
Build   50       100K          80K            20K
Scale   Custom   Custom        Custom         Custom
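
Which of the two limits binds depends on how many tokens each request uses. As a rough worked example, the hypothetical helper below takes the Build tier figures for Haiku from the table and an assumed average request size, and returns the request rate that stays under both caps.

```python
def effective_rpm(rpm_limit, tpm_limit, tokens_per_request):
    """Requests per minute sustainable under both the RPM and TPM caps."""
    return min(rpm_limit, tpm_limit // tokens_per_request)

# Build tier, Haiku column: 50 RPM, 100K TPM
print(effective_rpm(50, 100_000, 1_000))  # 50: the RPM cap binds
print(effective_rpm(50, 100_000, 4_000))  # 25: the TPM cap binds
```

Small requests run into the RPM cap first, while requests averaging more than 2,000 tokens on these figures are limited by TPM instead.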

Track spending and token usage in real time with the Claude API Cost Calculator. For Batch API workloads (50% discount, async), see the cost optimization guide.

Frequently asked questions

What are the Claude API rate limits?
Limits vary by plan and model. Free tier: 5 req/min, 25K tokens/min. Build tier: 50 req/min, 100K tokens/min per model. Scale tier: custom. Check your current limits at console.anthropic.com under Settings → Limits.
What HTTP status code does Claude return for rate limit errors?
HTTP 429 with error type 'rate_limit_error'. A separate 'overloaded_error' (HTTP 529) means Anthropic's servers are temporarily busy; treat it the same as 429 and retry with backoff.
What is the best retry strategy for Claude API rate limits?
Exponential backoff with jitter: start at 1s, double each attempt, add random 0–1s jitter, cap at 60s. Retry up to 5–7 times. The Anthropic SDK's built-in retries (max_retries, default 2) apply exponential backoff automatically.
How do I check my remaining Claude API quota?
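The response headers include anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining. In the Python SDK, call client.messages.with_raw_response.create(...) to get a response object that exposes .headers, then call .parse() on it to get the usual message.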
Does the Anthropic Python SDK handle rate limits automatically?
Yes — the SDK retries on 429 up to max_retries times (default 2) with exponential backoff. Set max_retries=5 on the client for more resilience. For batch workloads, add your own token-per-minute counter on top.
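
One refinement to the retry strategy described above is to prefer the server's own hint when there is one. The sketch below assumes that a 429 response carries a retry-after header holding a number of seconds, and that the headers have been copied into a plain dict with lowercase keys (for example from the response attached to the raised error); wait_seconds is a hypothetical helper, not part of the SDK.

```python
def wait_seconds(headers, fallback, cap=60.0):
    """Prefer the server's retry-after hint; otherwise use the backoff delay."""
    raw = headers.get("retry-after")
    try:
        hinted = float(raw)
    except (TypeError, ValueError):
        # Header missing or not a plain number of seconds
        return min(fallback, cap)
    return min(max(hinted, 0.0), cap)

print(wait_seconds({"retry-after": "12"}, fallback=2.0))   # 12.0
print(wait_seconds({}, fallback=2.0))                      # 2.0
print(wait_seconds({"retry-after": "300"}, fallback=2.0))  # 60.0
```

Capping the hint keeps a single request from sleeping for minutes; whether that cap is appropriate depends on the workload.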
