Claude PDF Analysis in Python

Extract text, summarize, and analyze PDFs with the Claude API in Python. Pass PDFs as base64 documents or URLs and ask questions about the content.

Claude can read, summarize, extract data from, and answer questions about PDFs — no pre-processing or chunking required for typical document sizes.

Analyze a local PDF

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def analyze_pdf(pdf_path: str, question: str) -> str:
    pdf_data = base64.standard_b64encode(Path(pdf_path).read_bytes()).decode()

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data
                    }
                },
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text

summary = analyze_pdf("report.pdf", "Summarize the key findings in 3 bullet points.")
print(summary)
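
Base64 encoding inflates a file by roughly a third, so a PDF that looks small on disk can still be too large once encoded. A minimal sketch of a pre-flight check, assuming the 32MB figure quoted in the FAQ on this page is the limit that applies; `base64_size`, `fits_request_limit`, and `MAX_ENCODED_BYTES` are hypothetical names for illustration, not part of the SDK:

```python
import math
from pathlib import Path

# Assumption: the 32MB figure from this page's FAQ; adjust if your limit differs.
MAX_ENCODED_BYTES = 32 * 1024 * 1024


def base64_size(raw_bytes: int) -> int:
    """Length of the standard (padded) base64 encoding of `raw_bytes` bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def fits_request_limit(pdf_path: str, limit: int = MAX_ENCODED_BYTES) -> bool:
    """True if the encoded PDF alone stays within `limit` (other request fields are ignored)."""
    return base64_size(Path(pdf_path).stat().st_size) <= limit
```

Calling `fits_request_limit("report.pdf")` before `analyze_pdf` avoids reading and encoding a file that would be rejected anyway.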

Analyze a PDF from a URL

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "url",
                    "url": "https://arxiv.org/pdf/2310.06825.pdf"
                }
            },
            {"type": "text", "text": "What is the main contribution of this paper?"}
        ]
    }]
)
print(response.content[0].text)
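
The URL and base64 variants differ only in the `source` field, so one helper can build either block. A minimal sketch, assuming anything that starts with `http://` or `https://` is a publicly reachable PDF and everything else is a local path; `build_document_block` is a hypothetical name:

```python
import base64
from pathlib import Path


def build_document_block(source: str) -> dict:
    """Return a document content block for a URL or a local file path."""
    if source.startswith(("http://", "https://")):
        # Publicly reachable PDF: pass the URL and let the API fetch it.
        return {"type": "document", "source": {"type": "url", "url": source}}
    # Anything else is treated as a local path and sent inline as base64.
    data = base64.standard_b64encode(Path(source).read_bytes()).decode()
    return {
        "type": "document",
        "source": {"type": "base64", "media_type": "application/pdf", "data": data},
    }
```

The returned dict can take the place of the hand-written document block in the `content` list of either example above.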

Extract structured data from a PDF

import json

def extract_invoice_data(pdf_path: str) -> dict:
    pdf_data = base64.standard_b64encode(Path(pdf_path).read_bytes()).decode()

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "document",
                 "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data}},
                {"type": "text",
                 "text": 'Extract invoice data as JSON: {"invoice_number": str, "date": str, "vendor": str, "total_amount": float, "line_items": [{"description": str, "amount": float}]}. Return only JSON.'}
            ]
        }]
    )
    return json.loads(response.content[0].text)

data = extract_invoice_data("invoice.pdf")
print(f"Invoice #{data['invoice_number']}: ${data['total_amount']}")
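
Even with "Return only JSON" in the prompt, a reply may arrive wrapped in a Markdown code fence or with a field missing, and a bare `json.loads` then fails or returns incomplete data. A minimal sketch of a more tolerant parser, assuming the key names from the prompt above; `parse_invoice_reply`, `REQUIRED_KEYS`, and `FENCE` are hypothetical names:

```python
import json
import re

FENCE = "`" * 3  # the three-backtick Markdown fence marker
REQUIRED_KEYS = {"invoice_number", "date", "vendor", "total_amount", "line_items"}


def parse_invoice_reply(text: str) -> dict:
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    # Drop an optional opening fence (with or without a language tag) and the closing fence.
    cleaned = re.sub(rf"^{FENCE}[a-zA-Z]*\s*", "", cleaned)
    cleaned = re.sub(rf"\s*{FENCE}$", "", cleaned)
    data = json.loads(cleaned)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    return data
```

Passing `response.content[0].text` through this function instead of calling `json.loads` directly turns a malformed reply into an explicit error rather than a `KeyError` later on.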

Cache a large PDF for multiple questions

pdf_data = base64.standard_b64encode(Path("large_report.pdf").read_bytes()).decode()

def ask_about_pdf(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data},
                    "cache_control": {"type": "ephemeral"}  # cache the PDF for 5 min
                },
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text

# First call: cache_write (25% premium)
print(ask_about_pdf("What is the executive summary?"))
# Second call: cache_read (90% discount — large PDF = big savings)
print(ask_about_pdf("List all recommendations."))
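
To confirm that the second call really hit the cache, inspect the `usage` object on the response. A minimal sketch, assuming the usage fields `cache_creation_input_tokens` and `cache_read_input_tokens`; `cache_status` is a hypothetical helper, and the token count in the stand-in object is made up for illustration:

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Classify a response's usage object as a cache read, a cache write, or neither."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read > 0:
        return f"cache read ({read} tokens)"
    if written > 0:
        return f"cache write ({written} tokens)"
    return "no caching"


# Stand-in for response.usage so the sketch runs without an API call.
example = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=48000)
print(cache_status(example))
```

In real code, pass `response.usage` from inside `ask_about_pdf`. If both fields stay at zero, the prompt was not cached, for example because it falls below the minimum cacheable length.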

Caching large PDFs with cache_control is especially cost-effective — see the prompt caching example. For vision-based image analysis, see the vision API example.
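
The savings can be worked out from the two rates quoted in the comments above: a 25% premium on the first (write) call and a 90% discount on later (read) calls. A rough sketch, assuming every call lands within the cache lifetime and counting only the PDF's input tokens, not the question or the reply; `relative_input_cost` is a hypothetical name:

```python
def relative_input_cost(
    questions: int,
    cached: bool,
    write_premium: float = 0.25,
    read_discount: float = 0.90,
) -> float:
    """Cost of sending the PDF `questions` times, in units of one uncached read of the PDF."""
    if questions <= 0:
        return 0.0
    if not cached:
        return float(questions)
    first_call = 1.0 + write_premium                        # one cache write
    later_calls = (questions - 1) * (1.0 - read_discount)   # remaining cache reads
    return first_call + later_calls
```

Under these assumptions, two questions cost about 1.35 units instead of 2, and ten questions cost about 2.15 units instead of 10.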

Frequently asked questions

What is the maximum PDF size Claude can process?
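Requests are limited to 32MB in total, so the base64-encoded PDF plus the rest of the payload must fit within that. The document also has to fit in Claude's 200K-token context window together with your prompt and the response. A typical 50-page PDF is approximately 25K–75K tokens depending on text density.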
Does Claude read scanned (image) PDFs?
Yes. Claude processes each page as an image alongside any text it can extract, so scanned pages are read through its vision capabilities without a separate OCR step on your side. Output quality depends on scan quality. For very low-resolution scans, consider pre-processing with a dedicated OCR tool first.
Can I pass PDF URLs instead of base64?
Yes. Use `"source": {"type": "url", "url": "https://..."}` for publicly accessible PDFs. URL-sourced documents are fetched by Anthropic's servers, so the URL must be reachable from the public internet. For private documents, use base64 encoding.
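
The size figures above allow a quick back-of-the-envelope check before sending a long document. A rough sketch, assuming the 25K–75K tokens per 50 pages quoted above (500 to 1,500 tokens per page) and the 200K-token window; the names and the `reserve` default are hypothetical, and real token counts vary with layout and images:

```python
CONTEXT_WINDOW_TOKENS = 200_000  # figure quoted in the FAQ above
TOKENS_PER_PAGE = (500, 1_500)   # assumption: 25K-75K tokens per 50 pages, from the FAQ


def estimate_tokens(pages: int) -> tuple[int, int]:
    """Return a (low, high) token estimate for a PDF with `pages` pages."""
    low, high = TOKENS_PER_PAGE
    return pages * low, pages * high


def likely_fits(pages: int, reserve: int = 4_096) -> bool:
    """True if even the high estimate leaves `reserve` tokens for the question and reply."""
    return estimate_tokens(pages)[1] + reserve <= CONTEXT_WINDOW_TOKENS
```

This is only an estimate for planning. For a precise figure, count tokens through the API rather than relying on per-page averages.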
