Claude Structured Data Extraction with Python

Working Python code to extract structured data from unstructured text using the Claude API in 2026. Extract entities, invoice fields, contract terms, and custom schemas to JSON.


Claude can extract structured data from unstructured text such as invoices, contracts, emails, research papers, and support tickets. No training data is required: describe your schema in the prompt and ask for JSON.

Installation

pip install anthropic pydantic

Minimal extraction — key-value pairs from any text

import anthropic
import json

client = anthropic.Anthropic()

def extract(text: str, schema_description: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        temperature=0,
        system=(
            "You are a data extraction engine. "
            f"Extract the following fields and return ONLY valid JSON: {schema_description}. "
            "Use null for missing fields."
        ),
        messages=[{"role": "user", "content": text}],
    )
    return json.loads(response.content[0].text)

# Example: extract contact info from an email signature
email_sig = """
John Smith
Senior Engineer, Acme Corp
📧 john.smith@acme.com | 📞 +1 (555) 123-4567
linkedin.com/in/johnsmith
"""

result = extract(
    email_sig,
    "name (string), title (string), company (string), email (string), phone (string), linkedin_url (string)"
)
print(result)
# {
#   "name": "John Smith",
#   "title": "Senior Engineer",
#   "company": "Acme Corp",
#   "email": "john.smith@acme.com",
#   "phone": "+1 (555) 123-4567",
#   "linkedin_url": "linkedin.com/in/johnsmith"
# }
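
The `extract` function above passes the reply straight to `json.loads()`, which raises if the model wraps its answer in a Markdown code fence. A minimal sketch of a more tolerant parser follows; `parse_json_response` is a hypothetical helper name, not part of the Anthropic SDK, and it only handles the fenced-reply case.

```python
import json
import re

# Matches a reply that is entirely wrapped in a ``` or ```json fence.
_FENCE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)


def parse_json_response(raw: str):
    """Parse a model reply as JSON, tolerating a surrounding Markdown fence."""
    cleaned = raw.strip()
    match = _FENCE.match(cleaned)
    if match:
        cleaned = match.group(1)  # keep only the payload inside the fence
    return json.loads(cleaned)    # still raises json.JSONDecodeError on bad JSON


print(parse_json_response('```json\n{"name": "John Smith"}\n```'))
```

Swapping this in for the bare `json.loads(...)` call keeps the rest of `extract` unchanged.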

Invoice / receipt extraction with Pydantic validation

from pydantic import BaseModel
from typing import Optional
import anthropic, json

client = anthropic.Anthropic()

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: Optional[str]
    invoice_date: Optional[str]  # ISO 8601
    vendor_name: Optional[str]
    vendor_address: Optional[str]
    customer_name: Optional[str]
    line_items: list[LineItem]
    subtotal: Optional[float]
    tax: Optional[float]
    total_due: Optional[float]
    currency: Optional[str]
    due_date: Optional[str]

def extract_invoice(text: str) -> Invoice:
    # Pydantic v2: model_json_schema() replaces the deprecated schema_json()
    schema_json = json.dumps(Invoice.model_json_schema(), indent=2)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        temperature=0,
        system=(
            "You are an invoice extraction engine. "
            f"Extract all invoice fields and return ONLY valid JSON matching this schema:\n{schema_json}. "
            "Use null for missing fields."
        ),
        messages=[{"role": "user", "content": text}],
    )
    data = json.loads(response.content[0].text)
    return Invoice.model_validate(data)  # Pydantic v2: raises ValidationError on type mismatch

invoice_text = """
INVOICE #INV-2026-0042
Date: 2026-05-10  |  Due: 2026-06-10
Vendor: TechSupplies Ltd, 123 Main St, Austin TX 78701

Bill To: Acme Corp, 456 Oak Ave, Denver CO 80201

Services:
  Cloud storage setup      1 unit   $500.00    $500.00
  API integration dev      8 hours  $150.00  $1,200.00
  Support (3 months)       1 unit   $300.00    $300.00

Subtotal: $2,000.00  |  Tax (8.25%): $165.00  |  Total Due: $2,165.00
"""

invoice = extract_invoice(invoice_text)
print(invoice.total_due)    # 2165.0
print(invoice.line_items[0].description)  # "Cloud storage setup"
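
Pydantic checks types, but it does not check that the extracted numbers agree with each other. The sketch below adds an arithmetic consistency check on a plain dict. `check_invoice_totals` and its `tolerance` parameter are hypothetical names chosen for illustration, and the field names are assumed to match the `Invoice` model above.

```python
import math


def check_invoice_totals(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of arithmetic inconsistencies (empty if none were found)."""
    problems = []
    items = invoice.get("line_items") or []

    # Each line: quantity x unit_price should equal the stated line total.
    for item in items:
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=tolerance):
            problems.append(
                f"{item['description']}: expected {expected}, got {item['total']}"
            )

    subtotal = invoice.get("subtotal")
    tax = invoice.get("tax")
    total_due = invoice.get("total_due")

    # Line totals should add up to the subtotal.
    if items and subtotal is not None:
        line_sum = sum(item["total"] for item in items)
        if not math.isclose(line_sum, subtotal, abs_tol=tolerance):
            problems.append(f"line items sum to {line_sum}, subtotal is {subtotal}")

    # Subtotal plus tax should equal the total due.
    if None not in (subtotal, tax, total_due):
        if not math.isclose(subtotal + tax, total_due, abs_tol=tolerance):
            problems.append(f"subtotal + tax is {subtotal + tax}, total due is {total_due}")

    return problems


sample = {
    "line_items": [
        {"description": "Cloud storage setup", "quantity": 1, "unit_price": 500.0, "total": 500.0},
        {"description": "API integration dev", "quantity": 8, "unit_price": 150.0, "total": 1200.0},
        {"description": "Support (3 months)", "quantity": 1, "unit_price": 300.0, "total": 300.0},
    ],
    "subtotal": 2000.0,
    "tax": 165.0,
    "total_due": 2165.0,
}
print(check_invoice_totals(sample))
```

With Pydantic v2 the same check could be run on `invoice.model_dump()`. A non-empty result is a reasonable trigger for a retry or for human review.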

Named entity recognition (NER) — people, orgs, places, dates

import anthropic, json

client = anthropic.Anthropic()

def extract_entities(text: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        temperature=0,
        system=(
            "Extract named entities and return ONLY valid JSON: "
            '{"people": [], "organizations": [], "locations": [], "dates": [], "monetary_values": []}'
        ),
        messages=[{"role": "user", "content": text}],
    )
    return json.loads(response.content[0].text)

news = (
    "Apple CEO Tim Cook announced a $500M investment in Austin, Texas on May 10, 2026, "
    "partnering with Dell Technologies to expand AI infrastructure."
)
entities = extract_entities(news)
print(entities["people"])         # ["Tim Cook"]
print(entities["organizations"])  # ["Apple", "Dell Technologies"]
print(entities["locations"])      # ["Austin, Texas"]
print(entities["monetary_values"])  # ["$500M"]
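
One cheap safeguard for entity extraction is to confirm that every returned entity actually appears in the source text. The sketch below does a case-insensitive substring check. `ungrounded_entities` is a hypothetical helper, and it assumes the model returns entities verbatim, so normalized values (for example a date rewritten as `2026-05-10`) would be flagged too.

```python
def ungrounded_entities(source_text: str, entities: dict) -> dict:
    """Return, per label, the extracted entities that do not occur in the text."""
    haystack = source_text.casefold()
    missing = {}
    for label, values in entities.items():
        absent = [value for value in values if value.casefold() not in haystack]
        if absent:
            missing[label] = absent
    return missing


news_text = (
    "Apple CEO Tim Cook announced a $500M investment in Austin, Texas on May 10, 2026, "
    "partnering with Dell Technologies to expand AI infrastructure."
)
candidate = {
    "people": ["Tim Cook"],
    "organizations": ["Apple", "Dell Technologies", "Microsoft"],  # "Microsoft" is not in the text
    "locations": ["Austin, Texas"],
}
print(ungrounded_entities(news_text, candidate))
```

An empty dict means every entity was found in the text; anything else is worth logging or reviewing.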

Contract clause extraction

import anthropic, json

client = anthropic.Anthropic()

def extract_contract_terms(contract_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        temperature=0,
        system=(
            "You are a contract analysis engine. Extract key terms and return ONLY valid JSON: "
            '{"parties": [], "effective_date": null, "termination_date": null, '
            '"payment_terms": null, "notice_period_days": null, '
            '"governing_law": null, "non_compete_months": null, '
            '"liability_cap": null, "auto_renewal": null}'
        ),
        # Keeps only the first 8,000 characters (not tokens); clauses after that point are not seen
        messages=[{"role": "user", "content": contract_text[:8000]}],
    )
    return json.loads(response.content[0].text)
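
Truncating at 8,000 characters means terms that appear late in a long contract are never extracted. One alternative, sketched here under assumptions, is to split the contract into overlapping chunks, run the extraction on each chunk, and merge the partial results. `chunk_text` and `merge_terms` are hypothetical helpers, and the merge rule (first non-null scalar wins, lists are unioned) is a design choice rather than a recommendation from the API documentation.

```python
def chunk_text(text: str, size: int = 8000, overlap: int = 500) -> list[str]:
    """Split text into chunks of at most `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the previous chunk already reached the end of the text.
    return [text[start:start + size] for start in range(0, max(len(text) - overlap, 1), step)]


def merge_terms(partials: list[dict]) -> dict:
    """Merge per-chunk results: union lists, keep the first non-null scalar."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            if isinstance(value, list):
                bucket = merged.get(key)
                if not isinstance(bucket, list):
                    bucket = merged[key] = []
                for entry in value:
                    if entry not in bucket:
                        bucket.append(entry)
            elif merged.get(key) is None:
                merged[key] = value
    return merged


print([len(chunk) for chunk in chunk_text("a" * 20, size=10, overlap=2)])
```

In use, each chunk would be passed to `extract_contract_terms` and the resulting dicts handed to `merge_terms`. Conflicting values across chunks are silently resolved in favor of the earliest one, so critical fields may need a stricter rule.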

Batch extraction with the Batch API (50% discount)

import anthropic

client = anthropic.Anthropic()

documents = ["invoice text 1...", "invoice text 2..."]  # in practice, thousands of docs

requests = [
    {
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 512,
            "temperature": 0,
            "system": "Extract vendor, amount, date. Return ONLY JSON: {vendor, amount, date}",
            "messages": [{"role": "user", "content": doc}],
        },
    }
    for i, doc in enumerate(documents)
]

batch = client.messages.batches.create(requests=requests)
# Batches are processed asynchronously (typically within 24 hours) at 50% of standard pricing
print(f"Batch submitted: {batch.id}")
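
Submitting the batch only returns an ID; the results have to be fetched once processing ends. The sketch below polls the batch status and then maps each `custom_id` to its parsed JSON. `collect_batch_results` and `poll_seconds` are hypothetical names; the SDK calls it relies on are `client.messages.batches.retrieve` and `client.messages.batches.results`, and the client is passed in as a parameter so the function can be exercised with a stub.

```python
import json
import time


def collect_batch_results(client, batch_id: str, poll_seconds: float = 60.0) -> dict:
    """Wait until the batch has ended, then return {custom_id: parsed JSON or None}."""
    # processing_status becomes "ended" once every request has finished or expired.
    while client.messages.batches.retrieve(batch_id).processing_status != "ended":
        time.sleep(poll_seconds)

    extracted = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type != "succeeded":
            extracted[entry.custom_id] = None  # errored, canceled, or expired
            continue
        try:
            extracted[entry.custom_id] = json.loads(entry.result.message.content[0].text)
        except json.JSONDecodeError:
            extracted[entry.custom_id] = None  # reply was not valid JSON
    return extracted
```

With the batch submitted above, `collect_batch_results(client, batch.id)` would block until the batch ends. Entries mapped to `None` are candidates for resubmission.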

Extraction cost reference

| Document type | Avg input tokens | Output tokens | Cost / doc (Sonnet) | Cost / doc (Haiku) |
| --- | --- | --- | --- | --- |
| Short email | 300 | 150 | $0.00054 | $0.000094 |
| Invoice (1 page) | 800 | 300 | $0.00180 | $0.000275 |
| Contract (10 pages) | 8,000 | 500 | $0.01575 | $0.00213 |
| Research paper | 15,000 | 600 | $0.02850 | $0.00390 |
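
Per-document cost is a function of token counts and per-token rates, so it can be recomputed whenever pricing changes. The sketch below shows the arithmetic. `estimate_cost` is a hypothetical helper, and the rates in the example call are placeholders for illustration only, not quoted prices and not necessarily the rates used to build the table above; substitute the current published rates for your model.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    batch_discount: float = 0.0,
) -> float:
    """Estimate the USD cost of one request from token counts and per-million-token rates."""
    base = (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return base * (1.0 - batch_discount)


# Placeholder rates (assumption): 3.00 USD per million input tokens, 15.00 per million output.
per_doc = estimate_cost(800, 300, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"per document: ${per_doc:.5f}, per 10,000 documents: ${per_doc * 10_000:.2f}")
```

Passing `batch_discount=0.5` models the Batch API discount described earlier.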

Use the Claude API Cost Calculator to price your specific document volume. For classification (labeling documents rather than extracting fields), see the text classification guide. For summarizing long documents, see the summarization guide.

Frequently asked questions

How do I extract structured data from unstructured text with Claude?
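Describe your target schema in the system prompt and ask Claude to return JSON. Set `temperature=0` for more consistent output (it reduces variation but does not guarantee identical results across calls) and parse with `json.loads()`. Prompting alone does not guarantee schema compliance, so wrap the call, validate the result, and retry if parsing fails. Malformed JSON is uncommon for simple schemas at `temperature=0`, but a production pipeline should still handle it.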
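
The retry step can be kept separate from the API call by passing the call in as a function. A minimal sketch, assuming a hypothetical `extract_with_retry` helper and a caller-supplied `call_model` that returns the raw reply text:

```python
import json
from typing import Callable


def extract_with_retry(call_model: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call the model until its reply parses as JSON, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err  # remember the failure and try again
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error


# Stub standing in for the API call: the first reply is malformed, the second is valid.
replies = iter(["Sure! Here is the JSON:", '{"vendor": "TechSupplies Ltd"}'])
print(extract_with_retry(lambda: next(replies)))
```

In a real pipeline `call_model` would wrap `client.messages.create(...)` and return `response.content[0].text`. Retrying costs another request each time, so a small `max_attempts` is usually sensible.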
Can Claude extract data from PDFs and images?
Yes. Use Claude's vision API to send PDF pages as base64-encoded images, or use the Files API to upload a PDF directly. Claude can extract tables, form fields, and text from scanned documents without OCR preprocessing.
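
For a PDF sent inline, the Messages API accepts a `document` content block with a base64 source alongside the text instruction. The sketch below only builds the `messages` payload; `pdf_message` is a hypothetical helper name, and the result would be passed as `messages=` to `client.messages.create(...)` with a model that supports PDF input.

```python
import base64


def pdf_message(pdf_bytes: bytes, instruction: str) -> list[dict]:
    """Build a single-turn messages list containing a PDF and an extraction instruction."""
    encoded = base64.standard_b64encode(pdf_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": encoded,
                    },
                },
                {"type": "text", "text": instruction},
            ],
        }
    ]


messages = pdf_message(b"%PDF-1.4 example bytes", "Extract vendor, amount, and date as JSON.")
print(messages[0]["content"][0]["source"]["media_type"])
```

In practice `pdf_bytes` would come from reading the file in binary mode, for example `open(path, "rb").read()`.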
How do I handle missing or ambiguous fields in extracted data?
Instruct Claude to use `null` for missing fields rather than omitting them. Add a field like `confidence: 0.0-1.0` for each value so you can flag low-confidence extractions for human review.
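
The confidence idea can be turned into a simple routing step. The sketch below assumes the prompt asked for each field as an object of the form `{"value": ..., "confidence": 0.0-1.0}`; that shape, the `split_by_confidence` name, and the 0.8 threshold are all illustrative assumptions. Model-reported confidence is a self-estimate, so the threshold is best tuned against labeled samples.

```python
def split_by_confidence(extraction: dict, threshold: float = 0.8) -> tuple[dict, list[str]]:
    """Separate fields into accepted values and field names that need human review."""
    accepted = {}
    needs_review = []
    for field, entry in extraction.items():
        # A missing confidence is treated as 0.0, which always routes to review.
        if entry.get("confidence", 0.0) < threshold:
            needs_review.append(field)
        else:
            accepted[field] = entry.get("value")
    return accepted, needs_review


reply = {
    "vendor": {"value": "TechSupplies Ltd", "confidence": 0.97},
    "due_date": {"value": "2026-06-10", "confidence": 0.55},
    "po_number": {"value": None, "confidence": 0.9},
}
print(split_by_confidence(reply))
```

Here a confidently missing field (`po_number`) is accepted as `None`, while a low-confidence value is held back for review.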
What is the cost of extracting data from 10,000 documents with Claude?
At roughly $0.0018 per one-page invoice with Claude Sonnet (the figure in the cost reference above), 10,000 invoices cost about $18. The Batch API (50% discount) brings bulk processing down to about $9 for 10,000 invoices. Claude Haiku lowers the cost further for simpler documents.
How reliable is Claude for production data extraction pipelines?
Claude is generally reliable at `temperature=0` when the schema is well specified, but treat its output as untrusted input. Validate with Pydantic to enforce types and required fields, and retry on JSON parse errors, which are rare but do occur. For critical extraction (financial, medical), add a confidence threshold and route low-confidence items to human review.
