Claude Structured Data Extraction Python (2026 Working Example)

Working Python code to extract structured data from unstructured text using the Claude API in 2026. Extract entities, invoice fields, contract terms, and custom schemas to JSON.

Claude is highly accurate at extracting structured data from unstructured text — invoices, contracts, emails, research papers, support tickets. No training data required; just describe your schema in the prompt.

Installation

Minimal extraction — key-value pairs from any text

Invoice / receipt extraction with Pydantic validation

Named entity recognition (NER) — people, orgs, places, dates

Contract clause extraction

Batch extraction with the Batch API (50% cost)

Extraction cost reference

Frequently asked questions

Document type	Avg input tokens	Output tokens	Cost / doc (Sonnet)	Cost / doc (Haiku)
Short email	300	150	$0.00054	$0.000094
Invoice (1 page)	800	300	$0.00180	$0.000275
Contract (10 pages)	8,000	500	$0.01575	$0.00213
Research paper	15,000	600	$0.02850	$0.00390

How do I extract structured data from unstructured text with Claude?

Describe your target schema in the system prompt and ask Claude to return JSON. Set `temperature=0` for determinism and parse with `json.loads()`. For guaranteed schema compliance, wrap the call and retry if parsing fails — Claude's JSON compliance at temperature=0 is >99% for simple schemas.

Can Claude extract data from PDFs and images?

Yes. Use Claude's vision API to send PDF pages as base64-encoded images, or use the Files API to upload a PDF directly. Claude can extract tables, form fields, and text from scanned documents without OCR preprocessing.

How do I handle missing or ambiguous fields in extracted data?

Instruct Claude to use `null` for missing fields rather than omitting them. Add a field like `confidence: 0.0-1.0` for each value so you can flag low-confidence extractions for human review.

What is the cost of extracting data from 10,000 documents with Claude?

A typical 1,000-token invoice + 200-token JSON output costs ~$0.0018 with Claude Sonnet. 10,000 invoices = ~$18. Use the Batch API (50% discount) for bulk processing: ~$9 for 10,000 invoices. Claude Haiku halves the cost again for simpler documents.

How reliable is Claude for production data extraction pipelines?

Very reliable at temperature=0 for well-specified schemas. Add validation with Pydantic to enforce types and required fields. Retry on JSON parse errors (rare). For critical extraction (financial, medical), add a confidence threshold and route low-confidence items to human review.

Claude Structured Data Extraction with Python