Claude Content Moderation with Python

Build content moderation systems with the Claude API in Python. This guide covers zero-shot moderation, multi-category classifiers, explanation generation, bulk moderation via the Batch API, and policy customization.

Claude performs zero-shot content moderation: describe your policy in a system prompt and Claude applies it with no labeled data, no training pipeline, and no redeployment when your policy changes. This guide covers every pattern from a simple safe/unsafe classifier to bulk Batch API moderation for high-volume pipelines.

Installation

pip install anthropic

Simple safe/unsafe classifier

import anthropic
import json

client = anthropic.Anthropic()

def moderate(text: str) -> dict:
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",  # fast + cheap for moderation
        max_tokens=256,
        system=(
            "You are a content moderator. Return a JSON object with keys: "
            "'flagged' (boolean), 'reason' (one sentence if flagged, else null). "
            "Flag content that is: hateful, violent, sexually explicit, or spam. "
            "No markdown fences."
        ),
        messages=[{"role": "user", "content": f"Moderate this text:\n\n{text}"}]
    )
    return json.loads(message.content[0].text)

result = moderate("Buy cheap meds online! Click here now!!")
# {"flagged": true, "reason": "Spam — unsolicited commercial promotion with urgency cue."}
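The classifier above calls json.loads directly, which raises if the model wraps its reply in a code fence or adds surrounding text despite the instructions. A minimal sketch of a more defensive parser follows. The helper name `parse_moderation_json` and the fail-closed fallback are assumptions for illustration, not part of the Anthropic SDK.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence marker, built indirectly

def parse_moderation_json(raw: str) -> dict:
    """Parse a model reply as a JSON object, tolerating fences and stray text."""
    text = raw.strip()
    # Drop a leading fence (optionally tagged "json") and a trailing fence.
    text = re.sub(rf"^{FENCE}(?:json)?\s*", "", text)
    text = re.sub(rf"\s*{FENCE}$", "", text)

    # Try the whole reply first, then the outermost {...} span inside it.
    candidates = [text]
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        candidates.append(match.group(0))

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed

    # Assumed policy choice: fail closed so unparseable replies get reviewed.
    return {"flagged": True, "reason": "Unparseable moderation response"}
```

To use it, replace `json.loads(message.content[0].text)` with `parse_moderation_json(message.content[0].text)`. Failing closed sends malformed replies to review; a pipeline that prefers availability could fail open instead.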

Multi-category classifier with severity

CATEGORIES = ["hate_speech", "violence", "sexual_explicit", "spam", "harassment", "self_harm"]

def moderate_detailed(text: str) -> dict:
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=(
            "You are a content moderator. Return a JSON object with keys: "
            "'flagged' (boolean), "
            "'categories' (list of matched categories from: " + ", ".join(CATEGORIES) + "), "
            "'severity' ('low'|'medium'|'high'|null), "
            "'action' ('allow'|'review'|'remove'), "
            "'reason' (one sentence or null). "
            "No markdown fences."
        ),
        messages=[{"role": "user", "content": text}]
    )
    return json.loads(message.content[0].text)

result = moderate_detailed("I hate those people, they should all disappear.")
# {"flagged": true, "categories": ["hate_speech"], "severity": "high",
#  "action": "remove", "reason": "Dehumanizing language targeting a group."}
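Because the schema lives only in the prompt, nothing guarantees the reply sticks to the allowed values. A small validation step, sketched below, can catch drift before a decision is acted on. The function name `validate_moderation` and the error messages are hypothetical; the allowed values are copied from the system prompt above.

```python
CATEGORIES = ["hate_speech", "violence", "sexual_explicit", "spam", "harassment", "self_harm"]
SEVERITIES = ("low", "medium", "high", None)
ACTIONS = ("allow", "review", "remove")

def validate_moderation(result: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the result is usable."""
    problems = []
    if not isinstance(result.get("flagged"), bool):
        problems.append("'flagged' must be a boolean")

    categories = result.get("categories")
    if not isinstance(categories, list):
        problems.append("'categories' must be a list")
    else:
        unknown = [c for c in categories if c not in CATEGORIES]
        if unknown:
            problems.append(f"unknown categories: {unknown}")

    # A missing 'severity' key is treated the same as null.
    if result.get("severity") not in SEVERITIES:
        problems.append("'severity' must be low, medium, high, or null")
    if result.get("action") not in ACTIONS:
        problems.append("'action' must be allow, review, or remove")
    return problems
```

A result with a non-empty problem list could be retried or routed to human review rather than acted on automatically.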

Custom policy moderation (domain-specific rules)

COMMUNITY_POLICY = """
You moderate a cooking community. Flag content that:
1. Contains hate speech or harassment
2. Promotes dangerous food practices (e.g., eating raw chicken)
3. Is off-topic spam (e.g., cryptocurrency, supplements unrelated to cooking)
4. Contains explicit sexual content

Do NOT flag: spicy language about food, strong opinions on cuisine, healthy debate.
Return JSON: {"flagged": bool, "rule_violated": str|null, "action": "allow"|"review"|"remove", "reason": str|null}
"""

def moderate_community(text: str) -> dict:
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        system=COMMUNITY_POLICY,
        messages=[{"role": "user", "content": text}]
    )
    return json.loads(message.content[0].text)

print(moderate_community("This risotto recipe is absolutely terrible, just like all Italian food."))
# {"flagged": false, "rule_violated": null, "action": "allow", "reason": null}

print(moderate_community("Buy our weight loss pills! DM for discount."))
# {"flagged": true, "rule_violated": "off-topic spam", "action": "remove", ...}
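If several communities share one moderation service, the policy text can be assembled from data rather than hand-edited. The sketch below assumes a hypothetical helper, `build_policy`, that produces a prompt in the same shape as COMMUNITY_POLICY. The parameter names are illustrative.

```python
def build_policy(community: str, flag_rules: list[str], allow_rules: list[str]) -> str:
    """Assemble a moderation system prompt from lists of rules."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(flag_rules, start=1))
    allowed = ", ".join(allow_rules)
    return (
        f"You moderate {community}. Flag content that:\n"
        f"{numbered}\n\n"
        f"Do NOT flag: {allowed}.\n"
        'Return JSON: {"flagged": bool, "rule_violated": str|null, '
        '"action": "allow"|"review"|"remove", "reason": str|null}'
    )

policy = build_policy(
    "a cooking community",
    ["Contains hate speech or harassment", "Promotes dangerous food practices"],
    ["spicy language about food", "strong opinions on cuisine"],
)
```

The returned string can be passed as the `system` argument in place of COMMUNITY_POLICY, so changing a rule becomes a data change rather than a code change.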

Bulk moderation via Batch API (50% cost reduction)

def bulk_moderate(texts: list[str]) -> str:
    requests = [
        {
            "custom_id": f"item-{i}",
            "params": {
                "model": "claude-haiku-4-5-20251001",
                "max_tokens": 256,
                "system": "Moderate text. Return JSON: {'flagged': bool, 'reason': str|null}. No markdown.",
                "messages": [{"role": "user", "content": text}]
            }
        }
        for i, text in enumerate(texts)
    ]

    batch = client.messages.batches.create(requests=requests)
    print(f"Batch ID: {batch.id} — poll until processing_status == 'ended'")
    return batch.id  # poll later; costs 50% less than real-time API

# Poll for results
import time

def get_batch_results(batch_id: str) -> list[dict]:
    # Block until the batch finishes processing (this can take up to 24h)
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(60)

    results = []
    for result in client.messages.batches.results(batch_id):
        # Errored, canceled, and expired requests are skipped here
        if result.result.type == "succeeded":
            results.append({
                "id": result.custom_id,
                "moderation": json.loads(result.result.message.content[0].text)
            })
    return results
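Batch results should not be assumed to arrive in submission order, and the polling function above drops any request that did not succeed. A hedged sketch of joining results back to the original texts through the `item-{i}` ids follows. The helper name `attach_texts` and the `needs_retry` field are assumptions for illustration.

```python
def attach_texts(texts: list[str], results: list[dict]) -> list[dict]:
    """Join batch results back to their source texts using the 'item-{i}' ids."""
    by_id = {r["id"]: r["moderation"] for r in results}
    merged = []
    for i, text in enumerate(texts):
        moderation = by_id.get(f"item-{i}")
        merged.append({
            "text": text,
            "moderation": moderation,
            # No entry means the request errored, expired, or was canceled.
            "needs_retry": moderation is None,
        })
    return merged
```

Items with `needs_retry` set can be resubmitted in a later batch or sent through the real-time classifier.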

Moderation approach comparison

| Approach | Speed | Cost | Custom policy | Explanation | Best for |
| --- | --- | --- | --- | --- | --- |
| Claude (real-time) | ~0.5–1s | ~$0.08/1K items (Haiku) | Yes (prompt) | Yes | Nuanced community guidelines |
| Claude (Batch API) | Up to 24h | ~$0.04/1K items | Yes (prompt) | Yes | Daily content pipelines |
| OpenAI Moderation API | ~100ms | Free | No | No | Commodity safety screening |
| Keyword filter | ~1ms | Free | Via list | No | High-volume pre-filter |
| Fine-tuned BERT | ~50ms | Hosting cost | Retrain required | Limited | Fixed-policy high volume |

For high-volume pipelines, combine approaches: run a keyword pre-filter to block obvious content (free, ~1ms), then Claude on ambiguous items (~10–30% of traffic). This reduces Claude API calls by 70–90% while maintaining accuracy on nuanced content.
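A minimal sketch of that two-stage routing is shown here. The blocklist terms, the "suspicious" cues, and the function name `hybrid_moderate` are all hypothetical, and treating text with no suspicious cues as allowed is an assumed design choice. The Claude call is passed in as a function so the routing can be tested without network access.

```python
import re
from typing import Callable

# Hypothetical patterns; a real deployment would tune these to its own traffic.
BLOCKLIST = re.compile(r"\b(viagra|free money|click here)\b", re.IGNORECASE)
SUSPICIOUS = re.compile(r"(https?://|\bdm\b|!{2,})", re.IGNORECASE)

def hybrid_moderate(text: str, llm_moderate: Callable[[str], dict]) -> dict:
    """Decide obvious cases with regexes; escalate ambiguous text to the model."""
    if BLOCKLIST.search(text):
        return {"flagged": True, "reason": "Matched keyword blocklist", "source": "keyword"}
    if not SUSPICIOUS.search(text):
        # Assumption: text with no suspicious cues is allowed without a model call.
        return {"flagged": False, "reason": None, "source": "keyword"}
    result = dict(llm_moderate(text))  # copy so the caller's dict is not mutated
    result["source"] = "llm"
    return result
```

Calling `hybrid_moderate(text, moderate)` with the `moderate` function defined earlier sends only the ambiguous middle band to Claude. The `source` field records which stage made each decision, which helps when auditing the pre-filter's false negatives.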

Estimate your moderation pipeline costs with the Claude API Cost Calculator. For classifying content into categories, see the text classification guide.
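For a quick back-of-the-envelope figure, the per-item costs from the comparison table can be combined with the share of traffic that reaches Claude. The sketch below takes the table's approximate rates as given; they are estimates, not published prices, and `estimate_cost` is a hypothetical helper.

```python
REALTIME_PER_1K = 0.08  # USD per 1,000 items, approximate figure from the table
BATCH_PER_1K = 0.04     # USD per 1,000 items via the Batch API, approximate

def estimate_cost(items: int, escalation_rate: float = 1.0, use_batch: bool = False) -> float:
    """Estimate spend in USD for the share of items that reach Claude."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    rate = BATCH_PER_1K if use_batch else REALTIME_PER_1K
    return round(items * escalation_rate / 1000 * rate, 2)

print(estimate_cost(1_000_000))                                       # 80.0
print(estimate_cost(1_000_000, escalation_rate=0.2, use_batch=True))  # 8.0
```

Under these assumed rates, one million items cost about $80 in real time, and about $8 when a pre-filter escalates 20% of traffic and the rest goes through the Batch API.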

Frequently asked questions

Can Claude moderate content without a training dataset?
Yes. Claude performs zero-shot content moderation — you describe your policy in the system prompt and Claude applies it immediately with no labeled data or fine-tuning required. This is the key advantage over rule-based filters (keyword lists) and supervised classifiers (BERT fine-tunes).
How accurate is Claude content moderation vs dedicated APIs?
Claude generally handles context-dependent content (sarcasm, coded language, cultural nuance) better than keyword filters, and can be competitive with fine-tuned classifiers such as BERT without any training data. It trails specialized moderation APIs (OpenAI Moderation, Perspective API) on speed and cost for high-volume commodity content such as spam. For nuanced community guidelines or multi-label policies, Claude is typically the more accurate option, though you should measure this on a sample of your own content.
How do I add domain-specific rules to Claude moderation?
Add them to the system prompt in plain English: 'Flag any content that promotes gambling products, even if not explicitly profane.' Claude treats your policy description as the ground truth — no retraining needed. Update the system prompt to change policy instantly across all future calls.
What is the cheapest way to bulk-moderate content with Claude?
Use the Batch API (`client.messages.batches.create`). It costs 50% less than the real-time Messages API and is ideal for moderating queued content: user-generated posts, comment queues, and daily content pipelines. Results are available within 24h.
How do I get explanations for moderation decisions?
Ask Claude to return JSON with a `reason` field: `{"flagged": true, "category": "hate_speech", "reason": "one-sentence explanation"}`. Explanations are valuable for user appeals, moderator review queues, and audit logs. They add only a small token cost, since a one-sentence reason is short.
