Build content moderation systems with the Claude API in Python. Zero-shot moderation, multi-category classifiers, explanation generation, bulk moderation via Batch API, and policy customization.
Claude performs zero-shot content moderation: describe your policy in a system prompt and Claude applies it with no labeled data, no training pipeline, and no redeployment when your policy changes. This guide covers every pattern from a simple safe/unsafe classifier to bulk Batch API moderation for high-volume pipelines.
pip install anthropic
import anthropic
import json
client = anthropic.Anthropic()
def moderate(text: str) -> dict:
message = client.messages.create(
model="claude-haiku-4-5-20251001", # fast + cheap for moderation
max_tokens=256,
system=(
"You are a content moderator. Return a JSON object with keys: "
"'flagged' (boolean), 'reason' (one sentence if flagged, else null). "
"Flag content that is: hateful, violent, sexually explicit, or spam. "
"No markdown fences."
),
messages=[{"role": "user", "content": f"Moderate this text:
{text}"}]
)
return json.loads(message.content[0].text)
result = moderate("Buy cheap meds online! Click here now!!")
# {"flagged": true, "reason": "Spam — unsolicited commercial promotion with urgency cue."}
CATEGORIES = ["hate_speech", "violence", "sexual_explicit", "spam", "harassment", "self_harm"]
def moderate_detailed(text: str) -> dict:
message = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=512,
system=(
"You are a content moderator. Return a JSON object with keys: "
"'flagged' (boolean), "
"'categories' (list of matched categories from: " + ", ".join(CATEGORIES) + "), "
"'severity' ('low'|'medium'|'high'|null), "
"'action' ('allow'|'review'|'remove'), "
"'reason' (one sentence or null). "
"No markdown fences."
),
messages=[{"role": "user", "content": text}]
)
return json.loads(message.content[0].text)
result = moderate_detailed("I hate those people, they should all disappear.")
# {"flagged": true, "categories": ["hate_speech"], "severity": "high",
# "action": "remove", "reason": "Dehumanizing language targeting a group."}
COMMUNITY_POLICY = """
You moderate a cooking community. Flag content that:
1. Contains hate speech or harassment
2. Promotes dangerous food practices (e.g., eating raw chicken)
3. Is off-topic spam (e.g., cryptocurrency, supplements unrelated to cooking)
4. Contains explicit sexual content
Do NOT flag: spicy language about food, strong opinions on cuisine, healthy debate.
Return JSON: {"flagged": bool, "rule_violated": str|null, "action": "allow"|"review"|"remove", "reason": str|null}
"""
def moderate_community(text: str) -> dict:
message = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=256,
system=COMMUNITY_POLICY,
messages=[{"role": "user", "content": text}]
)
return json.loads(message.content[0].text)
print(moderate_community("This risotto recipe is absolutely terrible, just like all Italian food."))
# {"flagged": false, "rule_violated": null, "action": "allow", "reason": null}
print(moderate_community("Buy our weight loss pills! DM for discount."))
# {"flagged": true, "rule_violated": "off-topic spam", "action": "remove", ...}
def bulk_moderate(texts: list[str]) -> list[dict]:
requests = [
{
"custom_id": f"item-{i}",
"params": {
"model": "claude-haiku-4-5-20251001",
"max_tokens": 256,
"system": "Moderate text. Return JSON: {'flagged': bool, 'reason': str|null}. No markdown.",
"messages": [{"role": "user", "content": text}]
}
}
for i, text in enumerate(texts)
]
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id} — poll until processing_status == 'ended'")
return batch.id # poll later; costs 50% less than real-time API
# Poll for results
def get_batch_results(batch_id: str) -> list[dict]:
import time
while True:
batch = client.messages.batches.retrieve(batch_id)
if batch.processing_status == "ended":
break
time.sleep(60)
results = []
for result in client.messages.batches.results(batch_id):
if result.result.type == "succeeded":
results.append({
"id": result.custom_id,
"moderation": json.loads(result.result.message.content[0].text)
})
return results
| Approach | Speed | Cost | Custom policy | Explanation | Best for |
|---|---|---|---|---|---|
| Claude (real-time) | ~0.5–1s | ~$0.08/1K items (Haiku) | Yes (prompt) | Yes | Nuanced community guidelines |
| Claude (Batch API) | Up to 24h | ~$0.04/1K items | Yes (prompt) | Yes | Daily content pipelines |
| OpenAI Moderation API | ~100ms | Free | No | No | Commodity safety screening |
| Keyword filter | ~1ms | Free | Via list | No | High-volume pre-filter |
| Fine-tuned BERT | ~50ms | Hosting cost | Retrain required | Limited | Fixed-policy high volume |
For high-volume pipelines, combine approaches: run a keyword pre-filter to block obvious content (free, ~1ms), then Claude on ambiguous items (~10–30% of traffic). This reduces Claude API calls by 70–90% while maintaining accuracy on nuanced content.
Estimate your moderation pipeline costs with the Claude API Cost Calculator. For classifying content into categories, see the text classification guide.