What each tool actually classifies (and the marketing language you should ignore)
**Perspective API** is a single-utterance classifier that returns a probability between 0 and 1 for each of six production attributes (TOXICITY, SEVERE_TOXICITY, IDENTITY_ATTACK, INSULT, PROFANITY, THREAT) plus a handful of experimental attributes. It is free at the default 1 QPS rate limit, with higher limits granted on request per https://perspectiveapi.com/. Its models were trained on comments from online forums such as Wikipedia and The New York Times, which means it works well on Anglophone news-comment-style toxicity and is genuinely weaker on slang, code-switched language, and reclaimed slurs. Treat the probability as a score, not a verdict: picking the right threshold for your traffic is the entire job.
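To make "score, not verdict" concrete, here is a minimal sketch of per-attribute thresholding on a Perspective response. The request and response field names follow the documented `comments:analyze` JSON shape. The threshold values, the helper names `build_request` and `exceeded`, and the sample response are all hypothetical and exist only for illustration.

```python
import json

# Documented Perspective endpoint; the API key goes in the `key` query parameter.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

# Hypothetical per-attribute thresholds; tune these on your own labeled traffic.
THRESHOLDS = {"TOXICITY": 0.8, "THREAT": 0.6}


def build_request(text, attributes=THRESHOLDS):
    """Build the JSON body for a comments:analyze call."""
    return {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {name: {} for name in attributes},
    }


def exceeded(response, thresholds=THRESHOLDS):
    """Return the attributes whose summary score meets or exceeds its threshold."""
    scores = response.get("attributeScores", {})
    hits = {}
    for name, limit in thresholds.items():
        value = scores.get(name, {}).get("summaryScore", {}).get("value")
        if value is not None and value >= limit:
            hits[name] = value
    return hits


# Hand-written sample in the shape of a Perspective response (not real output).
sample = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.91, "type": "PROBABILITY"}},
        "THREAT": {"summaryScore": {"value": 0.12, "type": "PROBABILITY"}},
    }
}

print(json.dumps(build_request("example comment")))
print(exceeded(sample))
```

Keeping the thresholds in one dictionary makes it cheap to re-tune them as your traffic changes, which is where most of the real work sits.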
**Detoxify** is the open-source family of PyTorch checkpoints maintained by Unitary at https://github.com/unitaryai/detoxify. It ships three models: original (6 classes, trained on the Jigsaw 2018 challenge), unbiased (7 classes, trained on Jigsaw 2019 unintended bias), and multilingual (7 classes across seven languages from Jigsaw 2020). The unbiased variant is the one most teams want, because it explicitly penalizes false positives on identity subgroups. You self-host it, which means you also own latency, scaling, GPU costs, and patch upgrades. Per the repo, the unbiased model reports roughly 0.93 AUC on Civil Comments validation, which is strong. AUC measures ranking quality across all thresholds, though, so it overstates the precision you will see at a single deployment threshold when toxic messages are a small share of traffic.
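The gap between a strong AUC and real-world precision is easiest to see with arithmetic. The sketch below computes the precision implied by a fixed operating point at different base rates. The true-positive rate, false-positive rate, and prevalence values are hypothetical and are not measurements of Detoxify or any other model.

```python
def precision_at_operating_point(prevalence, tpr, fpr):
    """Precision implied by a true-positive rate, false-positive rate, and base rate."""
    true_pos = prevalence * tpr
    false_pos = (1.0 - prevalence) * fpr
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point: catches 90% of toxic messages and
# wrongly flags 5% of clean ones.
for prevalence in (0.5, 0.05, 0.01):
    p = precision_at_operating_point(prevalence, tpr=0.90, fpr=0.05)
    print(f"toxic share {prevalence:.0%}: precision {p:.2f}")
```

With the same classifier behavior, precision falls from about 0.95 on a balanced set to about 0.15 when only 1% of messages are toxic, which is why a benchmark number alone cannot tell you how many false positives your moderators will see.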
**OpenAI Moderation** is the omni-moderation-latest endpoint documented at https://platform.openai.com/docs/guides/moderation. It returns boolean flags plus continuous scores for 13 categories: harassment, harassment/threatening, hate, hate/threatening, illicit, illicit/violent, self-harm and its two sub-types (self-harm/intent, self-harm/instructions), sexual, sexual/minors, violence, and violence/graphic. Crucially it is multilingual across 40-plus languages and accepts image inputs, which most competitors do not. It is free for anyone using the OpenAI API and is the default first-pass guardrail for most teams shipping on GPT-4o, GPT-4.1, or GPT-5. The honest caveat: OpenAI does not publish a Civil Comments leaderboard number, so you cannot apples-to-apples compare to Perspective or Detoxify without running your own eval.
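Because the endpoint returns both flags and scores, it helps to log which categories fired and how strongly, not only the top-level boolean. A minimal sketch, assuming the REST JSON shape (`results[0].categories` and `results[0].category_scores`); the helper name `flagged_categories` and the sample payload are hypothetical and hand-written, not real API output.

```python
def flagged_categories(response, min_score=0.0):
    """List (category, score) pairs the endpoint flagged, highest score first."""
    result = response["results"][0]
    scores = result["category_scores"]
    hits = [
        (name, scores[name])
        for name, is_flagged in result["categories"].items()
        if is_flagged and scores[name] >= min_score
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hand-written sample in the shape of a moderation response (truncated to
# three categories for brevity; a real response lists all of them).
sample = {
    "model": "omni-moderation-latest",
    "results": [
        {
            "flagged": True,
            "categories": {
                "harassment": True,
                "harassment/threatening": True,
                "violence": False,
            },
            "category_scores": {
                "harassment": 0.82,
                "harassment/threatening": 0.91,
                "violence": 0.30,
            },
        }
    ],
}

for name, score in flagged_categories(sample):
    print(f"{name}: {score:.2f}")
```

The optional `min_score` argument is one way to apply your own stricter cutoff on top of the endpoint's built-in flags; whether you need it depends on the eval you run on your own traffic.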
**AWS Comprehend Toxicity** at https://aws.amazon.com/comprehend/ is a pay-per-request managed service that returns scores for seven categories (HATE_SPEECH, GRAPHIC, HARASSMENT_OR_ABUSE, SEXUAL, VIOLENCE_OR_THREAT, INSULT, and PROFANITY) plus an overall toxicity score. Pricing per https://aws.amazon.com/comprehend/pricing/ is approximately $0.0001 per 100 characters as of June 2026, which sounds small until you do the math: a million daily messages averaging 200 characters is 2 million billing units a day, or about $200 per day and roughly $6,000 per month. The single biggest limitation is that, as of the same date (verify at the docs page above), it is English-only. If you serve a non-English market, you are using something else, full stop.
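The cost estimate above is worth keeping as a small function so you can plug in your own volumes. This is a back-of-the-envelope sketch: it assumes each message is billed in 100-character units rounded up, uses a 30-day month, and ignores any per-request minimum charge or free tier, so check the pricing page before relying on it. The function name `monthly_cost_usd` is hypothetical.

```python
import math


def monthly_cost_usd(messages_per_day, avg_chars, price_per_unit=0.0001,
                     chars_per_unit=100, days=30):
    """Estimate monthly spend from daily volume and average message length."""
    # Assumption: each message is billed in whole units, rounded up.
    units_per_message = math.ceil(avg_chars / chars_per_unit)
    return messages_per_day * units_per_message * price_per_unit * days


# The scenario from the text: one million messages a day, 200 characters each.
print(round(monthly_cost_usd(1_000_000, 200), 2))
```

Running the same function with your real average message length is the quickest way to see whether per-character pricing or a self-hosted model is cheaper at your scale.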
**Azure AI Content Safety** at https://azure.microsoft.com/products/ai-services/ai-content-safety is Microsoft's enterprise guardrail platform. The text moderation API returns a 0-to-7 severity score across four harm categories (Hate, Sexual, Violence, Self-Harm) rather than a binary flag, which is materially more useful for graded enforcement (e.g., warn at severity 2, block at severity 5). It also bundles Prompt Shields for jailbreak and indirect-injection detection and Groundedness Detection for hallucination checks. Pricing per the product page is roughly $0.75 per 1,000 text records on the Standard tier, with substantial enterprise discounts and Microsoft Defender for Cloud bundling. It supports 100-plus languages, 30-plus Azure regions, and SOC 2 / ISO 27001 / HIPAA compliance.
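Graded enforcement on top of severity scores can be sketched in a few lines. The example assumes the text analysis response carries a `categoriesAnalysis` list of `{"category", "severity"}` objects, and it reuses the warn-at-2, block-at-5 thresholds mentioned above. The function name `decide` and the sample payload are hypothetical.

```python
WARN_AT = 2   # thresholds taken from the example in the text
BLOCK_AT = 5


def decide(response, warn_at=WARN_AT, block_at=BLOCK_AT):
    """Map the worst per-category severity to allow, warn, or block."""
    analysis = response.get("categoriesAnalysis", [])
    worst = max((item["severity"] for item in analysis), default=0)
    if worst >= block_at:
        return "block"
    if worst >= warn_at:
        return "warn"
    return "allow"


# Hand-written sample in the shape of a text analysis response (not real output).
sample = {
    "categoriesAnalysis": [
        {"category": "Hate", "severity": 2},
        {"category": "SelfHarm", "severity": 0},
        {"category": "Sexual", "severity": 0},
        {"category": "Violence", "severity": 0},
    ]
}

print(decide(sample))
```

One caveat to verify against the API reference: the service can return a trimmed four-level scale (0, 2, 4, 6) unless you request all eight levels, in which case a block threshold of 5 only fires at severity 6.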
**Hugging Face roberta-hate-speech-dynabench-r4-target** at https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target is the Meta AI checkpoint trained on the Dynabench dataset across four rounds of adversarial annotation. The contribution is methodological as much as practical: each round of training data was generated by human annotators trying to fool the previous model, which produces a model substantially harder to fool with simple adversarial prompts. It is English-only and binary (hate vs not-hate); the "target" in the name refers to the model annotators were trying to fool, not to target-group prediction. It is best used as an evaluation tool or an extra layer on top of a broader-coverage classifier, not as your only line of defense.
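One way to use this checkpoint as an extra layer is to run it only on text the broader classifier lets through. The sketch below keeps the layering logic separate from model loading so it can be exercised with stubs. The function names, the 0.5 threshold, and the stub classifiers are hypothetical, and the `"hate"` / `"nothate"` label strings are an assumption to verify against the model's config before use.

```python
def layered_verdict(text, broad_check, hate_check, hate_threshold=0.5):
    """Flag text if either the broad classifier or the hate-speech layer fires."""
    if broad_check(text):
        return "flagged:broad"
    label, score = hate_check(text)
    if label == "hate" and score >= hate_threshold:
        return "flagged:hate-layer"
    return "pass"


def load_hate_check():
    """Wrap the Hugging Face checkpoint; downloads weights on first call."""
    from transformers import pipeline  # imported lazily: heavy dependency

    clf = pipeline(
        "text-classification",
        model="facebook/roberta-hate-speech-dynabench-r4-target",
    )

    def check(text):
        top = clf(text)[0]  # the pipeline returns a list of {"label", "score"}
        return top["label"], top["score"]

    return check


# Stub classifiers standing in for real models, so the logic runs offline.
def broad_stub(text):
    return "badword" in text


def hate_stub(text):
    return ("hate", 0.97) if "slur" in text else ("nothate", 0.99)


print(layered_verdict("hello there", broad_stub, hate_stub))
```

In a real deployment you would pass `load_hate_check()` in place of `hate_stub` and your primary moderation call in place of `broad_stub`.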