By The DDH Team · Digital Dashboard Hub

LLM Toxicity Detection Tools Compared: Perspective API, Detoxify, OpenAI Moderation, AWS Comprehend, Azure Content Safety, HF roberta-hate-speech — Real F1, Real Bias, Real Trade-offs (2026)

Six tools, six theories of how to keep toxic text out of your LLM inputs and outputs. Perspective API is Jigsaw's free hosted classifier with a decade of Civil Comments lineage. Detoxify is the open-source PyTorch checkpoint everyone forks. OpenAI Moderation is the free guardrail bundled with the API. AWS Comprehend Toxicity is the managed AWS service. Azure AI Content Safety is Microsoft's enterprise guardrail. Hugging Face's roberta-hate-speech-dynabench is the academic adversarial benchmark model. Sources cited inline, June 2026.


Engineering leaders shipping LLM features in 2026 do not get to skip toxicity detection. EU AI Act high-risk obligations, the FTC's enforcement posture, and the simple operational reality that a single screenshot of your chatbot saying something racist will end the project all push toxicity classification from a nice-to-have to a required guardrail on both LLM input (user prompts) and output (model completions). The category has fractured into hosted SaaS APIs (Perspective, OpenAI, AWS, Azure), self-hostable open-source models (Detoxify, Hugging Face checkpoints), and bundled platform guardrails (Azure Content Safety, OpenAI Moderation). Pick wrong and you spend $40,000 a year on a managed service that misses code-switched slurs, or you self-host a checkpoint that flags Black English at twice the rate of standard American English. Before you commit to a stack, run the volume math through the AI content moderation cost calculator so the unit economics survive a real traffic spike.

**Perspective API** is the free Jigsaw-built classifier at https://perspectiveapi.com/ trained on Civil Comments — strong baseline, well-documented bias, no SLA. **Detoxify** is the OSS PyTorch family at https://github.com/unitaryai/detoxify, including original, unbiased, and multilingual checkpoints you self-host. **OpenAI Moderation** is the free omni-moderation-latest endpoint documented at https://platform.openai.com/docs/guides/moderation, covering 13 categories and 40+ languages. **AWS Comprehend Toxicity** at https://aws.amazon.com/comprehend/ is the pay-per-request managed service with seven categories. **Azure AI Content Safety** at https://azure.microsoft.com/products/ai-services/ai-content-safety is Microsoft's enterprise guardrail with severity-graded outputs and prompt-shield jailbreak detection. **Hugging Face roberta-hate-speech-dynabench** at https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target is the Meta AI adversarially-trained model that beats most off-the-shelf checkpoints on Dynabench. All prices and capabilities below are sourced from vendor pages as of June 2026.

The rest of this guide breaks down what each tool actually classifies, where each one fails, and which combination to ship to production. You will get a six-column decision matrix, a bias deep-dive citing the actual fairness papers, a build-vs-buy section, a staged 90-day implementation timeline, and the FAQs your security review will demand. We also compare these against jailbreak-specific defenses in LLM jailbreak prevention in 2026 and against the broader fairness tooling landscape in AI bias evaluation tools.


Perspective, Detoxify, OpenAI, AWS, Azure, HF roberta-hate-speech — capability + pricing overview, June 2026

| Feature | Perspective API | Detoxify (OSS) | OpenAI Moderation | AWS Comprehend Toxicity | Azure Content Safety | HF roberta-hate-speech |
| --- | --- | --- | --- | --- | --- | --- |
| OSS vs SaaS | SaaS (free, hosted by Jigsaw) | OSS (Apache 2.0, self-host) | SaaS (bundled free with OpenAI API) | SaaS (AWS managed) | SaaS (Azure managed) | OSS (MIT-style checkpoint on HF) |
| Pricing (June 2026) | Free with rate limit (default 1 QPS, higher on request) per https://perspectiveapi.com/ | Free; pay only for your own GPU/CPU compute | Free for OpenAI API customers per https://platform.openai.com/docs/guides/moderation | ~$0.0001 per 100 chars per https://aws.amazon.com/comprehend/pricing/ | ~$0.75 per 1,000 text records (Standard tier) per https://azure.microsoft.com/products/ai-services/ai-content-safety/ | Free; pay only for your own GPU/CPU compute |
| Categories detected | 6 production attributes (TOXICITY, SEVERE_TOXICITY, IDENTITY_ATTACK, INSULT, PROFANITY, THREAT) + experimental | 3 model variants (original 7-class, unbiased 6-class, multilingual 7-class) | 13 categories incl. harassment, hate, self-harm, sexual/minors, violence, illicit | 7 categories (HATE_SPEECH, GRAPHIC, HARASSMENT_OR_ABUSE, SEXUAL, VIOLENCE_OR_THREAT, INSULT, PROFANITY) | 4 harm categories with 0-7 severity (Hate, Sexual, Violence, Self-Harm) + Prompt Shield + Groundedness | 1 binary label (hate vs not-hate, plus target group) |
| Multilingual coverage | 17+ languages incl. EN, ES, FR, DE, PT, IT, RU, JA, KO, NL, PL, SV, ZH per https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages | Multilingual checkpoint covers 7 languages (EN, FR, ES, IT, PT, TR, RU) per https://github.com/unitaryai/detoxify | 40+ languages on omni-moderation-latest per https://platform.openai.com/docs/guides/moderation | English-only as of June 2026 per https://docs.aws.amazon.com/comprehend/latest/dg/toxicity-detection.html | 100+ languages on text moderation per https://learn.microsoft.com/azure/ai-services/content-safety/ | English-only (trained on Dynabench R1-R4 English) |
| Typical latency (single short text) | ~150-400 ms p50 from US (hosted in GCP, network-bound) | ~5-40 ms p50 on GPU, ~80-300 ms on CPU (self-hosted, depends on hardware) | ~100-250 ms p50 per OpenAI status reports | ~80-200 ms p50 in same AWS region | ~80-200 ms p50 in same Azure region | ~10-60 ms p50 on GPU, ~200-500 ms on CPU |
| Benchmark F1 vs Civil Comments / Toxic-Spans | ~0.79 AUROC on Civil Comments per original Jigsaw paper at https://arxiv.org/abs/1903.04561 | Detoxify-unbiased reports 0.93 AUC on Civil Comments validation per https://github.com/unitaryai/detoxify | Per OpenAI's omni-moderation card at https://platform.openai.com/docs/guides/moderation OpenAI reports lower false-positive rates vs prior text-moderation-latest; no Civil Comments leaderboard number published | Not benchmarked on Civil Comments by AWS; performance varies by category per https://docs.aws.amazon.com/comprehend/latest/dg/toxicity-detection.html | Azure publishes per-severity precision/recall per category at https://learn.microsoft.com/azure/ai-services/content-safety/concepts/harm-categories but not a Civil Comments number | ~0.81 macro-F1 on Dynabench R4 per the model card; designed for adversarial hate, not Civil Comments-style toxicity |
| Context-aware (multi-turn) | No (single-utterance classification) | No (single-utterance classification) | Partial (omni-moderation accepts image inputs, single text per call) | No (single-document classification) | Partial (supports text + image, multi-turn chat moderation via Azure OpenAI Service) | No (single-utterance classification) |
| Bias on protected groups | Documented identity bias on Civil Comments per https://arxiv.org/abs/1903.04561; flags mentions of Black, gay, Muslim identities more often | Detoxify-unbiased explicitly trained to reduce identity bias per https://github.com/unitaryai/detoxify; outperforms original on identity subgroups | OpenAI publishes evaluation on group disparities in moderation model system card at https://openai.com/safety/ but does not publish per-identity F1 | No public per-identity fairness disclosure as of June 2026 | Azure publishes Responsible AI documentation at https://learn.microsoft.com/azure/ai-services/content-safety/concepts/response-codes but no per-identity F1 numbers | Adversarially trained on Dynabench specifically to reduce identity bias per https://arxiv.org/abs/2012.15761 |
| Production fit | Prototyping, research, low-volume sites; rate limit makes prod hard | Self-hosted production where data residency or air-gap matters | Default guardrail for any team already on OpenAI API | AWS-native shops wanting managed billing + IAM | Enterprise + regulated (Microsoft 365, healthcare, government cloud) | Research, adversarial red-team, evaluation harnesses |
| Self-hostable | No | Yes (PyTorch / HF Transformers) | No | No | Partial (Azure Container Apps for Studio, not the core API) | Yes (HF Transformers) |
| Data residency options | US only (Google Cloud) | Wherever you deploy | US default; EU residency via OpenAI Enterprise per https://openai.com/enterprise-privacy/ | All AWS regions per https://aws.amazon.com/about-aws/global-infrastructure/ | 30+ Azure regions incl. EU, UK, Australia, Canada per https://azure.microsoft.com/explore/global-infrastructure/ | Wherever you deploy |
| Best fit | Free baseline, research, MVPs | Self-hosted production with engineering capacity | Anyone shipping on OpenAI who wants a free first-pass guardrail | AWS-first stacks with consolidated billing | Microsoft-stack enterprises and regulated industries | Adversarial hate-speech evaluation and red-team benchmarks |

Sources as of June 2026 — verify at vendor pages: https://perspectiveapi.com/, https://github.com/unitaryai/detoxify, https://platform.openai.com/docs/guides/moderation, https://aws.amazon.com/comprehend/pricing/, https://azure.microsoft.com/products/ai-services/ai-content-safety/, https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target. Pricing and category lists change frequently — confirm in writing before any procurement decision.

What each tool actually classifies (and the marketing language you should ignore)

**Perspective API** is a single-utterance classifier that returns a probability between 0 and 1 for each of six production attributes (TOXICITY, SEVERE_TOXICITY, IDENTITY_ATTACK, INSULT, PROFANITY, THREAT), plus a handful of experimental attributes. It is free at the default 1 QPS rate limit, with higher limits granted on request per https://perspectiveapi.com/. It was trained primarily on Civil Comments, a dataset of public comments collected through the Civil Comments plugin that independent news sites used for their comment sections, which means it works well on Anglophone news-comment-style toxicity and is genuinely weaker on slang, code-switched language, and reclaimed slurs. Treat the probability as a score, not a verdict: picking the right threshold for your traffic is the entire job.
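
To make the "score, not a verdict" point concrete, here is a minimal sketch of the request body and response parsing, assuming the AnalyzeComment request and response shapes from the Perspective documentation. The helper names (`build_analyze_request`, `extract_scores`, `flagged`), the threshold values, and the sample response are illustrative assumptions, not part of the API.

```python
import json

# Endpoint per the Perspective docs; the API key goes in a ?key=... query parameter.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

PRODUCTION_ATTRIBUTES = (
    "TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK",
    "INSULT", "PROFANITY", "THREAT",
)

def build_analyze_request(text, attributes=PRODUCTION_ATTRIBUTES, language="en"):
    """Build the JSON body for one AnalyzeComment call (one comment per request)."""
    return {
        "comment": {"text": text},
        "languages": [language],
        "requestedAttributes": {name: {} for name in attributes},
    }

def extract_scores(response):
    """Flatten a response into {attribute: probability}."""
    return {
        name: body["summaryScore"]["value"]
        for name, body in response.get("attributeScores", {}).items()
    }

def flagged(scores, thresholds):
    """Return the attributes whose score meets or exceeds its own threshold."""
    return sorted(a for a, s in scores.items() if s >= thresholds.get(a, 1.0))

# Made-up response in the documented shape, for illustration only.
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.91, "type": "PROBABILITY"}},
        "INSULT": {"summaryScore": {"value": 0.42, "type": "PROBABILITY"}},
    }
}

scores = extract_scores(sample_response)
print(json.dumps(build_analyze_request("example text", ["TOXICITY"])))
print(flagged(scores, {"TOXICITY": 0.8, "INSULT": 0.8}))
```

Keeping the thresholds in a per-attribute mapping, rather than hard-coding one cutoff, is what lets you tune them against your own traffic later.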

**Detoxify** is the open-source family of PyTorch checkpoints maintained by Unitary at https://github.com/unitaryai/detoxify. It ships three models: original (7 classes, trained on the Jigsaw 2018 challenge), unbiased (6 classes, trained on Jigsaw 2019 unintended bias), and multilingual (7 classes across seven languages from Jigsaw 2020). The unbiased variant is the one most teams want — it explicitly penalizes false positives on identity subgroups. You self-host it, which means you also own latency, scaling, GPU costs, and patch upgrades. Per the repo, the unbiased model reports roughly 0.93 AUC on Civil Comments validation, which is strong, though AUC overstates real-world precision at deployment thresholds.

**OpenAI Moderation** is the free omni-moderation-latest endpoint documented at https://platform.openai.com/docs/guides/moderation. It returns boolean flags plus continuous scores for 13 categories: harassment, harassment/threatening, hate, hate/threatening, illicit, illicit/violent, self-harm, self-harm/intent, self-harm/instructions, sexual, sexual/minors, violence, and violence/graphic. Crucially it is multilingual across 40-plus languages and accepts image inputs, which most competitors do not. It is free for anyone using the OpenAI API and is the default first-pass guardrail for most teams shipping on GPT-4o, GPT-4.1, or GPT-5. The honest caveat: OpenAI does not publish a Civil Comments leaderboard number, so you cannot make an apples-to-apples comparison with Perspective or Detoxify without running your own eval.

**AWS Comprehend Toxicity** at https://aws.amazon.com/comprehend/ is the pay-per-request managed service that returns scores for seven categories — HATE_SPEECH, GRAPHIC, HARASSMENT_OR_ABUSE, SEXUAL, VIOLENCE_OR_THREAT, INSULT, and PROFANITY — plus an overall toxicity score. Pricing per https://aws.amazon.com/comprehend/pricing/ is approximately $0.0001 per 100 characters as of June 2026, which sounds small until you do the math on a million daily messages of average 200 characters and land at roughly $6,000 per month. The single biggest limitation is that as of June 2026 — verify at the docs page above — it is English-only. If you serve a non-English market, you are using something else, full stop.

**Azure AI Content Safety** at https://azure.microsoft.com/products/ai-services/ai-content-safety is Microsoft's enterprise guardrail platform. The text moderation API returns a 0-to-7 severity score across four harm categories — Hate, Sexual, Violence, Self-Harm — rather than a binary flag, which is materially more useful for graded enforcement (e.g., warn at severity 2, block at severity 5). It also bundles Prompt Shield for jailbreak/indirect-injection detection and Groundedness Detection for hallucination checks. Pricing per the product page is roughly $0.75 per 1,000 text records on the Standard tier, with substantial enterprise discounts and Microsoft Defender for Cloud bundling. It supports 100-plus languages, 30-plus Azure regions, and SOC 2 / ISO 27001 / HIPAA compliance.
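
The graded-enforcement idea (warn at severity 2, block at severity 5) can be sketched as a small mapping. This assumes the text analysis response carries a `categoriesAnalysis` list of `{"category", "severity"}` items; check that shape against the current Azure docs. The function name `action_for` and the sample data are hypothetical.

```python
def action_for(categories_analysis, warn_at=2, block_at=5):
    """Map the worst per-category severity to an enforcement action."""
    worst = max((item["severity"] for item in categories_analysis), default=0)
    if worst >= block_at:
        return "block"
    if worst >= warn_at:
        return "warn"
    return "allow"

# Made-up analysis result in the assumed response shape.
sample = [
    {"category": "Hate", "severity": 2},
    {"category": "Sexual", "severity": 0},
    {"category": "Violence", "severity": 0},
    {"category": "SelfHarm", "severity": 0},
]
print(action_for(sample))  # warn
```

Driving the decision from the worst category is one policy among several; per-category thresholds are a natural next step if, say, self-harm should escalate earlier than profanity-adjacent hate scores.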

**Hugging Face roberta-hate-speech-dynabench-r4-target** at https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target is the Meta AI checkpoint trained on the Dynabench dataset across four rounds of adversarial annotation. The contribution is methodological as much as practical: each round of training data was generated by human annotators trying to fool the previous model, which produces a model substantially harder to fool with simple adversarial prompts. It is English-only, single-label (hate vs not-hate, with target group prediction), and best used as an evaluation tool or an extra layer on top of a broader-coverage classifier — not as your only line of defense.


Architecture: how the toxicity layer plugs into your LLM stack

The canonical pattern in 2026 is dual-pass moderation: classify the user prompt before it reaches the model, then classify the model completion before it reaches the user. **OpenAI Moderation** makes this trivial because it is the same vendor and the same API key — one extra call per direction, free, low-latency. **Azure AI Content Safety** plugs into Azure OpenAI Service natively, with the moderation call wired into the same deployment via the AOAI content filter configuration documented at https://learn.microsoft.com/azure/ai-services/openai/concepts/content-filter. Both are the path of least resistance for teams already on those platforms.

If you are using a third-party LLM provider — say, Anthropic Claude or Mistral Large — you wire **Perspective**, **Detoxify**, **AWS Comprehend Toxicity**, or **Azure Content Safety** in as middleware around your LLM gateway. Most teams put it in their proxy or API gateway layer (Kong, Apigee, AWS API Gateway, or a custom Express/FastAPI middleware) so every model in the catalog gets the same guardrail without code changes per model. This decouples the moderation choice from the LLM choice, which matters when you are A/B-testing model providers monthly.
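
A minimal sketch of that provider-agnostic middleware, assuming the classifier and the LLM are both injected as plain callables. The names `moderated_completion`, `fake_llm`, and `fake_score`, the 0.85 threshold, and the refusal text are all illustrative stand-ins, not any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModerationResult:
    text: str
    blocked: bool
    stage: str  # "input", "output", or "none"

def moderated_completion(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str], float],
    threshold: float = 0.85,
    refusal: str = "Sorry, I can't help with that.",
) -> ModerationResult:
    """Dual-pass moderation: score the prompt, then score the completion."""
    if score(prompt) >= threshold:
        # Blocked before the model is ever called.
        return ModerationResult(refusal, True, "input")
    completion = generate(prompt)
    if score(completion) >= threshold:
        return ModerationResult(refusal, True, "output")
    return ModerationResult(completion, False, "none")

# Stand-ins so the sketch runs without any network calls.
def fake_llm(prompt: str) -> str:
    return f"echo: {prompt}"

def fake_score(text: str) -> float:
    return 0.95 if "badword" in text else 0.05

print(moderated_completion("hello", fake_llm, fake_score))
print(moderated_completion("badword", fake_llm, fake_score))
```

Because `generate` and `score` are parameters, swapping the model provider or the classifier does not touch the gating logic.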

**Self-hosting Detoxify or the HF roberta-hate-speech checkpoint** is the right call when data residency, air-gap, or regulatory constraints require it — defense, healthcare, EU public sector. The operational cost is real: a single T4 GPU on AWS g4dn.xlarge runs about $380 per month on-demand per https://aws.amazon.com/ec2/instance-types/g4/, and you need at least two for redundancy. Latency is the upside — a Detoxify inference on GPU lands at 5-40 ms versus 150-400 ms for Perspective, which matters when your end-to-end LLM response budget is already 2-4 seconds.

Batching and caching matter more than people expect. Perspective API at the default 1 QPS rate limit will not survive a real launch, and each request scores a single comment, so you queue requests behind a client-side rate limiter and use a Redis cache keyed on a SHA-256 of the text to dedupe repeat content. **AWS Comprehend Toxicity** accepts up to 10 text segments per DetectToxicContent call per https://docs.aws.amazon.com/comprehend/latest/dg/toxicity-detection.html, so a streaming workload where 100-utterance batches arrive every few seconds needs ten calls per batch.
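
The caching and batching pattern can be sketched as follows. A plain dict stands in for Redis, `score_batch` is a hypothetical callable wrapping whichever classifier you use, and the batch size of 10 mirrors the per-call limit cited above for AWS Comprehend.

```python
import hashlib

MAX_SEGMENTS_PER_CALL = 10  # per-call limit cited above for AWS Comprehend

def cache_key(text):
    """Stable cache key: SHA-256 of the UTF-8 text."""
    return "tox:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def score_with_cache(texts, score_batch, cache):
    """Score texts, calling score_batch only for uncached, deduplicated inputs."""
    keys = [cache_key(t) for t in texts]
    pending = {}  # key -> text, insertion-ordered, one entry per unique text
    for key, text in zip(keys, texts):
        if key not in cache and key not in pending:
            pending[key] = text
    items = list(pending.items())
    for start in range(0, len(items), MAX_SEGMENTS_PER_CALL):
        chunk = items[start:start + MAX_SEGMENTS_PER_CALL]
        scores = score_batch([text for _, text in chunk])
        for (key, _), value in zip(chunk, scores):
            cache[key] = value
    return [cache[key] for key in keys]

# Stand-in classifier that records how many texts each call received.
call_sizes = []

def fake_score_batch(batch):
    call_sizes.append(len(batch))
    return [0.5 for _ in batch]

demo_cache = {}
demo_texts = [f"msg {i}" for i in range(25)] + ["msg 0"]  # one duplicate
demo_scores = score_with_cache(demo_texts, fake_score_batch, demo_cache)
print(len(demo_scores), call_sizes)
```

In production the dict would be replaced by a Redis client with a TTL on each key, so that a threshold or model change eventually invalidates stale scores.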

For LLM output streaming, you have a choice: classify the full completion at the end (simpler, but adds 100-400 ms before the finished response can be shown) or classify chunks as they stream (harder, requires tokenizer-aware chunking, but no perceived latency hit). **Azure AI Content Safety** supports streaming-friendly chunk classification per the AOAI content filter streaming mode docs at https://learn.microsoft.com/azure/ai-services/openai/how-to/content-streaming. Most teams start with end-of-completion classification and move to streaming only when user complaints about latency outweigh complaints about late-arriving moderation flags.
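
As a rough sketch of the chunk-level option, the generator below re-scores a rolling window of recent text as each chunk arrives and cuts the stream on the first flag. It is a simplification: it uses a character window rather than tokenizer-aware chunking, and `moderate_stream`, the window size, and the stand-in scorer are assumptions for illustration.

```python
WITHHELD = "[response withheld by moderation]"

def moderate_stream(chunks, score, threshold=0.85, window_chars=200):
    """Yield chunks while the rolling window scores below threshold."""
    window = ""
    for chunk in chunks:
        # Keep only the most recent characters so each scoring call stays cheap.
        window = (window + chunk)[-window_chars:]
        if score(window) >= threshold:
            yield WITHHELD
            return
        yield chunk

def fake_score(text):
    return 0.95 if "badword" in text else 0.05

print(list(moderate_stream(["Hello ", "there ", "badword ", "more"], fake_score)))
```

Note the trade-off the text describes: chunks already yielded cannot be recalled, so a flag that only becomes visible late in the stream arrives after the user has seen the earlier text.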

The architectural anti-pattern to avoid: trusting any single classifier as a hard gate without a human override path. Even the best of these models reports 5-15 percent false positive rates at production thresholds, and false negatives on adversarial content can be much higher. Build the moderation layer as a probabilistic score that drives a graded response — warn, soft-block, hard-block, escalate to human — not a single yes/no. The cost of being wrong is too asymmetric to put one model in charge.


Benchmark deep-dive: F1, AUC, and what the published numbers actually mean

Civil Comments is the canonical benchmark and the most-cited dataset in this space. The original Jigsaw paper at https://arxiv.org/abs/1903.04561 reports 0.79 AUROC for the production Perspective TOXICITY model on the Civil Comments test set. **Detoxify-unbiased** reports 0.93 AUC on the Civil Comments validation set per the Unitary repo at https://github.com/unitaryai/detoxify. These are not directly comparable — different splits, different thresholds, different score aggregation — but the rough takeaway is that a fine-tuned RoBERTa on the same training data outperforms the production Perspective endpoint, which has not been retrained on every recent Civil Comments iteration.

Toxic-Spans (SemEval 2021 Task 5) is the other major benchmark and tests span-level rather than utterance-level prediction. **Detoxify** and the academic HateBERT family do well here; **Perspective** and the major SaaS APIs do not natively expose span-level outputs, so they are typically benchmarked by mapping their utterance score onto every token, which understates their precision. If your use case requires highlighting the toxic span (e.g., for redaction or user feedback), this benchmark matters and the OSS checkpoints are the practical answer.

**Dynabench R1-R4** at https://dynabench.org/tasks/hs is the adversarial benchmark that birthed the HF roberta-hate-speech checkpoint. The model reports approximately 0.81 macro-F1 on the R4 test set per the model card. The interesting result, documented in the Vidgen et al. paper at https://arxiv.org/abs/2012.15761, is that the model trained on adversarial data also generalizes better to non-adversarial benchmarks — i.e., adversarial training is not just defense against red-team prompts, it is a general regularization technique that produces better classifiers.

**OpenAI** does not publish a Civil Comments number for omni-moderation-latest. The omni-moderation announcement at https://openai.com/index/upgrading-the-moderation-api-with-our-new-multimodal-moderation-model/ reports lower false-positive rates and broader multilingual support versus the older text-moderation-latest, with internal benchmarks on multilingual evaluation sets. The honest read: you cannot apples-to-apples compare OpenAI Moderation to Perspective or Detoxify on an academic leaderboard. You have to run your own evaluation on a labeled sample of your traffic.

**AWS Comprehend Toxicity** publishes no academic benchmark numbers as of June 2026 — verify at https://docs.aws.amazon.com/comprehend/latest/dg/toxicity-detection.html. **Azure AI Content Safety** publishes per-category severity precision/recall in its harm categories documentation at https://learn.microsoft.com/azure/ai-services/content-safety/concepts/harm-categories but again, no Civil Comments line. For both, the gap is reproducibility — academic benchmarks are reproducible by third parties, vendor-published numbers are not.

The practical takeaway on benchmarks: do not buy or self-host based on published F1 numbers. Build a 500-to-2,000 example labeled evaluation set from your own traffic — your prompts, your model's completions, your users' language patterns — and run every candidate classifier through it. Measure precision and recall at the threshold you would actually deploy. Half the time the cheapest free option wins; the other half the cost differential pays for itself in reduced false positive complaints. The benchmark that matters is your traffic, not Civil Comments.
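
Measuring precision and recall at your deployment threshold takes only a few lines. This is a minimal sketch; the function name and the toy scores and labels are made up for illustration and stand in for your own labeled eval set.

```python
def precision_recall_f1(scores, labels, threshold):
    """Precision, recall, and F1 when everything scoring >= threshold is flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy eval set: classifier scores and human labels (1 = toxic).
toy_scores = [0.9, 0.8, 0.4, 0.2, 0.7, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]

for threshold in (0.5, 0.85):
    p, r, f1 = precision_recall_f1(toy_scores, toy_labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Running the same function over each candidate classifier's scores, at the same thresholds, gives you the comparison table that published benchmark numbers cannot.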


Bias on protected groups: what each tool publishes and what it hides

The single biggest known issue with toxicity classifiers is identity bias: models flag mentions of Black, gay, Muslim, Jewish, or trans identities as toxic at higher rates than mentions of majority-group identities, even when the text is neutral or supportive. The seminal Jigsaw paper at https://arxiv.org/abs/1903.04561 documented this on Perspective specifically — sentences like 'I am a gay woman' would score higher on TOXICITY than comparable neutral mentions of other identities. That paper is the reason **Detoxify-unbiased** exists.

**Detoxify-unbiased** at https://github.com/unitaryai/detoxify was trained on the Jigsaw 2019 Unintended Bias dataset, which explicitly penalizes high false-positive rates on identity-mention subgroups. The repo publishes per-subgroup AUC for Black, white, Asian, Latino, Christian, Jewish, Muslim, female, male, LGBTQ, and other identity groups. The numbers are good but not perfect — typically 0.85-0.92 AUC across subgroups, with the lowest scores on Black and LGBTQ subgroups. If fairness is a board-level concern, this is the most transparent option on the list.

**Perspective** has published updated bias evaluations in the years since the 2019 paper and has retrained the underlying model. The current bias posture is documented at https://developers.perspectiveapi.com/s/about-the-api-best-practices-risks — Jigsaw acknowledges the bias and provides guidance on threshold tuning per identity, but they do not publish per-subgroup F1 for the production endpoint. The honest read: Perspective is much better than it was in 2019, but you cannot verify exactly how much better without running your own evaluation.

**OpenAI's omni-moderation model card** at https://platform.openai.com/docs/guides/moderation and the broader safety documentation at https://openai.com/safety/ acknowledge fairness evaluation as part of their model release process. They do not publish per-identity F1 numbers, which is a transparency gap relative to Detoxify. For regulated contexts where you need to demonstrate fairness to an auditor (EU AI Act, NYC bias audit law, EEOC), this is a non-trivial gap — you may need to run and document your own fairness evaluation on top of OpenAI Moderation regardless.

**AWS** publishes no per-identity fairness disclosure for Comprehend Toxicity as of June 2026 — verify at https://docs.aws.amazon.com/comprehend/latest/dg/toxicity-detection.html. **Azure AI Content Safety** publishes Responsible AI documentation and a transparency note at https://learn.microsoft.com/legal/cognitive-services/content-safety/transparency-note, which is more disclosure than AWS, but still does not include per-identity precision/recall numbers. For both, you should bake an in-house fairness audit into your acceptance criteria before going live.
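
An in-house fairness audit can start with per-subgroup false-positive rates on non-toxic text. The sketch below assumes each eval row carries a classifier score, a human toxicity label, and a subgroup tag; the function name, group names, and rows are invented for illustration.

```python
from collections import defaultdict

def false_positive_rate_by_group(rows, threshold):
    """rows: iterable of (score, is_toxic, group). Returns {group: FPR}."""
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for score, is_toxic, group in rows:
        if is_toxic:
            continue  # FPR only looks at text humans labeled non-toxic
        negatives[group] += 1
        if score >= threshold:
            false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in negatives.items()}

# Made-up rows: (score, is_toxic, group).
toy_rows = [
    (0.9, False, "identity_mention"), (0.6, False, "identity_mention"),
    (0.2, False, "identity_mention"), (0.1, False, "identity_mention"),
    (0.7, False, "no_identity_mention"), (0.2, False, "no_identity_mention"),
    (0.1, False, "no_identity_mention"), (0.3, False, "no_identity_mention"),
    (0.95, True, "identity_mention"), (0.9, True, "no_identity_mention"),
]
print(false_positive_rate_by_group(toy_rows, threshold=0.5))
```

A gap between subgroups at your deployment threshold, like the one in this toy data, is exactly the kind of finding to document for an auditor before going live.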

**HF roberta-hate-speech-dynabench-r4-target** was explicitly trained on Dynabench to reduce identity bias through adversarial annotation per https://arxiv.org/abs/2012.15761. The Dynabench process invites annotators to write prompts that fool the model, including identity-mention edge cases, which produces a model that is harder to trick with reclaimed slurs, in-group humor, or counter-speech. It is single-label and English-only, so it is not a complete moderation solution, but as a fairness-focused second opinion in an ensemble it is the strongest published option.


Pricing and operational cost: what you will actually pay at scale

**Perspective API** is free at the default 1 QPS rate limit per https://perspectiveapi.com/, with higher limits granted on request through their partner program. The catch is that the rate limit is real — at 1 QPS you can moderate 86,400 utterances per day, which is fine for a research project and far short of what a real consumer product generates. Most production teams either request higher quota (free but with a manual review and a public-interest tilt) or use Perspective only for low-volume use cases like content takedown review. Pricing risk is zero; throughput risk is high.

**Detoxify** and **HF roberta-hate-speech** are free as software, paid as infrastructure. A reasonable production deployment of Detoxify-unbiased is two AWS g4dn.xlarge GPU instances on-demand at roughly $0.526 per hour each per https://aws.amazon.com/ec2/instance-types/g4/ — about $760 per month for the pair, plus load balancer and observability. That covers tens to hundreds of QPS depending on batch size and text length. The operational tradeoff is real engineering ownership: patches, scaling, model upgrades, fairness audits, and monitoring all live on your team.

**OpenAI Moderation** is free for OpenAI API customers per https://platform.openai.com/docs/guides/moderation. There is no per-call charge for the moderation endpoint, and rate limits are generous (typically aligned with your tier's overall rate limit). For teams already on OpenAI, the marginal cost of dual-pass moderation on every input and output is zero — which is the single strongest commercial argument for using it as the default first-pass guardrail regardless of your downstream LLM choice.

**AWS Comprehend Toxicity** prices at roughly $0.0001 per 100 characters per https://aws.amazon.com/comprehend/pricing/. The math gets real at scale: 1 million utterances per day at average 200 characters each is 200 million characters, or roughly $200 per day, or $6,000 per month. A 10-million-utterance-per-day workload — common for a mid-sized consumer product moderating both prompts and completions — is $60,000 per month. The same workload on self-hosted Detoxify is two GPU instances and an engineer's attention.

**Azure AI Content Safety** prices at roughly $0.75 per 1,000 text records on the Standard tier per https://azure.microsoft.com/products/ai-services/ai-content-safety/. At 1 million utterances per day, that is $750 per day or $22,500 per month at list — with substantial discounts available via Microsoft enterprise agreements and Defender for Cloud bundling. The price is higher than AWS per-call but the platform includes Prompt Shield, Groundedness Detection, and image moderation in the same SKU, which is genuinely useful if you need those.

The honest cost-per-decision math at 1 million utterances per day: OpenAI Moderation = $0 (assuming OpenAI API tier covers it), Perspective API = $0 (assuming you get rate-limit relief), self-hosted Detoxify = ~$1,000/month in infrastructure before engineering time, AWS Comprehend Toxicity = ~$6,000/month, Azure AI Content Safety = ~$15,000-22,500/month (the top of the range is list price; enterprise discounts bring it down). The most expensive options are not the most accurate; they are the most enterprise-shaped (SLAs, compliance, support contracts). Pay for what you actually need, not for what looks defensible in a procurement deck.
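
The arithmetic in this section is easy to reproduce and rerun for your own volumes. This sketch uses only the rates quoted above; the function names, the 30-day month, the 720-hour month, and the assumption that one utterance bills as one Azure text record are mine.

```python
def aws_monthly(utterances_per_day, avg_chars, price_per_100_chars=0.0001, days=30):
    """AWS Comprehend Toxicity: billed per 100 characters (rate quoted above)."""
    return utterances_per_day * avg_chars / 100 * price_per_100_chars * days

def azure_monthly(utterances_per_day, price_per_1000_records=0.75, days=30):
    """Azure AI Content Safety: assumes one utterance bills as one text record."""
    return utterances_per_day / 1000 * price_per_1000_records * days

def gpu_monthly(instances=2, hourly_rate=0.526, hours=720):
    """Self-hosting: on-demand g4dn.xlarge instances, compute only."""
    return instances * hourly_rate * hours

volume = 1_000_000  # utterances per day
print(f"AWS:   ${aws_monthly(volume, avg_chars=200):,.0f}/month")
print(f"Azure: ${azure_monthly(volume):,.0f}/month")
print(f"GPUs:  ${gpu_monthly():,.0f}/month (compute only)")
```

The GPU figure covers compute only; load balancing, observability, and engineering time sit on top of it.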


Build vs. buy: when to use the model's native moderation

The first build-vs-buy question in 2026 is whether you can skip a dedicated moderation tool entirely and just use the LLM's native safety training. The honest answer: no, not for production. Frontier models from OpenAI, Anthropic, and Google all refuse most overtly toxic requests, but they are inconsistent on borderline content, can be jailbroken into non-refusal, and produce subtly harmful content (stereotypes, microaggressions, biased framing) that a refusal classifier does not catch. The model's native safety is your last line of defense, not your only one.

**Use OpenAI Moderation** if you are already on the OpenAI API. It is free, low-latency, multilingual, and trained by the same lab that trained the model you are calling. There is no reason not to wire it into both the input and output paths on day one. The relevant integration docs are at https://platform.openai.com/docs/guides/moderation. The honest limitation: it is a first-pass guardrail, not a fairness audit, and the per-identity numbers are not published.

**Use Azure AI Content Safety** if you are on Azure OpenAI Service. The content filter is wired into the AOAI deployment by default, with adjustable severity thresholds per harm category, and Prompt Shield catches a meaningful chunk of jailbreaks that pure toxicity classifiers miss. For Microsoft-stack enterprises this is the path of least resistance and the strongest enterprise-compliance posture on the list — SOC 2, ISO 27001, HIPAA, FedRAMP High for government cloud.

**Build on Detoxify** when data residency is a hard requirement, when you need span-level outputs the SaaS APIs do not expose, or when your traffic profile (10M+ moderation calls per day) makes the per-call SaaS pricing untenable. The all-in cost of a self-hosted Detoxify deployment with two GPUs, observability, and a part-time engineer is roughly $3,000-5,000 per month — break-even versus AWS Comprehend Toxicity somewhere around 500K daily utterances, versus Azure around 200K.
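
The break-even points follow from dividing the self-hosting cost by the per-utterance SaaS price. A sketch under stated assumptions: 200-character utterances, a 30-day month, one Azure text record per utterance, and a hypothetical helper name.

```python
def break_even_daily_utterances(self_host_monthly, saas_cost_per_utterance, days=30):
    """Daily volume at which SaaS spend equals the self-hosting budget."""
    return self_host_monthly / (saas_cost_per_utterance * days)

AWS_PER_UTTERANCE = 200 / 100 * 0.0001  # 200 chars at $0.0001 per 100 chars
AZURE_PER_UTTERANCE = 0.75 / 1000       # assumes one text record per utterance

print(round(break_even_daily_utterances(3000, AWS_PER_UTTERANCE)))
print(round(break_even_daily_utterances(4500, AZURE_PER_UTTERANCE)))
```

Under these assumptions, the 500K figure matches the $3,000 low end of the self-hosting range against AWS, and the 200K figure against Azure corresponds to a self-hosting budget of roughly $4,500.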

**Add HF roberta-hate-speech-dynabench-r4-target** as a second-opinion ensemble model when fairness is a board-level concern or when you are red-teaming for an adversarial threat model (election integrity, harassment campaigns, brigading). It is single-label and English-only, so it does not replace your primary classifier — but a two-of-three voting ensemble using Detoxify-unbiased, OpenAI Moderation, and Dynabench-roberta is the strongest practical guardrail you can build today.
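
A two-of-three vote is simple to express once each classifier's output is reduced to a score. In this sketch the classifier keys, thresholds, and scores are illustrative; each model would need its own threshold tuned on your eval set, since their score scales are not comparable.

```python
def ensemble_flag(scores, thresholds, min_votes=2):
    """Flag when at least min_votes classifiers meet their own threshold."""
    votes = sum(1 for name, s in scores.items() if s >= thresholds[name])
    return votes >= min_votes

# Hypothetical per-model thresholds.
thresholds = {
    "detoxify_unbiased": 0.80,
    "openai_moderation": 0.70,
    "dynabench_roberta": 0.90,
}

two_agree = {"detoxify_unbiased": 0.91, "openai_moderation": 0.75, "dynabench_roberta": 0.40}
one_alone = {"detoxify_unbiased": 0.91, "openai_moderation": 0.10, "dynabench_roberta": 0.40}

print(ensemble_flag(two_agree, thresholds))  # True
print(ensemble_flag(one_alone, thresholds))  # False
```

Requiring agreement trades some recall for fewer single-model false positives, which is the point of adding a second opinion.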

**Skip Perspective API for production** unless your volume is low enough to live within the free tier rate limit and you genuinely want the Jigsaw research baseline. The free SaaS offering is excellent for prototyping, academic research, and low-volume forum moderation. Anyone shipping a consumer product at meaningful scale will outgrow it within a month. If you need a free, hosted, no-engineering-required guardrail at production scale, the answer in 2026 is OpenAI Moderation, not Perspective.


Implementation timeline: what the first 30, 60, 90 days look like

Days 1-7 is the eval set. Build a labeled evaluation dataset of 500-2,000 utterances drawn from your actual traffic (half prompts, half completions), labeled for toxicity, hate, harassment, sexual, violence, and self-harm by at least two human raters with documented inter-annotator agreement. This eval set is the single most important artifact in the whole project: it is what you use to compare classifiers, set thresholds, and prove safety to your leadership and regulators. Do not skip it because it is tedious, and do not let a vendor pitch deck stand in for it.
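
One common way to document agreement between two raters is Cohen's kappa, which corrects raw agreement for chance. The text does not prescribe a metric, so treat the choice as an assumption; the labels below are made up.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy labels from two raters (1 = toxic, 0 = not toxic).
rater_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.3f}")
```

Here the raters agree on 8 of 10 items, but because chance agreement is already 0.52, kappa comes out near 0.58. Computing it per label category shows which definitions your raters disagree on.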

Days 7-21 is candidate evaluation. Run **OpenAI Moderation**, **Perspective API**, and at least one self-hosted candidate (**Detoxify-unbiased** or **roberta-hate-speech-dynabench-r4-target**) on the eval set. Compute precision, recall, F1, and per-identity false-positive rates at three threshold settings (conservative, balanced, aggressive). If you are a regulated buyer, also run **AWS Comprehend Toxicity** or **Azure AI Content Safety** depending on your cloud. Document the results in a single table and share it with security, legal, and product before picking a winner.

Days 21-45 is integration. Wire the chosen classifier(s) into your LLM gateway as middleware, with dual-pass moderation on input and output. Add Redis caching keyed on SHA-256 of the input text to dedupe repeat content. Add structured logging of every moderation decision (with scores, threshold, and final action) into your observability stack so you can audit later. For Azure or OpenAI, this is mostly configuration work; for self-hosted Detoxify, this is real backend engineering. Budget 2-4 weeks of one senior engineer either way.
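As a rough sketch of the caching step, the class below keys moderation results on the SHA-256 of the input text. It uses an in-memory dict as a stand-in so the example runs anywhere; a real deployment would swap the dict for a Redis client and set an expiry. The `ModerationCache` class, the `moderation:` key prefix, and `fake_classify` are all hypothetical names chosen for illustration.

```python
import hashlib
import json


class ModerationCache:
    """Dict-backed stand-in for a Redis cache keyed on the SHA-256 of the text."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"moderation:{digest}"

    def get_or_compute(self, text, classify):
        """Return cached scores for identical text, else classify and store."""
        key = self.key_for(text)
        cached = self._store.get(key)
        if cached is not None:
            self.hits += 1
            return json.loads(cached)
        self.misses += 1
        scores = classify(text)
        # Stored as a JSON string, which is also what a Redis value would hold.
        self._store[key] = json.dumps(scores)
        return scores


def fake_classify(text):
    """Hypothetical classifier returning per-category scores."""
    return {"toxicity": 0.9 if "hate" in text else 0.05}


cache = ModerationCache()
first = cache.get_or_compute("hello there", fake_classify)
second = cache.get_or_compute("hello there", fake_classify)
print(first == second, cache.hits, cache.misses)
```

If you change classifiers or model versions, including the version in the key is one way to avoid serving stale scores; that detail is a design suggestion rather than something the vendors require.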

Days 45-60 is the graded enforcement design. A toxicity score is not a decision — your product team needs to design the graded response. Typical schema: scores below 0.3 pass silently; 0.3-0.6 log and continue; 0.6-0.85 soft-warn the user and rephrase the model's output; above 0.85 hard-block and offer an appeal path. Tune the thresholds with product, legal, and trust & safety in the room. The decision is policy, not engineering — get the right people approving it in writing.
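The typical schema above maps directly onto a small function. This sketch uses the score bands from the text; how the exact boundary values (0.3, 0.6, 0.85) are assigned is an assumption here, and the action names are hypothetical labels your policy document would define.

```python
def graded_action(score: float) -> str:
    """Map a toxicity score in [0, 1] to a graded enforcement action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < 0.3:
        return "pass"        # pass silently
    if score < 0.6:
        return "log"         # log and continue
    if score <= 0.85:
        return "soft_warn"   # warn the user and rephrase the output
    return "hard_block"      # block and offer an appeal path


for s in (0.1, 0.45, 0.7, 0.9):
    print(s, graded_action(s))
```

Keeping the thresholds in one function (or one config file) makes the written policy and the running code easy to compare during an audit.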

Days 60-75 is the shadow-mode launch. Run the moderation layer in production but log decisions without enforcing them for the first week. Compare the production decisions against random samples reviewed by your trust & safety team. Tune thresholds. This is where you discover that your aggressive setting flags 8 percent of legitimate traffic as toxic and you need to back off — better to find that in shadow mode than after launch.

Days 75-90 is the staged rollout. Enable enforcement for 1 percent of traffic on day 75, 10 percent by day 80, 50 percent by day 85, 100 percent by day 90. Watch user complaints, false-positive appeal rates, and the moderation cost line in your cloud bill. Have a kill switch that disables enforcement (but keeps logging) within 60 seconds — you will need it the first time the classifier misbehaves on a viral piece of content. Document the runbook for the on-call engineer who will inherit this system.
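One way to implement the staged rollout and the kill switch is deterministic bucketing: hash each user ID into a stable bucket from 0 to 99 and enforce only for buckets below the current percentage. The sketch below takes the day-to-percent schedule from the paragraph above; treating it as a step function, the function names, and the use of a user ID as the bucketing key are assumptions for illustration.

```python
import hashlib

# (first day, percent of traffic with enforcement enabled), from the timeline.
ROLLOUT_SCHEDULE = [(75, 1), (80, 10), (85, 50), (90, 100)]


def percent_for_day(day: int) -> int:
    """Return the enforcement percentage in effect on a given project day."""
    percent = 0
    for start_day, pct in ROLLOUT_SCHEDULE:
        if day >= start_day:
            percent = pct
    return percent


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_enforce(user_id: str, rollout_percent: int, kill_switch_on: bool) -> bool:
    """Enforce only for users inside the rollout, and never when the kill switch is on."""
    if kill_switch_on:
        return False  # logging continues elsewhere; only enforcement stops
    return rollout_bucket(user_id) < rollout_percent


print(percent_for_day(82))
print(should_enforce("user-123", percent_for_day(90), kill_switch_on=False))
print(should_enforce("user-123", percent_for_day(90), kill_switch_on=True))
```

Because the bucket is derived from the ID rather than drawn at random per request, a user who is inside the rollout at 10 percent stays inside it at 50 percent, which keeps the experience consistent as the percentage grows.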


The opinionated 2026 pick: what I would ship

If I were shipping a consumer LLM product on the OpenAI API tomorrow with a real audience, I would ship **OpenAI Moderation** as the always-on first-pass guardrail (free, multilingual, low-latency) plus **Detoxify-unbiased** self-hosted as a second-opinion classifier on flagged content. The dual-vendor design hedges against any single classifier's blind spot, and the cost is roughly $1,000 per month in GPU plus engineering time for the self-hosted layer. Verify the OpenAI side at https://platform.openai.com/docs/guides/moderation.

If I were on Azure OpenAI Service in a regulated industry — financial services, healthcare, government — I would ship **Azure AI Content Safety** with severity thresholds tuned by harm category, plus **Prompt Shield** for jailbreak detection, plus **Groundedness Detection** for hallucination checks on customer-facing completions. The price is higher than the OSS path but the compliance posture (SOC 2, ISO 27001, HIPAA, FedRAMP) and the bundled Prompt Shield are worth it for regulated buyers. Verify at https://azure.microsoft.com/products/ai-services/ai-content-safety/.

If I were data-residency-constrained — EU public sector, defense, or any team that cannot send user text to a US-hosted API — I would self-host **Detoxify-unbiased** plus **roberta-hate-speech-dynabench-r4-target** as an ensemble behind my own inference service. Two GPU instances, a Redis cache, structured logging, and a part-time engineer. Verify the models at https://github.com/unitaryai/detoxify and https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target.

If I were AWS-native with consolidated billing requirements, I would ship **AWS Comprehend Toxicity** for English traffic and self-host **Detoxify multilingual** for non-English. The split is uncomfortable but reflects the genuine gap: AWS Comprehend Toxicity does not support non-English as of June 2026 per https://docs.aws.amazon.com/comprehend/latest/dg/toxicity-detection.html. Validate with AWS account team that the gap has not closed before procurement.


If I were doing pure research, a small forum, or a low-traffic MVP, I would ship **Perspective API** as the free baseline and not pay for anything. The 1 QPS default rate limit will not survive a viral moment, but for genuine MVPs the answer is to ship now and reconsider when usage justifies the engineering. The Jigsaw research lineage and the published bias literature make Perspective the most studied option on the list.

The one thing I would not do in 2026 is rely on a single vendor for both input and output moderation in a high-stakes context. Even the best classifier has identity bias, multilingual gaps, and an adversarial brittleness that becomes the story when something goes wrong. Two classifiers in an ensemble, with a graded enforcement schema and a human appeal path, is the floor for any production deployment with real users.

How to pick and implement an LLM toxicity detection stack for your team

    Step 1: Build a labeled evaluation set from your own traffic

    Before you evaluate any vendor, sample 500-2,000 utterances from your actual prompts and completions. Have at least two human raters label each one across the harm taxonomy you care about (toxicity, hate, harassment, sexual, violence, self-harm), and compute inter-annotator agreement (Cohen's kappa should be at least 0.6 to call the labels reliable). This eval set is the only honest way to compare Perspective, Detoxify, OpenAI Moderation, AWS Comprehend Toxicity, Azure Content Safety, and the HF Dynabench model — vendor-published F1 numbers are on different datasets at different thresholds and are not comparable. Budget 2-3 weeks of trust & safety analyst time for this step, and reuse the eval set every quarter to detect classifier drift.

    Step 2: Run all candidates against the eval set and document results

    Score every candidate classifier on your eval set at three threshold settings (conservative, balanced, aggressive). Compute precision, recall, F1, AUROC, and — critically — per-identity false-positive rates for the protected groups your product cares about. The per-identity number is where you discover that Perspective flags neutral mentions of Black identity at 2x the rate of neutral mentions of white identity, or that AWS Comprehend Toxicity has a meaningful bias against Muslim mentions. Document the results in a single comparison table and review with legal, security, and product before picking a winner. If you cannot defend the choice with data, you did not finish the evaluation.

    Step 3: Design the graded enforcement policy with product and legal

    A classifier output is a probability, not a verdict. Design a graded response schema that maps score ranges to user-facing actions: 0.0-0.3 pass silently, 0.3-0.6 log only, 0.6-0.85 soft-warn and offer rephrasing, 0.85+ hard-block with an appeal path. Get the thresholds and the appeal mechanism approved in writing by product, legal, and trust & safety. The single biggest mistake teams make is treating moderation as a yes/no engineering decision when it is a graded policy decision. Document the policy in a single page that ships with the system; it will be the first thing your regulator asks for under the EU AI Act.

    Step 4: Integrate as middleware, not as a feature flag

    Wire the chosen classifier(s) into your LLM gateway as middleware, applied uniformly to every model deployment. Dual-pass moderation (input + output) is the floor; add a Redis cache keyed on SHA-256 of the text to dedupe repeat content. Add structured logging of every moderation decision (input hash, scores per category, threshold, final action, model version) to your observability stack with at least 90-day retention. Build a kill switch that disables enforcement (but preserves logging) within 60 seconds; you will need it the first time the classifier misbehaves on a viral piece of content. This is real backend engineering: budget 2-4 weeks of senior engineering time even for the simplest vendor (OpenAI Moderation).

    Step 5: Shadow-launch, then stage the rollout with a kill switch

    Run the moderation layer in production with logging but no enforcement for at least one week. Compare the would-be decisions against random samples reviewed by your trust & safety team. Tune thresholds. Then stage the rollout: 1 percent of traffic on day one, 10 percent by week one, 50 percent by week two, 100 percent by week three. Monitor three things: user complaints, false-positive appeal rates, and the moderation cost line in your cloud bill. Have a documented runbook for the on-call engineer covering kill-switch activation, threshold rollback, and incident escalation. The first time the classifier flags a viral piece of legitimate content, the runbook is what gets you to 'fixed in 15 minutes' instead of 'trending on X for six hours.'
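Step 1 above sets a bar of Cohen's kappa of at least 0.6 between raters. As a minimal sketch, kappa can be computed from two raters' labels with the standard library alone; the label lists below are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for ten utterances: 1 = toxic, 0 = not toxic.
rater_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
rater_b = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}, reliable = {kappa >= 0.6}")
```

In this toy case the raters agree on 8 of 10 items, yet kappa lands just under 0.6 because much of that agreement is expected by chance; that is the reason to report kappa rather than raw agreement.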

Frequently Asked Questions

Is Perspective API still the best free toxicity classifier in 2026?

For research and prototyping, yes — Perspective API at https://perspectiveapi.com/ remains the most-cited free baseline and the dataset it was trained on (Civil Comments) is still the canonical academic benchmark. For production at any meaningful scale, no — the default 1 QPS rate limit is too restrictive and the higher quotas require partner approval. OpenAI Moderation has overtaken it as the default free production guardrail because it has no rate limit beyond the OpenAI API tier, ships multilingual support across 40-plus languages, and accepts image inputs. Use Perspective when you want the research lineage; use OpenAI Moderation when you want to ship.

How much does AWS Comprehend Toxicity actually cost at 1 million calls per day?

Per https://aws.amazon.com/comprehend/pricing/, AWS Comprehend Toxicity prices at roughly $0.0001 per 100 characters as of June 2026 (verify at the pricing page before procurement). At 1 million utterances per day averaging 200 characters each, that is 200 million characters per day, or 2 million 100-character units: about $200 per day, or $6,000 per month. At 10 million utterances per day (which is realistic if you moderate both LLM input and output on a mid-sized consumer product), you are at $60,000 per month. The same workload on self-hosted Detoxify is roughly $1,000-2,000 per month all-in. The AWS premium buys you managed billing, IAM integration, and an AWS SLA, not better accuracy.
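The arithmetic above is easy to rerun with your own traffic numbers. This sketch assumes the per-unit price and 100-character unit quoted in the answer and a 30-day month; it ignores any per-request minimum charge or unit rounding the pricing page may specify, so treat the result as a lower bound to check against the vendor's calculator.

```python
def monthly_cost_usd(utterances_per_day, avg_chars,
                     price_per_unit=0.0001, unit_chars=100, days=30):
    """Estimated monthly cost for per-character-unit pricing (no minimums applied)."""
    units_per_day = utterances_per_day * avg_chars / unit_chars
    return units_per_day * price_per_unit * days


for volume in (1_000_000, 10_000_000):
    cost = monthly_cost_usd(volume, avg_chars=200)
    print(f"{volume:,} utterances/day -> ${cost:,.0f}/month")
```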

Can I self-host any of these tools for data residency or regulated industry compliance?

Yes — Detoxify at https://github.com/unitaryai/detoxify and the HF roberta-hate-speech-dynabench checkpoint at https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target are both Apache/MIT-licensed PyTorch models you can run on your own infrastructure, including air-gapped environments. Perspective, OpenAI Moderation, AWS Comprehend Toxicity, and Azure AI Content Safety are SaaS-only. Azure Content Safety offers a partial self-host option via Azure Container Apps for the Studio interface, but the core moderation API itself is not self-hostable. For EU public sector, defense, or healthcare buyers with strict data residency requirements, the practical answer is self-hosted Detoxify with HF Dynabench as a second-opinion model.

How biased are these classifiers on protected identity groups, and which one is most transparent about it?

All of them have measurable identity bias — flagging mentions of Black, gay, Muslim, Jewish, or trans identities as toxic at higher rates than majority-group identity mentions, even in neutral or supportive sentences. The seminal Jigsaw paper documenting this is at https://arxiv.org/abs/1903.04561. Detoxify-unbiased at https://github.com/unitaryai/detoxify is the most transparent: it publishes per-subgroup AUC for 11+ identity groups. roberta-hate-speech-dynabench-r4-target at https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target was explicitly adversarially trained to reduce identity bias per https://arxiv.org/abs/2012.15761. Perspective, OpenAI, AWS, and Azure all acknowledge fairness as a goal but do not publish per-identity F1, which is a transparency gap for regulated buyers.

What is the difference between OpenAI Moderation and Azure AI Content Safety if I am on Azure OpenAI Service?

**OpenAI Moderation** is the free omni-moderation-latest endpoint accessible from any OpenAI API key, returning boolean flags plus scores for 13 categories per https://platform.openai.com/docs/guides/moderation. **Azure AI Content Safety** is Microsoft's enterprise guardrail platform that returns 0-7 severity scores across four harm categories, bundles Prompt Shield for jailbreak detection and Groundedness Detection for hallucination checks, and integrates natively with Azure OpenAI Service via the AOAI content filter at https://learn.microsoft.com/azure/ai-services/openai/concepts/content-filter. If you are on Azure OpenAI Service in a regulated industry, use Azure Content Safety — the bundled Prompt Shield alone is worth the price. If you are calling OpenAI directly, use OpenAI Moderation.

Do I need to moderate both the user input and the model output, or just one?

Both. The standard 2026 pattern is dual-pass moderation: classify the user prompt before it reaches the model (catches jailbreak attempts, abusive prompts, prohibited request categories) and classify the model completion before it reaches the user (catches hallucinated toxic content, jailbroken outputs that slipped past the prompt filter, and the model going off the rails on edge cases). Single-pass on prompts only misses output toxicity; single-pass on outputs only misses upstream abuse signals you want for trust & safety analytics. Both passes share infrastructure, both run the same classifier, and both add roughly 100-400 ms to end-to-end latency at typical SaaS API speeds. Budget the latency, do both.
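A dual-pass wrapper can be sketched as a single function that checks the prompt, calls the model, then checks the completion. The `generate` and `is_toxic` callables, the return shape, and the refusal text are hypothetical placeholders; in practice `is_toxic` would wrap whichever classifier and threshold you chose.

```python
from typing import Callable, Optional

REFUSAL = "Sorry, I can't help with that."  # hypothetical user-facing message


def moderated_completion(prompt: str,
                         generate: Callable[[str], str],
                         is_toxic: Callable[[str], bool]) -> dict:
    """Run input moderation, then generation, then output moderation."""
    if is_toxic(prompt):
        # Blocked before the model is called, so no generation cost is incurred.
        return {"blocked_at": "input", "text": REFUSAL}
    completion = generate(prompt)
    if is_toxic(completion):
        return {"blocked_at": "output", "text": REFUSAL}
    blocked: Optional[str] = None
    return {"blocked_at": blocked, "text": completion}


# Stand-ins for illustration only.
def stub_generate(prompt: str) -> str:
    return "Here is a summary." if "summary" in prompt else "You are worthless."


def stub_is_toxic(text: str) -> bool:
    return "worthless" in text.lower()


print(moderated_completion("Give me a summary", stub_generate, stub_is_toxic))
print(moderated_completion("Insult me", stub_generate, stub_is_toxic))
print(moderated_completion("You are worthless, bot", stub_generate, stub_is_toxic))
```

Recording which pass blocked the request gives trust & safety the upstream abuse signal mentioned above without a second logging path.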

How accurate are these classifiers on non-English content?

Accuracy drops meaningfully outside English for every option on this list. **OpenAI Moderation** omni-moderation-latest officially covers 40-plus languages per https://platform.openai.com/docs/guides/moderation and is the strongest multilingual option in the hosted SaaS category. **Azure AI Content Safety** covers 100-plus languages per https://learn.microsoft.com/azure/ai-services/content-safety/ but with varying accuracy by language. **Perspective API** covers 17+ languages per https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages. **AWS Comprehend Toxicity** is English-only as of June 2026. **Detoxify multilingual** covers seven languages (EN, FR, ES, IT, PT, TR, RU). For Japanese, Korean, or Chinese deployments, validate accuracy on your own eval set — vendor marketing claims and real-world performance diverge significantly outside the top 5 European languages.

How do I handle false positives without burying my trust & safety team in appeals?

Design a graded enforcement schema instead of a binary block: scores 0.0-0.3 pass silently, 0.3-0.6 log only, 0.6-0.85 soft-warn the user with optional rephrasing, 0.85+ hard-block with a one-click appeal path. Route appeals to a small trust & safety queue with a 24-hour SLA, and use the appeal outcomes to retune thresholds quarterly. Most real-world toxicity classifiers run a 5-15 percent false-positive rate at aggressive thresholds, so binary-blocking on every flagged item will swamp your team. The graded approach reduces user friction, reduces appeal volume, and gives you cleaner data for threshold tuning. Bake this into the product design from day one, not as an afterthought.

What is the cheapest credible toxicity detection stack for an LLM product in 2026?

If you are on the OpenAI API, the cheapest credible stack is **OpenAI Moderation** (free) for first-pass dual-side moderation, plus a quarterly evaluation against a labeled in-house eval set to catch drift. Total cost: $0 in API fees, plus 2-3 days per quarter of analyst time. If you are not on OpenAI or you want a second opinion, add self-hosted **Detoxify-unbiased** on a single GPU (~$400/month on AWS g4dn.xlarge per https://aws.amazon.com/ec2/instance-types/g4/) as an ensemble check on flagged content. Below this level, you are gambling on a single classifier's blind spots. Above this level, you are paying for enterprise compliance posture, not better accuracy.
