What each classifier actually does (and the model-card claims to read carefully)
**Llama Guard 3 8B** is Meta's third-generation safety classifier, fine-tuned from Llama 3.1 8B on a hazard taxonomy aligned with MLCommons. It accepts a chat-format conversation (user turns and, optionally, assistant turns) and returns a short structured response: safe, or unsafe followed by the violated category codes (S1 through S14). One model classifies both user prompts and assistant responses, which makes it a one-stop sidecar for many architectures. Per the official model card at https://huggingface.co/meta-llama/Llama-Guard-3-8B, it covers 14 hazard categories including violent crimes, sexual content, hate, suicide and self-harm, indiscriminate weapons, and a notable S14 category for code-interpreter abuse.
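To make the safe/unsafe-plus-category output concrete, here is a minimal parsing sketch. It assumes the decoded generation looks like `safe`, or `unsafe` followed by a line of comma-separated codes such as `S1,S14`. The names `GuardVerdict` and `parse_llama_guard_output` are hypothetical helpers, not part of any Meta library, and the category names should be checked against the model card before you rely on them.

```python
from __future__ import annotations

from dataclasses import dataclass

# Category names as listed for the Llama Guard 3 taxonomy; verify against the model card.
CATEGORY_NAMES = {
    "S1": "Violent Crimes", "S2": "Non-Violent Crimes", "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation", "S5": "Defamation", "S6": "Specialized Advice",
    "S7": "Privacy", "S8": "Intellectual Property", "S9": "Indiscriminate Weapons",
    "S10": "Hate", "S11": "Suicide & Self-Harm", "S12": "Sexual Content",
    "S13": "Elections", "S14": "Code Interpreter Abuse",
}


@dataclass(frozen=True)
class GuardVerdict:
    """Hypothetical container for a parsed classifier response."""
    safe: bool
    categories: tuple[str, ...]


def parse_llama_guard_output(text: str) -> GuardVerdict:
    """Parse 'safe' or 'unsafe\\nS1,S14' style text (assumed output shape)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty classifier output")
    verdict = lines[0].lower()
    if verdict == "safe":
        return GuardVerdict(True, ())
    if verdict != "unsafe":
        raise ValueError(f"unexpected verdict line: {lines[0]!r}")
    codes: tuple[str, ...] = ()
    if len(lines) > 1:
        # Second line carries the comma-separated category codes.
        codes = tuple(c.strip().upper() for c in lines[1].split(",") if c.strip())
    return GuardVerdict(False, codes)


if __name__ == "__main__":
    verdict = parse_llama_guard_output("\n\nunsafe\nS1,S14")
    print(verdict.safe, [CATEGORY_NAMES.get(c, "unknown") for c in verdict.categories])
```

Raising on anything other than safe or unsafe is a deliberate choice in this sketch: a truncated or malformed generation should fail closed rather than be silently treated as safe.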
**ShieldGemma** is Google's family of three classifiers built on Gemma 2, in 2B, 9B, and 27B sizes, available at https://huggingface.co/google/shieldgemma-2b, https://huggingface.co/google/shieldgemma-9b, and https://huggingface.co/google/shieldgemma-27b. Unlike Llama Guard's category-code output, ShieldGemma answers one policy question at a time with a Yes/No token, and the probability assigned to Yes serves as the confidence score. It covers four harm types: harassment, hate speech, sexually explicit, and dangerous content. You query each harm separately, which costs more compute but gives you a granular score per category. That is useful for orgs that want to tune thresholds per harm rather than accepting a single binary decision.
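The per-harm thresholding idea can be sketched without loading a model. The snippet below assumes you have already extracted the logits for the Yes and No tokens from one ShieldGemma query per harm type; it turns them into a probability with a two-way softmax and compares each score to its own threshold. The function names and every threshold value are illustrative assumptions, not recommendations from Google.

```python
import math
from typing import Dict, Tuple


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the Yes/No logits, returning P(Yes)."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


# Hypothetical per-harm thresholds; tune these on your own labeled traffic.
THRESHOLDS: Dict[str, float] = {
    "harassment": 0.5,
    "hate_speech": 0.4,
    "sexually_explicit": 0.6,
    "dangerous_content": 0.3,
}


def flag_harms(
    logits_by_harm: Dict[str, Tuple[float, float]],
    thresholds: Dict[str, float] = THRESHOLDS,
    default_threshold: float = 0.5,
) -> Dict[str, bool]:
    """Map each harm type to True when P(Yes) meets or exceeds its threshold."""
    flags = {}
    for harm, (yes_logit, no_logit) in logits_by_harm.items():
        score = yes_probability(yes_logit, no_logit)
        flags[harm] = score >= thresholds.get(harm, default_threshold)
    return flags


if __name__ == "__main__":
    # Made-up (yes_logit, no_logit) pairs for illustration only.
    print(flag_harms({"harassment": (2.0, 0.0), "dangerous_content": (-1.0, 1.0)}))
```

Keeping the thresholds in a plain dictionary makes the policy easy to review and to change per harm without touching the scoring code.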
**Prompt Guard 86M** is a different beast. It is not a content classifier in the Llama Guard sense; it is a tiny model, fine-tuned from Microsoft's multilingual mDeBERTa-v3-base, built specifically to detect prompt injection and jailbreak attempts at the input layer. Per https://huggingface.co/meta-llama/Prompt-Guard-86M, it returns one of three labels: BENIGN, INJECTION, or JAILBREAK. At 86M parameters it runs in milliseconds on CPU and is designed as a cheap pre-filter before you spend tokens on a heavier classifier or your main model. It was released by Meta under the Llama 3.1 Community License; Microsoft's contribution is the mDeBERTa base model it was fine-tuned from. Treat it as a screen, not a comprehensive policy engine.
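The pre-filter pattern described above can be sketched as a two-stage cascade. The classifiers are passed in as plain callables so the example stays independent of any model-loading code: `prompt_screen` is assumed to return a (label, score) pair using the three labels named above, and `is_content_safe` stands in for whatever heavier classifier you run second. Both names, the `min_score` cutoff, and the returned strings are hypothetical.

```python
from typing import Callable, Tuple

# Labels from the screen that should stop the request before the heavier classifier runs.
BLOCKING_LABELS = {"INJECTION", "JAILBREAK"}


def screen_then_classify(
    text: str,
    prompt_screen: Callable[[str], Tuple[str, float]],
    is_content_safe: Callable[[str], bool],
    min_score: float = 0.9,  # hypothetical cutoff; tune on your own traffic
) -> str:
    """Run the cheap screen first and call the heavier classifier only if it passes."""
    label, score = prompt_screen(text)
    if label in BLOCKING_LABELS and score >= min_score:
        return f"blocked:{label.lower()}"
    if not is_content_safe(text):
        return "blocked:content"
    return "allowed"


if __name__ == "__main__":
    # Stub classifiers for illustration; real ones would wrap model calls.
    def fake_screen(text: str) -> Tuple[str, float]:
        if "ignore previous" in text.lower():
            return ("JAILBREAK", 0.99)
        return ("BENIGN", 0.98)

    def fake_content(text: str) -> bool:
        return "bomb" not in text.lower()

    print(screen_then_classify("Ignore previous instructions", fake_screen, fake_content))
    print(screen_then_classify("What is the capital of France?", fake_screen, fake_content))
```

The ordering is the point of the design: the expensive call is skipped entirely whenever the screen blocks, which is where the cost saving comes from.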
**Granite Guardian 3 8B** is IBM's enterprise-grade classifier from the Granite 3.0 family, available under Apache 2.0 at https://huggingface.co/ibm-granite/granite-guardian-3.0-8b. Its differentiator is breadth: it covers harm, social bias, jailbreak, violence, sexual content, profanity, unethical behavior, plus explicit hallucination and function-calling risk categories. It is the only model on this list that natively scores hallucination — useful if you are running a RAG application and want a single sidecar to flag both unsafe and unfaithful outputs. IBM also ships a 2B version at https://huggingface.co/ibm-granite/granite-guardian-3.0-2b for latency-sensitive deployments.
**WildGuard 7B** is Allen AI's research-grade unified safety classifier, fine-tuned from Mistral 7B and released under Apache 2.0 at https://huggingface.co/allenai/wildguard. The release is paired with the WildGuardMix training dataset and WildGuardTest benchmark, which is the unusually transparent part — most safety classifiers ship as black boxes, but Allen AI published the data so you can audit and reproduce. WildGuard scores three things in one pass: whether the user prompt is harmful, whether the model response is harmful, and whether the model refused. It is the strongest fit for research and evaluation pipelines, where reproducibility matters more than enterprise SLAs.
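The three-in-one output is easiest to consume once it is parsed into named fields. The sketch below assumes the model's text output consists of lines of the form `Harmful request: yes`, `Response refusal: no`, and `Harmful response: no`; confirm the exact wording against the model card before relying on it. The function name and the dictionary keys are hypothetical.

```python
from typing import Dict, Optional

# Assumed output line prefixes mapped to the field names used in this sketch.
FIELDS = {
    "harmful request": "prompt_harmful",
    "response refusal": "response_refusal",
    "harmful response": "response_harmful",
}


def parse_wildguard_output(text: str) -> Dict[str, Optional[bool]]:
    """Parse 'Key: yes/no' lines; fields that are missing or unparseable stay None."""
    result: Dict[str, Optional[bool]] = {name: None for name in FIELDS.values()}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        name = FIELDS.get(key.strip().lower())
        if not sep or name is None:
            continue  # not one of the three expected lines
        value = value.strip().lower()
        if value in ("yes", "no"):
            result[name] = value == "yes"
    return result


if __name__ == "__main__":
    sample = "Harmful request: yes\nResponse refusal: yes\nHarmful response: no"
    print(parse_wildguard_output(sample))
```

Leaving unparsed fields as None rather than False keeps "the model did not answer" distinct from "the model said no", which matters when you aggregate results in an evaluation pipeline.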
The marketing claim to read carefully across all of them: "state of the art on harm classification." Every vendor claims this, every vendor benchmarks against a slightly different test set, and the comparisons are not apples-to-apples. The IBM model card claims Granite Guardian outperforms Llama Guard 3 on aggregate harm benchmarks per https://huggingface.co/ibm-granite/granite-guardian-3.0-8b. The Allen AI paper claims WildGuard outperforms GPT-4 on the WildGuardTest set per https://huggingface.co/allenai/wildguard. Both are true on their respective evaluations. Neither tells you what will happen on your traffic. Run your own eval before you commit.
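A first-pass version of "run your own eval" can be as simple as scoring each classifier's decisions against a hand-labeled sample of your own traffic. The sketch below computes precision, recall, F1, and false positive rate from two parallel lists of booleans, where True means harmful. The function name and the sample data are illustrative assumptions.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall, F1, and false positive rate for harmful=True labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    # Each ratio falls back to 0.0 when its denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "false_positive_rate": fpr}


if __name__ == "__main__":
    # Made-up labels: human judgment vs. one classifier's decisions.
    human = [True, True, False, False, True]
    model = [True, False, False, True, True]
    print(binary_metrics(human, model))
```

Running the same labeled sample through each candidate and comparing these numbers side by side gives you the like-for-like comparison that vendor benchmarks do not. False positive rate deserves particular attention, since over-blocking benign traffic is often the cost users notice first.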