Skip to contentNew: Does ChatGPT recommend your brand? Free 60-second AI visibility check →
By The DDH Team · Digital Dashboard Hub

Llama Guard 3 vs ShieldGemma vs Prompt Guard vs Granite Guardian vs WildGuard: The Open-Source Safety Classifier Showdown (2026)

Six open-source safety classifiers, six different theories of how to keep an LLM from saying something stupid. Meta's Llama Guard 3 is the de facto baseline. Google's ShieldGemma offers three sizes and per-category confidence scores. Microsoft's Prompt Guard 86M is the tiny jailbreak sniffer. IBM's Granite Guardian 3 ships under Apache 2.0 with the most permissive license. Allen AI's WildGuard is the research-grade unified classifier. Sources cited inline, June 2026.

By DDH Research Team at Digital Dashboard HubUpdated

If you are shipping an LLM-powered product in 2026, you are not asking whether to add a safety sidecar — you are asking which open-source classifier to deploy and how much GPU memory it will eat. The category has settled into a few real options: Meta's Llama Guard 3 family (1B and 8B), Google's ShieldGemma trio (2B, 9B, 27B), Microsoft's tiny Prompt Guard 86M, IBM's Granite Guardian 3 (2B and 8B), and Allen AI's research-grade WildGuard 7B. Pick wrong and you either burn an A100 to classify every chat turn, miss the jailbreak that ends up in a screenshot on X, or sign yourself up for a license review your legal team will hate. Before you commit to a sidecar architecture, run your expected inference volume through the vector DB cost per 1M embeddings calculator so the GPU math survives contact with the budget.

**Meta Llama Guard 3 8B** is the category-defining content classifier — fine-tuned on Llama 3, MLCommons hazard taxonomy, model card at https://huggingface.co/meta-llama/Llama-Guard-3-8B. **Google ShieldGemma 9B** is the per-category scoring alternative built on Gemma 2, model card at https://huggingface.co/google/shieldgemma-9b. **Microsoft Prompt Guard 86M** is the lightweight DeBERTa-based jailbreak and prompt-injection detector at https://huggingface.co/microsoft/Prompt-Guard-86M. **IBM Granite Guardian 3 8B** is the enterprise-friendly Apache 2.0 classifier with hallucination detection at https://huggingface.co/ibm-granite/granite-guardian-3.0-8b. **Allen AI WildGuard 7B** is the research-grade unified safety model at https://huggingface.co/allenai/wildguard. All model sizes, licenses, and capability claims in this guide come from the official Hugging Face model cards as of June 2026.

The rest of this comparison breaks down what each classifier actually does, what license you are signing up for, how latency scales on an A100, and which one to deploy for which use case. You will get an opinionated decision matrix, a five-step implementation plan, and answers to the questions your platform and security teams will ask. We also map this against managed providers in AI guardrails platforms compared and the practical defense work in prompt injection defense 2026.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card.

Llama Guard 3 8B, ShieldGemma 2B, ShieldGemma 9B, Prompt Guard 86M, Granite Guardian 8B, WildGuard — capability and license overview, June 2026

Feature
Llama Guard 3 8B
ShieldGemma 2B
ShieldGemma 9B
Prompt Guard 86M
Granite Guardian 8B
WildGuard
Model size (parameters)8B (Llama 3.1 base)2B (Gemma 2 base)9B (Gemma 2 base)86M (DeBERTa-v3 base)8B (Granite 3.0 base)7B (Mistral 7B base)
LicenseLlama 3.1 Community LicenseGemma Terms of UseGemma Terms of UseApache 2.0Apache 2.0Apache 2.0
Hazard categories covered14 categories on MLCommons hazard taxonomy (S1-S14)4 harm types: harassment, hate, sexual, dangerous content4 harm types: harassment, hate, sexual, dangerous content3 labels: BENIGN, INJECTION, JAILBREAKHarm, social bias, jailbreak, violence, sexual, profanity, unethical, hallucination, function-call riskUnified harm taxonomy with prompt-harm and response-harm scoring
Language coverage8 languages: EN, FR, DE, HI, IT, PT, ES, THPrimarily English (Gemma 2 multilingual base)Primarily English (Gemma 2 multilingual base)Multilingual (DeBERTa multilingual)English primary; multilingual baseEnglish only (per WildGuard model card)
Jailbreak detectionIndirect — flags S14 'Code Interpreter Abuse' categoryNot a primary use caseNot a primary use caseYes — dedicated JAILBREAK labelYes — explicit jailbreak categoryYes — prompt-harm class includes adversarial prompts
Prompt-injection detectionIndirect — not a labeled categoryNot a primary use caseNot a primary use caseYes — dedicated INJECTION label, primary use caseLimited — covered under broader 'unethical' bucketLimited — covered under prompt-harm class
Approx latency on A100 (per request)~80-150 ms (8B, FP16)~20-40 ms (2B, FP16)~90-160 ms (9B, FP16)~3-8 ms (86M, FP16)~80-150 ms (8B, FP16)~70-130 ms (7B, FP16)
Fine-tunableYes — LoRA + full fine-tune; recipe publishedYes — standard Gemma fine-tuning toolingYes — standard Gemma fine-tuning toolingYes — small enough for CPU fine-tuneYes — full Apache 2.0, no use restrictionsYes — released with training data for reproducibility
Reported accuracy / benchmarkF1 ~0.92 on internal Meta hazard test set per https://huggingface.co/meta-llama/Llama-Guard-3-8BOptimal AU-PRC vs LlamaGuard1 baseline per https://huggingface.co/google/shieldgemma-2bHighest AU-PRC in family per https://huggingface.co/google/shieldgemma-9b92%+ on jailbreak detection per https://huggingface.co/microsoft/Prompt-Guard-86MOutperforms Llama Guard 3 on harm benchmark per https://huggingface.co/ibm-granite/granite-guardian-3.0-8bOutperforms GPT-4 on adversarial WildGuardTest per https://huggingface.co/allenai/wildguard
Quantized GGUF availableYes (community quants on HF)Yes (community quants on HF)Yes (community quants on HF)N/A — already tinyYes (IBM ships official GGUF)Yes (community quants on HF)
Built-in hallucination detectionNoNoNoNoYes — explicit faithfulness categoryNo
Best fitTeams already on Llama 3 stack who want MLCommons-aligned categoriesLatency-sensitive consumer chat with budget for 2B sidecarQuality-first production deployments on Gemma stackPre-classifier for jailbreak/injection at edge before main moderationEnterprise teams needing Apache 2.0 + hallucination scoring in one modelResearch, evaluation pipelines, and reproducibility-focused safety teams

Sources as of June 2026 — verify model cards before deployment: https://huggingface.co/meta-llama/Llama-Guard-3-8B, https://huggingface.co/google/shieldgemma-2b, https://huggingface.co/google/shieldgemma-9b, https://huggingface.co/microsoft/Prompt-Guard-86M, https://huggingface.co/ibm-granite/granite-guardian-3.0-8b, https://huggingface.co/allenai/wildguard. Open-weights licenses, hazard taxonomies, and reported benchmarks change with every model release — confirm in writing before any production rollout. Latency figures are typical A100 single-request inference at FP16; real-world numbers vary by serving stack.

What each classifier actually does (and the model-card claims to read carefully)

**Llama Guard 3 8B** is Meta's third-generation safety classifier, fine-tuned from Llama 3.1 8B on a hazard taxonomy aligned with MLCommons. It accepts a chat-format prompt — system, user, and optionally assistant turns — and returns a structured response indicating safe or unsafe plus the violated category code (S1 through S14). It is designed to classify both user prompts and assistant responses in the same model, which makes it a one-stop sidecar for many architectures. Per the official model card at https://huggingface.co/meta-llama/Llama-Guard-3-8B, it covers 14 hazard categories including violent crimes, sexual content, hate speech, suicide and self-harm, weapons, and a notable S14 category for code-interpreter abuse.

**ShieldGemma** is Google's family of three classifiers built on Gemma 2 — the 2B, 9B, and 27B sizes available at https://huggingface.co/google/shieldgemma-2b, https://huggingface.co/google/shieldgemma-9b, and https://huggingface.co/google/shieldgemma-27b. Unlike Llama Guard's category-code output, ShieldGemma emits a Yes/No token with a confidence probability for each of four harm types: harassment, hate speech, sexually explicit, and dangerous content. You query each harm separately, which is more compute but gives you a granular score per category — useful for orgs that want to tune thresholds per harm rather than accepting a single binary decision.

**Prompt Guard 86M** is a different beast. It is not a content classifier in the LlamaGuard sense — it is a tiny DeBERTa-v3 model fine-tuned specifically to detect prompt injection and jailbreak attempts at the input layer. Per https://huggingface.co/microsoft/Prompt-Guard-86M, it returns one of three labels: BENIGN, INJECTION, or JAILBREAK. At 86M parameters it runs in milliseconds on CPU and is designed as a cheap pre-filter before you spend tokens on a heavier classifier or your main model. Originally released by Meta under the Llama 3 Community License; the Microsoft-hosted copy on Hugging Face mirrors the same weights. Treat it as a screen, not a comprehensive policy engine.

**Granite Guardian 3 8B** is IBM's enterprise-grade classifier from the Granite 3.0 family, available under Apache 2.0 at https://huggingface.co/ibm-granite/granite-guardian-3.0-8b. Its differentiator is breadth: it covers harm, social bias, jailbreak, violence, sexual content, profanity, unethical behavior, plus explicit hallucination and function-calling risk categories. It is the only model on this list that natively scores hallucination — useful if you are running a RAG application and want a single sidecar to flag both unsafe and unfaithful outputs. IBM also ships a 2B version at https://huggingface.co/ibm-granite/granite-guardian-3.0-2b for latency-sensitive deployments.

**WildGuard 7B** is Allen AI's research-grade unified safety classifier, fine-tuned from Mistral 7B and released under Apache 2.0 at https://huggingface.co/allenai/wildguard. The release is paired with the WildGuardMix training dataset and WildGuardTest benchmark, which is the unusually transparent part — most safety classifiers ship as black boxes, but Allen AI published the data so you can audit and reproduce. WildGuard scores three things in one pass: whether the user prompt is harmful, whether the model response is harmful, and whether the model refused. It is the strongest fit for research and evaluation pipelines, where reproducibility matters more than enterprise SLAs.

The marketing claim to read carefully across all six: 'state of the art on harm classification.' Every vendor claims this, every vendor benchmarks against a slightly different test set, and the comparisons are not directly apples-to-apples. The IBM model card claims Granite Guardian outperforms Llama Guard 3 on aggregate harm benchmarks per https://huggingface.co/ibm-granite/granite-guardian-3.0-8b. The Allen AI paper claims WildGuard outperforms GPT-4 on the WildGuardTest set per https://huggingface.co/allenai/wildguard. Both are true on their respective evaluations. Neither tells you what will happen on your traffic. Run your own eval before you commit.


License and commercial-use reality: what you are actually signing

Open-weights is not the same as Apache 2.0. **Llama Guard 3 8B** is released under the Llama 3.1 Community License, which is permissive for most commercial use but contains two clauses worth reading carefully: the 700M monthly-active-users restriction (you need a separate license from Meta if you exceed it) and the attribution requirement (you must include 'Built with Llama' notice in derivative products). Most startups will never hit the MAU threshold, but if you are inside a hyperscaler or a top-10 social platform, your legal team needs to read the full text at https://huggingface.co/meta-llama/Llama-Guard-3-8B/blob/main/LICENSE before deployment.

**ShieldGemma** is released under the Gemma Terms of Use at https://ai.google.dev/gemma/terms. This license is more restrictive than Apache 2.0 — it includes a prohibited-use policy at https://ai.google.dev/gemma/prohibited_use_policy that covers a long list of restricted applications. Critically, Google reserves the right to update the prohibited-use policy, and your continued use of the model implies acceptance of updated terms. For most teams this is fine. For teams operating in regulated industries or building products that might brush against ambiguous categories (security research, red-teaming, content moderation tooling itself), the prohibited-use policy is worth a real legal review.

**Prompt Guard 86M** sits in a slightly odd place. Meta originally released it under the Llama 3 Community License, and the Hugging Face mirror at https://huggingface.co/microsoft/Prompt-Guard-86M reflects that origin. Despite being hosted under the microsoft namespace, the license inherits from Meta. The 700M-MAU restriction technically applies, though the practical risk is low because the model is small and the use case (input filtering) is narrow. Still — verify the exact license on the model card at the moment you pull it; both Meta and Microsoft have been known to update licensing terms between model versions.

**Granite Guardian 3 8B** and the 2B sibling are released under Apache 2.0 at https://huggingface.co/ibm-granite/granite-guardian-3.0-8b — the most permissive license on this list. No use restrictions, no MAU caps, no prohibited-use policy that can change under you. For enterprise procurement, this is materially valuable. Your legal team's review of an Apache 2.0 model takes hours; their review of the Gemma or Llama license can take weeks if they have not already done it for another project. If license headache reduction is a real priority, Granite Guardian is the cleanest answer.

**WildGuard 7B** is also Apache 2.0 per https://huggingface.co/allenai/wildguard. Combined with the published training data (WildGuardMix), it is the most auditable option for organizations that need to demonstrate to regulators or customers exactly how the safety layer was trained. The trade-off is that Allen AI is a research lab, not an enterprise vendor — there is no SLA, no patch cadence, no support contract. Treat it as a strong reference implementation, not a vendor relationship.

The pragmatic license ranking for 2026 commercial deployments: Apache 2.0 (Granite Guardian, WildGuard) is the easiest legal sell, the Llama 3 Community License (Llama Guard, Prompt Guard) is fine for almost everyone except the largest platforms, and the Gemma Terms of Use (ShieldGemma) is fine if your legal team has already cleared Gemma for another project. If your legal review velocity is the bottleneck, start with the Apache 2.0 options. If your engineering team is already on a Llama or Gemma stack, the friction of staying in-family usually beats the friction of switching license families.


Architecture: how to wire a safety classifier into an LLM application

The standard sidecar architecture in 2026 puts a safety classifier on both sides of the main LLM call: input classification before the prompt hits your generation model, output classification before the response hits the user. **Llama Guard 3** and **Granite Guardian** support this dual-direction classification natively — you pass them a conversation including the assistant turn and they return the same structured verdict. **ShieldGemma** can do the same but you query four harm types per direction, so the call count multiplies. **Prompt Guard** only handles the input side. **WildGuard** scores prompt and response in a single call.

Latency budgets matter. A real-time chat application has roughly 300 ms of headroom for safety classification before users start noticing lag. **Prompt Guard 86M** at 3-8 ms per request on an A100 fits trivially. **ShieldGemma 2B** at 20-40 ms fits comfortably. **Llama Guard 3 8B**, **Granite Guardian 8B**, **ShieldGemma 9B**, and **WildGuard 7B** all land in the 70-160 ms range — fine as a single-direction classifier but starting to bite if you call them twice per turn. The standard pattern is to run Prompt Guard as a cheap edge pre-filter, then escalate to a bigger classifier only when the small one is uncertain or flags a category that needs nuance.

Serving stack choice is roughly as impactful as model choice. vLLM, TGI, and SGLang all support the major architectures (Llama, Gemma, Granite, Mistral) and can multiplex safety classifier requests with main inference. For a production deployment, batching is the lever — a single A100 running vLLM can serve roughly 200-400 Llama Guard 3 8B requests per second with continuous batching, versus 20-40 requests per second naive. Read https://docs.vllm.ai/en/latest/ for the canonical batching guidance before sizing your fleet.

Edge deployment changes the math. **Prompt Guard 86M** is small enough to run on CPU at single-digit milliseconds, which means you can deploy it at the API gateway layer (Cloudflare Workers AI, Vercel Edge, AWS Lambda) before any GPU request is dispatched. This is the cheapest possible architecture for input filtering — you reject obvious jailbreaks for the cost of a CPU inference, and only pay GPU cost for traffic that passes the screen. Many production teams in 2026 run Prompt Guard at the edge and Llama Guard 3 or Granite Guardian in the application tier.

Quantization is the second lever. The community has published 4-bit and 8-bit GGUF quants for every model on this list at https://huggingface.co/models?search=guard+gguf. A 4-bit quant of Llama Guard 3 8B fits in roughly 5 GB of VRAM and runs on consumer GPUs (RTX 4090, A10G), opening up dev-environment and on-prem deployments that would otherwise need datacenter hardware. Expect a 1-3 percent F1 degradation from quantization — measurable but rarely material for safety classification, where you tune thresholds anyway.

The architectural mistake to avoid: using the same model for generation and classification. Some teams have asked their main GPT-4o or Claude or Llama 3 70B model to also moderate its own output via a meta-prompt. This works for prototypes and fails at production scale because (a) the cost per moderated turn doubles, (b) you lose the ability to log moderation decisions separately for audit, and (c) the failure modes correlate — a jailbreak that compromises the generator probably also compromises the meta-prompt moderator. Use a separate, smaller, purpose-built classifier.


Benchmarks and the trust-but-verify problem

Every model card on this list reports favorable benchmark numbers, and every one of them is technically correct on their chosen test set. **Llama Guard 3 8B** reports F1 around 0.92 on Meta's internal hazard test set per https://huggingface.co/meta-llama/Llama-Guard-3-8B, which is competitive with the ShieldGemma family on the same categories. **ShieldGemma** reports area-under-precision-recall-curve (AU-PRC) numbers favorable versus the original LlamaGuard 1 baseline per https://huggingface.co/google/shieldgemma-2b. **Granite Guardian 3 8B** reports outperforming Llama Guard 3 on aggregate harm benchmarks per https://huggingface.co/ibm-granite/granite-guardian-3.0-8b.

**WildGuard 7B** is the most transparent benchmark story — the WildGuardTest evaluation set is publicly available and the WildGuard paper at https://arxiv.org/abs/2406.18495 reports outperforming GPT-4 on adversarial prompts. This is meaningful, but the test set is itself a research artifact, and adversarial prompts are a fast-moving target. A classifier that wins on 2024-era jailbreaks may struggle on 2026-era ones, and vice versa. Treat published benchmark wins as a directional signal, not a guarantee.

**Prompt Guard 86M** reports better than 92 percent jailbreak detection accuracy per https://huggingface.co/microsoft/Prompt-Guard-86M. This number is real but contextual — it is measured on a curated test set, and real-world jailbreaks include novel techniques (e.g., translation-based attacks, base64 encoding, role-play wrappers) where the model is less reliable. Use Prompt Guard as a high-recall first-line filter, then accept that some attacks will pass through to deeper classification.

The published research worth reading before committing to a model: the Llama Guard paper at https://arxiv.org/abs/2312.06674 (original LlamaGuard), the ShieldGemma technical report at https://arxiv.org/abs/2407.21772, the WildGuard paper at https://arxiv.org/abs/2406.18495, and the IBM Granite Guardian report at https://arxiv.org/abs/2412.07724. Each describes methodology and limitations honestly enough that you can form an opinion before you deploy.

An MLPerf-style standardized safety benchmark does not yet exist as of June 2026. The closest thing is the MLCommons AI Safety v0.5 benchmark suite at https://mlcommons.org/benchmarks/ai-safety/, which is the hazard taxonomy Llama Guard 3 aligns to. Until a v1 with broader vendor participation lands, cross-vendor benchmark comparisons are best treated as suggestive. The OpenAI Moderation API and Azure Content Safety are notably absent from open benchmarks because they are closed-weights — direct comparison requires building your own eval harness on traffic you control.

The correct posture in 2026: pick two models from this list, build a 1,000-prompt evaluation set drawn from your actual application traffic plus public adversarial prompts (Anthropic's HH-RLHF red-team subset at https://huggingface.co/datasets/Anthropic/hh-rlhf is a reasonable starting point), and measure precision and recall on your data. Vendor benchmarks tell you the model can perform; your own eval tells you it will perform for your use case. Budget a real engineering week for this work — it is the single most valuable thing you will do before signing up to operate a safety sidecar in production.


Real use-case decision matrix: which classifier to deploy where

If you are already running Llama 3 inference and you want a safety sidecar that aligns to MLCommons hazard categories with the broadest language coverage, deploy **Llama Guard 3 8B**. The 14-category taxonomy maps cleanly to most enterprise content policies, the multilingual support (EN, FR, DE, HI, IT, PT, ES, TH) is the best on this list, and the dual-direction prompt+response classification simplifies the architecture. Verify the model card at https://huggingface.co/meta-llama/Llama-Guard-3-8B before deployment. For latency-sensitive paths, swap in Llama Guard 3 1B from https://huggingface.co/meta-llama/Llama-Guard-3-1B.

If you want per-category confidence scores so you can tune thresholds independently for harassment, hate, sexual, and dangerous content — and you are already on a Gemma stack — deploy **ShieldGemma 9B**. The 9B size is the sweet spot for accuracy; the 2B variant is the right pick when latency dominates and you are willing to trade some precision. Both are documented at https://huggingface.co/google/shieldgemma-9b and https://huggingface.co/google/shieldgemma-2b. ShieldGemma 27B at https://huggingface.co/google/shieldgemma-27b exists but the marginal accuracy over 9B rarely justifies the GPU cost.

If you need to detect prompt injection and jailbreak attempts cheaply at the edge before any GPU inference fires, deploy **Prompt Guard 86M**. It is the only model on this list that is small enough to run on CPU at single-digit milliseconds, which makes it the right choice for the input-layer screen in front of your main LLM. Pair it with a heavier classifier downstream — Prompt Guard is a screen, not a complete policy. Model card at https://huggingface.co/microsoft/Prompt-Guard-86M.

If you are an enterprise that needs an Apache 2.0 license, broad category coverage including hallucination detection, and the option of a 2B-or-8B size choice within one family, deploy **Granite Guardian 3 8B** for accuracy or **Granite Guardian 3 2B** for latency. The hallucination category is a real differentiator for RAG applications — no other open-source safety classifier on this list scores faithfulness natively. Model cards at https://huggingface.co/ibm-granite/granite-guardian-3.0-8b and https://huggingface.co/ibm-granite/granite-guardian-3.0-2b.

If you are running a research or evaluation pipeline where reproducibility and auditability matter more than vendor SLAs, deploy **WildGuard 7B**. The published training data and benchmark set let you trace every classification decision back to a transparent dataset. It is the right pick for academic work, regulator-facing audits, and red-team evaluation harnesses. For pure production use cases, the other options have more deployment polish. Model card at https://huggingface.co/allenai/wildguard.

The honest worst-case scenario: most teams in 2026 should run two classifiers in series, not one. Prompt Guard 86M at the edge for cheap input screening, then Llama Guard 3 8B or Granite Guardian 8B in the application tier for full prompt+response classification. This costs roughly 20 percent more inference than a single classifier but materially improves both latency (cheap reject path) and precision (specialized models for specialized jobs). Skipping the edge screen and running an 8B classifier on every turn is the most common over-spend pattern.


Build vs. buy: when to skip open-source and use a managed safety API

Some teams ask whether they should skip the open-source classifier route and use OpenAI's Moderation API at https://platform.openai.com/docs/guides/moderation, Azure AI Content Safety at https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety, AWS Bedrock Guardrails at https://aws.amazon.com/bedrock/guardrails/, or Anthropic's built-in safety features documented at https://www.anthropic.com/safety. For most low-volume early-stage applications, the answer is yes — managed APIs are zero-ops, well-supported, and the per-request cost is often cheaper than running your own GPU when traffic is sparse.

The math flips at scale. OpenAI's Moderation API is free for OpenAI customers at the time of writing, but it only classifies content against OpenAI's content policy — which may not match your product policy. Azure AI Content Safety is priced per 1,000 text records (verify at https://azure.microsoft.com/en-us/pricing/details/cognitive-services/content-safety/) and the cost adds up fast in high-volume chat applications. AWS Bedrock Guardrails is priced per 1,000 text units (verify at https://aws.amazon.com/bedrock/pricing/). For a 50,000-DAU chat app with 20 turns per session, you are looking at 1M+ moderation calls per day; the managed API bill can easily exceed the GPU cost of a self-hosted Llama Guard 3 deployment.

Policy customization is the other lever. Managed APIs ship with fixed categories. **Llama Guard 3** can be fine-tuned on your specific policy categories using the recipe at https://github.com/meta-llama/llama-cookbook/tree/main/recipes/responsible_ai. **ShieldGemma** supports custom-policy prompting where you pass the harm definition at inference time. **Granite Guardian** supports both fine-tuning and prompted policies. If your content policy includes categories that do not map to vendor defaults — say, brand-specific prohibited mentions, regulated-industry disclosure rules, or platform-specific norms — open-source customization is meaningfully better than trying to bend a managed API.

Latency is the third lever. Round-tripping every chat turn to a managed safety API adds 50-200 ms of network latency on top of inference. Self-hosting the classifier in the same VPC or on the same GPU node as your main LLM eliminates that round-trip. For real-time voice or low-latency chat use cases, the architectural simplicity of co-located inference often outweighs the operational complexity of running open weights. Verify with your specific provider — Cloudflare's hosted Llama Guard at https://developers.cloudflare.com/workers-ai/models/ runs at edge and can be faster than OpenAI Moderation from many regions.

The hybrid pattern that works in 2026: use a managed API for your first 100,000 monthly requests while you validate your policy and traffic shape, then migrate to a self-hosted open-source classifier once volume and category-specificity justify the engineering work. Most teams that try to skip the managed-API phase end up under-investing in their own evaluation harness and ship a worse classifier than the managed alternative.

The bottom line on build-vs-buy: managed APIs are the right starting point for sub-100k monthly request volumes, custom-policy products at any volume should run open-source classifiers with fine-tuning, and very high-volume products (10M+ monthly requests) almost always end up running their own infrastructure regardless of where they started. If you are evaluating the per-token economics, the OpenAI API cost calculator handles the moderation-call math alongside your generation costs.


Implementation timeline: 30, 60, 90 days to a real safety sidecar

Days 1 to 10: pick a model and stand up inference. Pull either Llama Guard 3 8B, Granite Guardian 8B, or ShieldGemma 9B onto a development GPU (an A10G or single A100 is sufficient for testing). Serve it with vLLM or TGI behind a basic HTTP endpoint. Send a curl request, verify you get a structured classification back. Most teams underestimate this step — getting a vLLM container running with the right flash-attention version and correctly-sized KV cache takes a real day, not an hour. Reference the vLLM quickstart at https://docs.vllm.ai/en/latest/getting_started/quickstart.html.

Days 11 to 25: build the evaluation harness. Assemble a 1,000-prompt test set drawn from your application traffic (anonymized), plus public adversarial sets like the Anthropic HH-RLHF red-team subset, plus your custom policy violations. Score precision and recall by category on your candidate classifier. Compare against the OpenAI Moderation API as a baseline — even if you do not plan to use it in production, it is a useful reference. Document the failure modes — Llama Guard's specific weaknesses on translation-based attacks, ShieldGemma's high false-positive rate on benign sexual-health content, etc. — so you can decide on fine-tuning scope.

Days 26 to 45: wire into the application. Add the classifier as a sidecar call in your request handler — typically a pre-classifier on the user prompt, then a post-classifier on the model response. Decide on the action policy for each verdict (block, redact, escalate to human review, log only). Add structured logging of every classification decision with the prompt hash, classifier verdict, action taken, and timestamp — these logs are critical both for ongoing tuning and for any future audit. Build a small ops dashboard so you can see classification volume, block rate, and category distribution at a glance.

Days 46 to 65: run a shadow deployment. Mirror live production traffic to the classifier without acting on the verdicts. Compare the classifier's decisions to current production behavior (which may be no moderation, or a different model). Identify the categories where the new classifier disagrees with your current behavior and triage them — some disagreements will be the new classifier catching real harms, some will be false positives that need threshold tuning or category-specific exemptions. This step is the single biggest determinant of launch success or failure.

Days 66 to 80: progressive rollout. Enable the classifier for 1 percent of traffic with full blocking action. Monitor false-positive complaints and user-facing errors. Ramp to 10 percent at day 70, 50 percent at day 75, 100 percent at day 80 if metrics hold. Keep the kill switch trivially accessible — every safety sidecar deployment in production should have a single-flag bypass for the case where the classifier itself misbehaves. Operationally, the classifier is a critical-path dependency; treat it like one.

Days 81 to 90: tune and document. Review classification logs from full rollout, identify the highest-volume false positives and false negatives, decide whether to retune thresholds, fine-tune the model on a custom dataset, or accept the current performance. Document the deployment architecture, the eval results, the rollout playbook, and the kill-switch procedure for the on-call team. The classifier is now a production system — it needs an owner, a runbook, and a quarterly review cadence. The teams that skip this last step end up with a classifier that nobody updates while jailbreak techniques evolve underneath them.


The opinionated 2026 pick: what I would deploy

If I were architecting a new LLM application tomorrow with a real budget and a real platform team, I would deploy **Prompt Guard 86M** at the edge plus **Llama Guard 3 8B** in the application tier. Prompt Guard catches the cheap obvious attacks for the cost of a CPU inference; Llama Guard handles the full policy classification on the traffic that passes the screen. The combined latency is roughly 100-160 ms per moderated turn, the GPU footprint is modest, and the MLCommons-aligned hazard categories are the easiest sell to a security or trust-and-safety leader who needs to explain the architecture upstream. Verify both at https://huggingface.co/microsoft/Prompt-Guard-86M and https://huggingface.co/meta-llama/Llama-Guard-3-8B.

If license headache is the dominant constraint — for instance, you are inside a regulated industry, your legal team has not approved either the Llama 3 Community License or the Gemma Terms of Use, and you cannot afford a 6-week procurement cycle — I would deploy **Granite Guardian 3 8B** under Apache 2.0 plus a fine-tuned version of the 2B for latency-sensitive paths. The added bonus is the built-in hallucination category, which is the right architecture if you are running RAG anywhere in the stack. Model cards at https://huggingface.co/ibm-granite/granite-guardian-3.0-8b and https://huggingface.co/ibm-granite/granite-guardian-3.0-2b.

If I were already heavily invested in the Google ecosystem — running Vertex AI, building on Gemma 2, using Google Cloud as the primary platform — I would deploy **ShieldGemma 9B** plus Prompt Guard 86M at the edge. Staying within the Gemma family reduces operational variance, and the per-category confidence scores are genuinely useful for products where you want to tune harassment thresholds differently from sexual-content thresholds. Verify ShieldGemma at https://huggingface.co/google/shieldgemma-9b and the Gemma terms at https://ai.google.dev/gemma/terms before commit.

If I were running a research lab, an evaluation harness, or anything that will face external audit — academic publication, government regulator, customer security review — I would deploy **WildGuard 7B** as the primary classifier. The published training data is the differentiator no other model on this list offers, and it is the only one where you can credibly answer the question 'show me the data this model was trained on' without a vendor NDA. Model card at https://huggingface.co/allenai/wildguard. Pair with a heavier production classifier if you also need enterprise-grade latency.

The one configuration I would actively avoid in 2026 is running a single classifier on every chat turn without a cheap pre-filter. Spending an 8B GPU inference on every benign 'what is the capital of France' prompt is wasteful and noticeably degrades the end-user latency on the long-tail of slow requests. The Prompt-Guard-at-edge pattern is essentially free engineering effort for a measurable latency improvement and a meaningful GPU-cost reduction at scale. If you remember one architectural principle from this article, that is the one.

The other thing I would not do is treat 'safety classifier' and 'content moderation' as a solved problem because you deployed one of these models. The model is a tool. The policy decisions — what counts as harmful, how to handle false positives, how to escalate edge cases — are the actual work, and they are organizational decisions that no open-source model can make for you. Deploy the classifier, then invest the same engineering effort in the human policy and review workflow on top. The classifier without the policy is a liability; the policy without the classifier is too expensive to operate at scale. You need both.

How to pick and deploy an open-source LLM safety classifier for your team

  1. 1

    Step 1: Define your content policy before you pick a model

    Write a one-page document listing every category your safety layer needs to enforce, with concrete examples of what should be blocked and what should be allowed. Map each policy category to one or more model output categories — does Llama Guard 3's S6 (specialized advice) cover your medical-disclaimer policy, or do you need a custom fine-tune? Does ShieldGemma's four-category model handle your harassment definition, or are you carving harassment differently? This document drives the model choice, not the other way around. Teams that pick a model first and write the policy later end up shoehorning their policy into the model's defaults, which produces both false positives that frustrate users and false negatives that surface in incident postmortems. Spend the day on the policy document before you spin up the first GPU. If you cannot articulate the policy in plain English, no classifier will save you.

  2. 2

    Step 2: Build a real evaluation harness on your traffic

    Assemble a 1,000-prompt test set drawn from three sources: anonymized samples of your actual production traffic (this is the most important), public adversarial sets like the Anthropic HH-RLHF red-team subset at https://huggingface.co/datasets/Anthropic/hh-rlhf, and synthetic edge cases generated against your specific policy categories. Run each candidate classifier against the test set and measure precision and recall by category. Critically, also measure agreement with your current production behavior, whether that is a managed API, a different open-source model, or no moderation at all. The disagreement set is where the real engineering work lives. Budget a real engineering week for this — it is the single most valuable activity in the entire deployment process and the most commonly skipped.

  3. 3

    Step 3: Stand up the inference stack in shadow mode first

    Deploy your chosen classifier on a vLLM, TGI, or SGLang server behind an internal endpoint, then mirror live production traffic to it without acting on the verdicts. Run shadow mode for at least 2 weeks. Compare the classifier's decisions to current production behavior, log every disagreement, and review the highest-volume false positives and false negatives with your trust-and-safety team. This is where you will discover the surprises — the benign prompts the classifier flags, the harmful prompts it misses, the categories where the model output is unreliable enough to require fine-tuning. Shadow mode is the deployment phase where bad ideas get caught before they hit users. Resist any pressure to skip this phase to hit a launch date. Verify your serving stack at https://docs.vllm.ai/en/latest/getting_started/quickstart.html.

  4. 4

    Step 4: Progressive rollout with a kill switch wired in from day one

    Enable the classifier for 1 percent of production traffic with full blocking action. Monitor false-positive complaints, user-facing error rates, and downstream pipeline latency. Ramp to 10 percent at day 5, 50 percent at day 10, 100 percent at day 15 if and only if metrics hold. Keep the kill switch trivially accessible — a feature flag, a single config value, a one-line code change that the on-call engineer can revert at 3 AM. Every safety sidecar in production should have a documented bypass procedure for the case where the classifier itself misbehaves (model server down, latency spike, false-positive flood). Treat the classifier as a critical-path dependency from day one, because that is what it is. Plan capacity for at least 2x peak traffic to absorb spikes without throttling.

  5. 5

    Step 5: Establish a quarterly review cadence and an owner

    The classifier is a living system. Jailbreak techniques evolve, content policy changes, your traffic mix shifts, and model providers ship new versions every 3 to 6 months. Designate a named owner (typically on the platform or trust-and-safety team), schedule a quarterly review meeting, and treat it as a real engineering review with real action items. At each review: pull the false-positive and false-negative samples from the prior quarter, decide whether to retune thresholds or fine-tune the model on new examples, check for new model releases from Meta, Google, Microsoft, IBM, and Allen AI worth evaluating, and update the runbook. The teams that skip this step end up with a classifier that nobody updates while attack techniques drift underneath them — and discover the gap when a screenshot ends up on X. Verify model card updates at https://huggingface.co/meta-llama/Llama-Guard-3-8B and https://huggingface.co/ibm-granite/granite-guardian-3.0-8b each quarter.

Frequently Asked Questions

Is Llama Guard 3 8B really better than ShieldGemma 9B for production deployments?

Not categorically — they are designed for different workflows. **Llama Guard 3 8B** returns a single structured verdict with the violated category code from a 14-category MLCommons-aligned taxonomy, which is the right output shape for most enterprise content policies. **ShieldGemma 9B** returns per-category confidence scores across four harm types, which is better when you need to tune thresholds independently per category. Llama Guard wins on language coverage (8 languages versus ShieldGemma's English-first design) and category breadth. ShieldGemma wins on per-category granularity and on teams already invested in the Gemma stack. Verify both at https://huggingface.co/meta-llama/Llama-Guard-3-8B and https://huggingface.co/google/shieldgemma-9b. As of June 2026, the published cross-vendor benchmarks remain non-comparable, so run your own evaluation on your traffic before committing.

Can I use Prompt Guard 86M as my only safety classifier, or do I need a bigger model alongside it?

Prompt Guard 86M is a screen, not a complete policy engine. It classifies prompts as BENIGN, INJECTION, or JAILBREAK — that is all. It does not classify content against harm categories (hate, sexual, violence, etc.) and it does not classify model outputs at all. Per https://huggingface.co/microsoft/Prompt-Guard-86M, the model is explicitly designed as a lightweight pre-filter, not a full moderation stack. The right architecture is Prompt Guard at the edge for cheap input screening plus a heavier classifier (Llama Guard 3 8B, ShieldGemma 9B, or Granite Guardian 8B) in the application tier for full policy enforcement. Using Prompt Guard alone leaves you exposed on the output-classification side and on harm-category policy enforcement.

Which safety classifier has the most permissive license for commercial use?

**IBM Granite Guardian 3** (both 2B and 8B) and **Allen AI WildGuard 7B** are both released under Apache 2.0, which is the most permissive license on this list. No use restrictions, no MAU caps, no prohibited-use policy that can be updated unilaterally by the vendor. **Llama Guard 3** uses the Llama 3.1 Community License, which is permissive for most commercial use but includes a 700M-MAU restriction and an attribution requirement. **ShieldGemma** uses the Gemma Terms of Use which includes a prohibited-use policy that Google can update. For procurement teams that need to clear license review quickly, Apache 2.0 is the fastest path. Verify the current license on the model card at the moment you pull weights, since vendors have updated terms between releases.

What is the realistic latency overhead of running a safety sidecar on every chat turn?

On an A100 with FP16, single-request latency runs roughly 3-8 ms for Prompt Guard 86M, 20-40 ms for ShieldGemma 2B, and 70-160 ms for the 7-9B models (Llama Guard 3 8B, ShieldGemma 9B, Granite Guardian 8B, WildGuard 7B). For a typical chat application, plan on 100-200 ms of total moderation overhead per turn if you run input-plus-output classification through an 8B model. Batching via vLLM dramatically improves throughput (200-400 requests per second per A100) but does not reduce single-request latency. The Prompt-Guard-at-edge plus heavier-classifier-in-tier architecture is the standard pattern for keeping per-turn latency under 200 ms while maintaining full policy coverage. Verify your serving stack at https://docs.vllm.ai/en/latest/.

Do these classifiers detect hallucination, or just unsafe content?

Only **IBM Granite Guardian 3** explicitly includes a hallucination category in its standard output — the model card at https://huggingface.co/ibm-granite/granite-guardian-3.0-8b documents a faithfulness/groundedness score useful for RAG applications. The other classifiers on this list (Llama Guard 3, ShieldGemma, Prompt Guard, WildGuard) are policy-and-harm classifiers and do not score factuality natively. If you need both safety and hallucination detection in a single sidecar, Granite Guardian is the right pick. Otherwise, pair your safety classifier with a separate hallucination-detection model (Vectara HHEM at https://huggingface.co/vectara/hallucination_evaluation_model is a common choice) or use an LLM-as-judge pattern against your retrieved context.

How often should I re-evaluate or re-fine-tune my safety classifier in production?

Quarterly at minimum, monthly if you are in a high-volume consumer-facing product or a high-regulatory-risk vertical. Jailbreak techniques and adversarial prompt patterns evolve faster than model release cycles — a classifier that was state-of-the-art in Q1 2026 may have measurable gaps by Q4. At each review, pull the false-positive and false-negative samples logged from production, check for new releases from Meta (https://huggingface.co/meta-llama), Google (https://huggingface.co/google), Microsoft (https://huggingface.co/microsoft), IBM (https://huggingface.co/ibm-granite), and Allen AI (https://huggingface.co/allenai) worth evaluating, and decide on threshold retuning or fine-tuning scope. Treat the classifier as a system you operate, not a model you deploy once.

Can I run any of these classifiers on CPU or consumer hardware?

**Prompt Guard 86M** runs comfortably on CPU at single-digit milliseconds and is the right choice for edge deployment (Cloudflare Workers AI, Vercel Edge, AWS Lambda). 4-bit quantized versions of **Llama Guard 3 8B**, **ShieldGemma 9B**, **Granite Guardian 8B**, and **WildGuard 7B** all fit in roughly 5-6 GB of VRAM and run on consumer GPUs like the RTX 4090 or A10G. Latency on consumer hardware is roughly 2-4x slower than A100 numbers, which is fine for development and on-prem deployments but rarely acceptable for high-volume production. Community GGUF quants for every model on this list are available at https://huggingface.co/models?search=guard+gguf. Expect a 1-3 percent F1 degradation from 4-bit quantization, which is usually immaterial relative to threshold-tuning gains.

Should I just use OpenAI's Moderation API instead of running my own classifier?

For sub-100,000-monthly-request volumes and standard content policies that map cleanly to OpenAI's categories, yes — the Moderation API at https://platform.openai.com/docs/guides/moderation is free for OpenAI customers, well-supported, and zero-ops. The math flips when (a) your traffic exceeds 1M monthly moderation calls, (b) your content policy includes categories not covered by OpenAI's defaults (brand-specific prohibitions, regulated-industry rules, platform-specific norms), or (c) you need to control latency tightly via co-located inference. Most production teams in 2026 start with the OpenAI Moderation API or Azure AI Content Safety, then migrate to a self-hosted open-source classifier when volume or customization justifies the engineering work. Skipping the managed phase usually leads to under-investment in evaluation tooling.

What is the most common mistake teams make when deploying an open-source safety classifier?

Running an 8B classifier on every chat turn without a cheap pre-filter. The architecture pattern that works in 2026 is **Prompt Guard 86M** at the edge for cheap input screening (3-8 ms on CPU) followed by **Llama Guard 3 8B** or **Granite Guardian 8B** in the application tier only on traffic that passes the screen. This reduces GPU cost by 30-50 percent in typical traffic mixes and noticeably improves long-tail latency. The second most common mistake is treating the classifier as a one-time deployment rather than a system that needs quarterly review against evolving jailbreak techniques. The third is skipping shadow-mode evaluation and going straight to blocking action — the false-positive surprises always exceed the pre-deployment estimate. Verify the cookbook recipes at https://github.com/meta-llama/llama-cookbook before committing.

You now know which open-source safety classifier to deploy. Now make every prompt your LLM stack runs actually hit.

AI Prompt Generator builds production-ready system prompts that work across ChatGPT, Claude, Gemini, Llama Guard sidecars, ShieldGemma pipelines, and every other AI tool in this article — so your red-team evals and safety reviews get sharper data, not generic AI fluff. Stop tweaking prompts by hand and start shipping prompts that drive measurable lift. 14-day free trial, no credit card required.

Browse all prompt tools →