By The DDH Team · Digital Dashboard Hub

OpenAI Safety Features Explained: Moderation API, Preparedness Framework, System Cards, Whisper, DALL-E 3, and Azure Overlays — Real Coverage, Real Trade-offs (2026)

OpenAI ships a six-layer safety stack in 2026 and almost no one uses all of it. The free Moderation API (omni-moderation-latest) catches the obvious stuff. The Preparedness Framework governs frontier-model release. System cards tell you what was red-teamed and what wasn't. Whisper has its own quirks. DALL-E 3 ships C2PA plus invisible watermarks. Azure OpenAI bolts on a configurable Content Filter, with ZDR by default for approved tenants. Sources cited inline, June 2026.


If you build on OpenAI in 2026, your trust-and-safety story is not one product — it is six different products glued together, plus whatever you add on top. The free Moderation API at https://platform.openai.com/docs/guides/moderation catches obviously unsafe inputs and outputs. The model itself refuses sensitive requests through trained safety policies described in each system card. The Preparedness Framework at https://openai.com/safety/preparedness/ governs whether frontier models like GPT-5 or o3 even ship. Whisper and DALL-E 3 have their own safety layers. And Azure OpenAI Service wraps the whole thing in a configurable Content Filter that most enterprise buyers default to. Pick the wrong combination and you either ship unsafe content to users or you ship a product so refusal-heavy it cannot answer basic questions. Before you finalize the safety architecture, sanity-check the cost side with the OpenAI API cost calculator so the moderation latency and token overhead survive contact with the P99 budget.

The **Moderation API** (omni-moderation-latest) is OpenAI's free classifier for text and image inputs, covering 13 harm categories per https://platform.openai.com/docs/guides/moderation. The **o-series refusal layer** (o1, o3, o3-mini) is the trained safety behavior baked into the model weights themselves, documented in the o1 and o3 system cards at https://openai.com/safety/. **Azure OpenAI Content Filter** is Microsoft's configurable overlay with severity levels (safe/low/medium/high) per https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter. **Whisper** ships with separate safety considerations around hallucinated transcripts and disallowed-use categories. **DALL-E 3** embeds C2PA content credentials and invisible watermarks per https://help.openai.com/en/articles/8912793-c2pa-in-dall-e-3. **Custom GPT actions** have their own moderation and data exfiltration controls described at https://openai.com/index/introducing-the-gpt-store/. All capability and pricing details in this guide are sourced from OpenAI's documentation, system cards, and Microsoft Learn as of June 2026.

The rest of this guide breaks down what each layer actually does, where it plugs into your stack, what it costs (mostly nothing for the API, real money in latency), and which combinations to ship for which risk tier. You will get a decision matrix, a security-review checklist, a five-step implementation plan, and answers to the nine questions your compliance team will ask. We also compare these safety layers against alternatives in Anthropic's Constitutional AI explained and against the broader landscape in safety features GPT vs Claude vs Gemini.


OpenAI Moderation API, o-series refusal, Azure Content Filter, Custom GPT actions, Whisper, DALL-E 3 — coverage + trade-offs, June 2026

| Feature | OpenAI Moderation API | OpenAI o-series refusal | Azure OpenAI Content Filter | Custom GPT actions safety | Whisper safety | DALL-E 3 safety |
| --- | --- | --- | --- | --- | --- | --- |
| Pricing | Free (omni-moderation-latest) | Included in model token price | Included in Azure OpenAI deployment cost | Included in ChatGPT Plus/Team/Enterprise | $0.006/min transcription (safety included) | $0.040-$0.080/image (safety included) |
| Primary use case | Pre-screen user inputs + post-screen model outputs | In-model refusal of disallowed requests | Configurable enterprise filter with severity tiers | Limit data exfiltration + URL allow-listing for actions | Disclaim hallucinations + flag disallowed audio uses | Block disallowed image generation + provenance |
| False-positive rate (qualitative) | Low for English; moderate for non-English nuance | Moderate: over-refuses borderline medical/legal queries | Tunable per category and severity; lowest if dialed in | Low: narrow scope (data + URL access only) | Low: focuses on transcript accuracy, not refusal | Moderate: strict on real-person likeness and brand |
| Category coverage | 13 categories incl. sexual/minors, harassment, self-harm, hate, violence (per docs) | Categories per system card (CBRN, persuasion, autonomy, cybersecurity) | Hate, sexual, violence, self-harm + jailbreak + protected material + groundedness | Data leakage to external APIs + URL allow-list enforcement | Disallowed-use policy categories (not output classification) | Real people, public figures, copyrighted style, violence, sexual |
| Log retention | Inputs not used for training; 30-day abuse-monitoring retention default | Same as parent API: 30-day abuse-monitoring default | Configurable; ZDR opt-in available for approved customers | Conversations retained per ChatGPT Enterprise policy | 30-day abuse-monitoring default; ZDR available on API | 30-day abuse-monitoring default; ZDR available on API |
| Zero Data Retention (ZDR) | Available on request via OpenAI API enterprise tier | Available on OpenAI API enterprise tier | Available by default for approved Azure tenants | Available on ChatGPT Enterprise + Edu | Available on OpenAI API enterprise tier | Available on OpenAI API enterprise tier |
| Customization | Threshold tuning per category; no custom categories | System prompt overrides limited; cannot disable safety training | Per-deployment severity sliders per category + custom blocklists | Per-GPT action allow-list + data-domain restrictions | Cannot customize; disclaim hallucinations in app layer | Prompt rewriting + style blocklists; cannot disable C2PA |
| Added latency | ~150-400ms per call (text); ~400-800ms (image) | Negligible: runs in the model forward pass | ~100-300ms per request (Microsoft-managed) | Negligible at runtime; review overhead in publishing | Negligible at inference time | Adds 1-3s prompt-rewriting step for some prompts |
| Languages | 40+ languages; strongest in English and major Western languages | Strongest in English; degrades in low-resource languages | Same as underlying OpenAI models | Language-agnostic; depends on action target | 99+ languages transcription per OpenAI | Prompt input multilingual; output style English-tuned |
| Best fit | Any app accepting open-ended user input cheaply | Default protection for anyone calling chat completions | Enterprises needing audit logs + severity tuning + EU residency | Builders shipping plugins in the GPT Store or Enterprise | Voice and call-center apps transcribing real users | Marketing, design, and consumer image generation apps |
| Documentation URL | platform.openai.com/docs/guides/moderation | openai.com/safety + system cards | learn.microsoft.com Azure OpenAI content-filter | platform.openai.com/docs/actions | platform.openai.com/docs/guides/speech-to-text | openai.com/dall-e-3 + C2PA help article |

Sources as of June 2026 — verify at openai.com/safety/, https://platform.openai.com/docs/guides/moderation, https://openai.com/policies/usage-policies/, https://openai.com/safety/preparedness/, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter, and the GPT-4, GPT-4o, GPT-5, o1, and o3 system cards published at openai.com. Model behavior, refusal thresholds, and Azure filter defaults change between releases — confirm in writing before relying on any specific severity setting for compliance.

What each OpenAI safety layer actually does (and the marketing copy to ignore)

The **Moderation API** (omni-moderation-latest as of late 2024 and still current in June 2026) is a free classifier that scores text and image inputs against 13 harm categories — sexual, sexual/minors, harassment, harassment/threatening, hate, hate/threatening, self-harm, self-harm/intent, self-harm/instructions, violence, violence/graphic, illicit, and illicit/violent — per https://platform.openai.com/docs/guides/moderation. The trap most builders fall into is treating it as a yes/no gate. It is not. It returns category scores and boolean flags, and you decide which thresholds map to which actions (block, warn, log-and-allow). Treating any flagged category as auto-block produces a brittle product. Treating the category scores as a tunable risk budget is the design pattern OpenAI's own docs recommend.
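To make the "risk budget" idea concrete, here is a minimal sketch of mapping category scores to actions. It assumes you have already pulled the per-category scores out of the moderation response into a plain dict; the threshold numbers and the `decide_action` helper are hypothetical and exist only for illustration, so tune them against your own traffic.

```python
from typing import Dict, Tuple

# Hypothetical (warn_at, block_at) thresholds per category. These numbers are
# illustrative assumptions, not values recommended by OpenAI.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "sexual/minors": (0.01, 0.05),     # near-zero tolerance
    "self-harm/intent": (0.20, 0.60),
    "harassment": (0.50, 0.90),        # more room for heated but benign text
}
DEFAULT_THRESHOLD: Tuple[float, float] = (0.40, 0.80)


def decide_action(category_scores: Dict[str, float]) -> str:
    """Return 'block', 'warn', or 'log_and_allow' for one piece of content."""
    action = "log_and_allow"
    for category, score in category_scores.items():
        warn_at, block_at = THRESHOLDS.get(category, DEFAULT_THRESHOLD)
        if score >= block_at:
            return "block"  # any category over its block line ends the check
        if score >= warn_at:
            action = "warn"  # keep scanning in case another category blocks
    return action
```

The point of the per-category table is that a 0.06 score means something very different for `sexual/minors` than for `harassment`, which a single global cutoff cannot express.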

The **o-series refusal layer** is not an API — it is behavior trained into the o1, o3, and o3-mini model weights through reinforcement learning from human feedback plus deliberative alignment training. The published o1 system card at https://openai.com/safety/ documents the categories the model refuses (CBRN uplift, autonomy, cybersecurity tasks above a threshold, persuasion), the evaluation methodology, and the residual risk. The marketing line is that o-series models are more aligned than GPT-4o. The honest read is that o-series models think longer about whether to refuse, which both reduces jailbreak success and increases over-refusal on benign medical and legal questions. There is no single dial.

**Azure OpenAI Content Filter** at https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter is Microsoft's per-deployment overlay. Unlike the OpenAI Moderation API, it ships with four severity tiers per category (safe, low, medium, high) plus a separate jailbreak detector, a protected-material classifier, and a groundedness checker for retrieval-augmented apps. Every Azure OpenAI deployment runs Content Filter by default at medium severity across all categories — you do not opt in, you opt out (and the opt-out for sensitive categories requires Microsoft approval). For most enterprise buyers, this is the right default. For consumer apps where over-refusal is a UX disaster, you will spend a week tuning per-category sliders.

**Custom GPT actions safety** is the narrowest layer and the most misunderstood. When a builder ships a GPT in the GPT Store or for ChatGPT Enterprise, the action layer enforces a URL allow-list per https://platform.openai.com/docs/actions, plus a consent flow before sensitive data leaves the conversation. There is no content moderation here beyond the base ChatGPT moderation — the action-layer protection is about data exfiltration and third-party API abuse, not harmful outputs. If you are reviewing a vendor's Custom GPT, that distinction is the single most important question to ask.

**Whisper** safety is mostly about what Whisper does not do. Per OpenAI's Whisper paper and the speech-to-text docs at https://platform.openai.com/docs/guides/speech-to-text, Whisper transcribes audio and returns text — it does not refuse to transcribe specific audio. The safety considerations are upstream (disallowed audio uses per the usage policy) and downstream (Whisper is known to occasionally hallucinate text on silent or noisy segments, which matters in medical and legal transcription). The mitigation is at the application layer: confidence scoring, manual review for medical-grade transcripts, and clear disclaimers in the UI.

**DALL-E 3** ships two notable safety features beyond model-level refusal: C2PA content credentials per https://help.openai.com/en/articles/8912793-c2pa-in-dall-e-3, and an invisible watermark embedded in the image pixels. C2PA is a signed metadata standard for provenance — when a verifier checks a DALL-E 3 image, it gets a cryptographic record that the image was generated by DALL-E 3 and not, for example, captured by a camera. The invisible watermark is robust to compression and cropping. There is also a prompt-rewriting step that softens requests likely to violate policy, which is why your literal prompt often returns an image that interprets it loosely. That rewriting is opaque and frustrating but reduces refusal rates on borderline inputs.


Architecture: how each layer plugs into your stack

The reference architecture OpenAI recommends in its safety best practices at https://platform.openai.com/docs/guides/safety-best-practices puts the **Moderation API** at two points: pre-screen of user input before it is sent to the chat completion endpoint, and post-screen of the model's output before it is rendered to the user. The cost is two extra API calls — both free for text — plus roughly 150 to 400 milliseconds of latency per call. For a chat app where the user is already waiting on streaming tokens, this is invisible. For a real-time voice app where the round-trip budget is under 800ms total, it is significant and you may need to skip the post-screen or run it in parallel and gate the audio stream.
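The two-checkpoint flow can be sketched without committing to any SDK. In the sketch below, `is_flagged` and `complete` are hypothetical callables that you would wire to the moderation endpoint and the chat completion endpoint respectively; the fallback text is also an assumption.

```python
from typing import Callable

FALLBACK = "Sorry, I can't help with that request."


def guarded_reply(
    user_text: str,
    is_flagged: Callable[[str], bool],
    complete: Callable[[str], str],
) -> str:
    """Pre-screen the input, call the model, then post-screen the output."""
    if is_flagged(user_text):   # pre-screen: flagged input never reaches the model
        return FALLBACK
    reply = complete(user_text)
    if is_flagged(reply):       # post-screen: flagged output never reaches the user
        return FALLBACK
    return reply
```

Passing the two calls in as parameters keeps the gate testable with fakes and makes it easy to drop the post-screen for a latency-constrained voice path.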

The **o-series refusal layer** requires no integration work — it runs in the model forward pass and adds no measurable latency beyond the o-series' already-longer reasoning trace. The integration consideration is the reverse: when an o-series model refuses, you need an application-layer handler that detects the refusal pattern, logs it for review, and either escalates to a human or returns a graceful fallback. The published refusal phrases are documented in the o1 system card at https://openai.com/safety/, and most teams ship a small classifier on top to detect them reliably.
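A refusal handler can start as a few patterns before you invest in a trained classifier. The sketch below is a heuristic only: the phrases in `REFUSAL_PATTERNS` are hypothetical examples, not a list taken from any system card, and the escalation message is an assumption. Collect real refusal text from your own logs before relying on it.

```python
import re
from typing import List

# Hypothetical refusal openers, for illustration only.
REFUSAL_PATTERNS = [
    r"\bi can't (help|assist) with\b",
    r"\bi cannot (help|assist|provide|comply)\b",
    r"\bi'm sorry, but i can('t|not)\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def looks_like_refusal(reply: str) -> bool:
    """Heuristic check on the opening of a reply."""
    # Only the first 200 characters are checked; curly apostrophes are normalized.
    head = reply[:200].replace("\u2019", "'")
    return bool(_REFUSAL_RE.search(head))


def handle_reply(reply: str, review_log: List[str]) -> str:
    """Log suspected refusals for review and return a graceful fallback."""
    if looks_like_refusal(reply):
        review_log.append(reply)
        return "I couldn't complete that request. A teammate will follow up."
    return reply
```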

**Azure OpenAI Content Filter** runs inside the Azure-managed inference path — you do not call it separately, it is bolted onto every deployment. Per https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter, it adds roughly 100 to 300 milliseconds of latency, returns severity scores in the response metadata, and can be configured per-deployment through Azure AI Studio. Critically, you cannot disable it entirely without an exception approval from Microsoft — and the approval process for disabling sensitive-category filters requires a documented business justification and is rarely granted to consumer apps.
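For logging, you will usually want to reduce the per-category annotations to "which categories scored at or above some severity". The sketch below assumes the annotation block has the shape `{category: {"filtered": bool, "severity": str}}` and that detector-style entries carry no `severity` key; treat both as assumptions and check the exact field names against the Microsoft Learn page before shipping.

```python
from typing import Dict

SEVERITY_ORDER = {"safe": 0, "low": 1, "medium": 2, "high": 3}


def categories_at_or_above(filter_results: Dict[str, dict], floor: str = "low") -> Dict[str, str]:
    """Return {category: severity} for every entry at or above `floor`."""
    floor_rank = SEVERITY_ORDER[floor]
    hits = {}
    for category, result in filter_results.items():
        severity = result.get("severity")  # detector-style entries have none
        if severity is not None and SEVERITY_ORDER[severity] >= floor_rank:
            hits[category] = severity
    return hits


# Hypothetical annotation block, hand-written for this example.
sample = {
    "hate": {"filtered": False, "severity": "safe"},
    "violence": {"filtered": False, "severity": "low"},
    "sexual": {"filtered": True, "severity": "medium"},
    "jailbreak": {"filtered": False, "detected": False},
}
print(categories_at_or_above(sample))
```

Logging the sub-threshold "low" hits as well as the blocked ones gives you the data you need when you later argue for moving a slider.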

**Custom GPT actions** integrate through an OpenAPI schema plus the authentication method the builder configures (none, API key, or OAuth) per https://platform.openai.com/docs/actions. The safety architecture is two-part: at design time, the builder declares the URL allow-list and the data the action will access; at runtime, the user explicitly consents before any data is sent to the third-party API. There is no proxy or filter between the GPT and the action endpoint: the safety guarantee is the consent flow plus the allow-list, nothing more. If your security team treats this as equivalent to an enterprise API gateway, they will be unpleasantly surprised.
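Because nothing sits between the GPT and your endpoint, it is worth mirroring the allow-list inside your own action backend. The following is a sketch of that idea, not a description of how OpenAI enforces its side; `ALLOWED_HOSTS`, `ALLOWED_FIELDS`, and the example domain are hypothetical.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"api.example.com"}    # hypothetical onward-call targets
ALLOWED_FIELDS = {"order_id", "status"}  # hypothetical fields allowed to leave


def is_allowed(url: str) -> bool:
    """Accept only https URLs whose exact hostname is on the allow-list."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def scrub_payload(payload: dict) -> dict:
    """Drop every field that was not declared at design time."""
    return {key: value for key, value in payload.items() if key in ALLOWED_FIELDS}
```

Matching on the parsed hostname rather than a substring is the important design choice: `https://api.example.com.evil.test/` contains the allowed string but resolves to a different host.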

**Whisper** integration is a single transcription call per https://platform.openai.com/docs/guides/speech-to-text, with safety entirely at the application layer. Production patterns include confidence-score post-filtering (drop or flag segments below a threshold), profanity filtering applied to the returned text in your own code (the transcription endpoint has no built-in profanity option that we are aware of), voice-activity detection upstream of Whisper to avoid the silent-segment hallucination failure mode, and human review for any transcript feeding a clinical, legal, or financial decision. The Whisper API itself does not refuse audio inputs.
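Confidence-score post-filtering can be as small as the sketch below. It assumes you requested a verbose transcription format that returns per-segment `avg_logprob` and `no_speech_prob` values; the default thresholds are borrowed from the open-source Whisper decoder's defaults and are a starting assumption, not a calibrated setting for medical or legal audio.

```python
from typing import Dict, List, Tuple

Segment = Dict[str, object]


def flag_segments(
    segments: List[Segment],
    min_avg_logprob: float = -1.0,
    max_no_speech_prob: float = 0.6,
) -> Tuple[List[Segment], List[Segment]]:
    """Split segments into (kept, needs_review) using two confidence signals."""
    kept, needs_review = [], []
    for seg in segments:
        suspicious = (
            seg["avg_logprob"] < min_avg_logprob          # model was unsure of the words
            or seg["no_speech_prob"] > max_no_speech_prob  # likely silence or noise
        )
        (needs_review if suspicious else kept).append(seg)
    return kept, needs_review
```

Routing low-confidence segments to a review queue, rather than silently dropping them, preserves the audit trail that clinical and legal workflows tend to need.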

**DALL-E 3** integration adds a prompt-rewriting step ahead of generation, with the rewritten prompt returned in the API response (the `revised_prompt` field). That step can add 1 to 3 seconds of latency for prompts the safety classifier flags as ambiguous. The C2PA metadata and invisible watermark are added automatically; there is no developer-facing option to disable them in the public API. For provenance verification downstream, the open-source C2PA SDKs (https://opensource.contentauthenticity.org/) read the metadata block. Most teams never integrate the verification path, which means the safety feature exists primarily for platforms and journalists, not the builder.


Red-teaming, the Preparedness Framework, and what the system cards actually tell you

OpenAI's **Preparedness Framework** at https://openai.com/safety/preparedness/ is the governance layer that decides whether a frontier model ships at all. It defines four tracked risk categories: cybersecurity, CBRN (chemical, biological, radiological, nuclear), persuasion, and model autonomy. Each is assigned a risk level (low, medium, high, critical) before and after mitigations. As originally published, the framework's commitment is that only models with a post-mitigation score of medium or below can be deployed, and only models with a post-mitigation score of high or below can be developed further. The original framework was published in late 2023 and updated through 2025; verify the current version at https://openai.com/safety/preparedness/ as the tracked categories and scoring rubric have been refined.

**System cards** are the public artifact that tells you what was actually tested. The GPT-4 system card at https://cdn.openai.com/papers/gpt-4-system-card.pdf was the template — it covers red-team methodology, observed harm categories, mitigation evaluation, and residual risk. Subsequent system cards for GPT-4o, GPT-5, o1, and o3 follow the same structure with category-specific updates. The o1 system card is particularly worth reading because it documents the deliberative alignment training that distinguishes o-series safety from GPT-4o safety. If your compliance team is evaluating OpenAI models for a regulated workflow, the system card is the single most important document you will read.

The **red-teaming program** described at https://openai.com/index/red-teaming-network/ recruits external researchers across domains — biosecurity, chemistry, cybersecurity, persuasion, child safety — to probe pre-release models. The program produces structured evaluation results that feed into the Preparedness Framework scoring. As a buyer, you do not get access to the raw red-team results, but the system card summarizes attack categories tested, success rates pre and post mitigation, and residual risk classifications. That summary is your evidence base.

The honest critique of the system-card approach: the documents are comprehensive on the categories OpenAI chooses to evaluate and silent on categories they do not. The GPT-5 system card, when published, covered cybersecurity uplift in depth but said less about niche misuse patterns like targeted disinformation against specific minority groups. If your application has a specific harm profile that is not in the standard four categories, you need to red-team it yourself — the published evaluations will not cover you. Use the system card as a floor, not a ceiling.

For practical procurement, three documents matter: the current system card for your specific model (linked from https://openai.com/safety/), the Preparedness Framework risk classification (published with each major model release), and the usage policies at https://openai.com/policies/usage-policies/. Together these define what OpenAI represents the model can be used for and what residual risk remains. Reference them by version and date in your security review documentation, because OpenAI updates them and the version you signed against is the version you can defend on.

If you operate in the EU under the EU AI Act, the system card plus the Preparedness Framework classification together cover a substantial fraction of the technical documentation required of general-purpose AI model providers under Article 53. They do not substitute for your own downstream obligations, such as the transparency duties in Article 50 or the deployer and high-risk obligations tied to Annex III use cases, which remain your responsibility. The compliance landscape is broken down in OpenAI vs Anthropic data policies.


Real use-case decision matrix: which safety layers to ship for which risk tier

If you are building a consumer chat app for general productivity, ship the **Moderation API** pre-screen on user inputs plus the default **o-series refusal** layer and call it done. The free Moderation API catches the obvious abuse vectors (CSAM attempts, explicit harassment, self-harm escalation), and the o-series in-model refusal handles the long tail. Adding the post-screen on outputs roughly doubles your moderation cost in latency, not dollars — for a non-streaming chat the latency is fine; for streaming, run the post-screen against the full assembled response after streaming completes and gate any retention or sharing on the result.

If you are building a regulated enterprise app — healthcare, financial services, legal — buy **Azure OpenAI** instead of calling the OpenAI API directly. The Azure Content Filter at https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter gives you per-deployment severity sliders, audit logging, ZDR by default for approved tenants, and a Microsoft compliance posture (HIPAA, FedRAMP High, EU data boundary) that the direct OpenAI API does not match out of the box. The trade-off is roughly 2 to 6 weeks of lag behind OpenAI's direct model releases and a higher per-token cost. For regulated buyers, the trade-off is correct.

If you are shipping a Custom GPT for internal or external distribution, treat the action layer as a strict allow-list problem, not a content moderation problem. Define exactly which URLs the GPT can call, exactly which data fields it can pass, and require human consent at every external boundary per https://platform.openai.com/docs/actions. Combine that with the base ChatGPT moderation and the o-series refusal layer that already protects the conversation. Do not assume the GPT Store review process catches data exfiltration risks in your actions — it catches some, not all.

If you are running a voice agent or transcription workflow on **Whisper**, the safety design is upstream and downstream of Whisper, not in Whisper itself. Upstream: voice activity detection to avoid hallucination on silent segments, optional profanity filtering, and consent-to-record flows for any user-facing application. Downstream: confidence-score thresholding, human review for any transcript driving a clinical or financial decision, and clear UI disclaimers about transcription accuracy. The Whisper layer protects against disallowed-use categories under the usage policy but does not protect against transcript errors — that is your job.

If you are running an image generation workflow on **DALL-E 3**, the C2PA metadata plus the invisible watermark per https://help.openai.com/en/articles/8912793-c2pa-in-dall-e-3 give you provenance for downstream verification, but the meaningful safety work is at the prompt layer. Block prompts that name living public figures, restrict trademarked brand references, and surface the DALL-E 3 prompt-rewriting step to the user when it materially changes their request. Marketing teams especially will be confused when their prompt for 'a man who looks like Elon Musk' returns a generic businessman — explain the rewriting once and the complaints stop.
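The prompt-layer checks described above can be sketched in a few lines. This assumes your image call gives you back the rewritten prompt alongside the image so you can compare it with what the user typed; `BLOCKED_NAMES`, both helper functions, and the 0.6 similarity cutoff are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher

BLOCKED_NAMES = {"elon musk"}  # hypothetical list of living public figures


def names_blocked_person(prompt: str) -> bool:
    """Cheap pre-check before spending an image call."""
    lowered = prompt.lower()
    return any(name in lowered for name in BLOCKED_NAMES)


def rewrite_is_material(original: str, revised: str, min_similarity: float = 0.6) -> bool:
    """True when the rewritten prompt differs enough to be worth showing the user."""
    ratio = SequenceMatcher(None, original.lower(), revised.lower()).ratio()
    return ratio < min_similarity
```

A character-level similarity ratio is a blunt instrument, and a long stylistic expansion of a short prompt will trip it. That is arguably the behavior you want here, since the goal is to tell the user whenever the request they typed is not the request that was rendered.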

If you operate in jurisdictions with strict child safety, financial promotion, or political advertising rules — the UK Online Safety Act, EU DSA Article 28, the Singapore POFMA — the Moderation API is necessary but not sufficient. You will need a regional moderation layer on top (Hive, Microsoft Purview, or a vendor like Spectrum Labs) for the jurisdiction-specific categories, plus a human review queue for borderline content. The OpenAI stack is a strong foundation, not a complete substitute for jurisdiction-specific compliance work. Map your obligations in LLM jailbreak prevention before you scope the build.


Pricing and operational cost: what you actually pay for safety

The **Moderation API** is free per https://platform.openai.com/docs/guides/moderation — omni-moderation-latest costs zero dollars per call for both text and image inputs. The real cost is latency, not dollars. Each pre-screen call adds roughly 150 to 400 milliseconds for text and 400 to 800 milliseconds for images. For a streaming chat application with 200-token responses, the pre-screen is a single front-loaded penalty that disappears into the user's perception of streaming start time. For a sub-second voice agent, that 400ms can push you past the round-trip budget and you may need to skip pre-screening or run it asynchronously with a back-off.

The **o-series refusal layer** is included in the model token price; there is no separate safety fee. The cost shows up indirectly in the longer reasoning traces o-series models produce compared to GPT-4o. Per https://openai.com/api/pricing/, o3-mini is roughly $1.10 per million input tokens and $4.40 per million output tokens (verify current pricing, it changes), but a single o-series response often bills 5x to 20x more output tokens than a comparable GPT-4o response, because hidden reasoning tokens are charged as output. Budget accordingly. If you switch from GPT-4o to o3-mini for safety reasons, expect your per-conversation cost to roughly double for the same user-visible output.
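A quick way to check the "roughly double" estimate is to run the arithmetic for your own token counts. In the sketch below the o3-mini prices come from the paragraph above, while the GPT-4o prices ($2.50 input, $10.00 output per million tokens) are an assumption not stated in this guide; the token counts are hypothetical. Confirm every number against the pricing page.

```python
PRICES_PER_MILLION = {
    "o3-mini": {"input": 1.10, "output": 4.40},  # from the text above
    "gpt-4o": {"input": 2.50, "output": 10.00},  # assumption, verify before use
}


def response_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one response; reasoning tokens count toward output_tokens."""
    price = PRICES_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Hypothetical exchange: 1,000 input tokens and 200 user-visible output tokens.
baseline = response_cost("gpt-4o", 1_000, 200)
# Same exchange on o3-mini, assuming 10x billed output once reasoning is included.
reasoning = response_cost("o3-mini", 1_000, 2_000)
print(f"GPT-4o ${baseline:.4f} vs o3-mini ${reasoning:.4f} ({reasoning / baseline:.1f}x)")
```

Under these assumptions a 10x output multiplier lands at about 2.2 times the GPT-4o cost, while 5x is closer to parity and 20x is roughly four times, so the doubling figure sits in the middle of the stated range rather than at its edge.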

**Azure OpenAI Content Filter** adds no per-call fee — it is included in the underlying Azure OpenAI deployment cost per https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter. Azure OpenAI itself prices roughly in line with the direct OpenAI API, occasionally with a 10 to 20 percent premium on specific models. The real cost is the engineering time to tune per-category severity sliders, write the exception-approval business case if you need to lower a sensitive-category filter, and integrate the response-metadata severity scores into your application logging.

**Custom GPT actions safety** is included in the underlying ChatGPT Plus ($20/month), ChatGPT Team ($30/seat/month), or ChatGPT Enterprise (custom pricing typically $60-$100/seat/month per OpenAI sales conversations) per https://openai.com/chatgpt/pricing/. The action layer itself has no separate fee. The hidden cost is the GPT Store review process, which adds publishing delay, and the engineering work to define a tight URL allow-list and consent flow per action.

**Whisper** safety is included in the $0.006 per minute transcription cost per https://openai.com/api/pricing/. For a customer support center processing 100,000 minutes per month, that is $600 in transcription cost — the safety layer adds nothing on top. The operational cost is the upstream voice-activity detection (commodity, near-zero) and the downstream confidence-score handling (engineering time, not API cost). If you switch to Whisper Large via a third-party hosting provider for cost reasons, you lose the OpenAI usage-policy enforcement layer and inherit the third-party's safety story.

**DALL-E 3** safety is included in the $0.040 per standard image and $0.080 per HD image cost per https://openai.com/api/pricing/. The prompt-rewriting step adds 1 to 3 seconds of latency for ambiguous prompts, which matters in interactive design tools but not in batch generation. The C2PA metadata and watermark add no measurable cost. The operational cost is user education — your design and marketing teams will repeatedly ask why their prompts were rewritten or refused, and you will need a UI surface that explains the rewriting transparently. To sanity check the full image-generation budget across DALL-E 3 and alternatives, model it through the OpenAI API cost calculator.


Build vs. buy: when to roll your own moderation layer

Some teams ask whether they should skip OpenAI's free Moderation API and build their own classification stack on fine-tuned classifiers or open-source models like Llama Guard. For pure cost reasons, the answer is almost always no: the OpenAI Moderation API is free, runs at 150 to 400 milliseconds, and covers the 13 categories most consumer apps need. Building an equivalent in-house costs engineering time, infrastructure, ongoing classifier maintenance, and a security review at every model update.

Where build-your-own does make sense: niche category coverage the OpenAI Moderation API does not provide. If you operate a fintech app and need to detect investment-advice solicitation, a customer-service app and need to detect competitor-mention attempts, or a children's education app and need to detect specific developmentally inappropriate content, you will be building category-specific classifiers regardless. The right architecture is OpenAI Moderation API for the standard 13 categories plus your custom classifier for the niche ones, called in parallel.

The hybrid pattern that works in 2026: pre-screen with the free **OpenAI Moderation API** for the standard categories, pre-screen in parallel with **Llama Guard 3** (https://huggingface.co/meta-llama/Llama-Guard-3-8B) or a small fine-tuned classifier for the niche categories, run the **chat completion** with a strict system prompt, and post-screen the output against both classifiers. This adds roughly 300 to 600 milliseconds of total latency, costs about $0.0002 per moderation call for Llama Guard self-hosted, and gives you defense in depth without an enterprise-grade vendor contract.
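Running the two classifiers in parallel means the slower one sets your latency, not the sum of both. Here is a minimal `asyncio` sketch of that fan-out; `standard_categories` and `niche_categories` are hypothetical stand-ins (simple keyword checks with a short sleep) for the Moderation API call and a self-hosted Llama Guard or custom classifier.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

Classifier = Callable[[str], Awaitable[bool]]


async def screen_in_parallel(text: str, classifiers: Sequence[Classifier]) -> bool:
    """Run every classifier concurrently; flag the text if any of them flags it."""
    results = await asyncio.gather(*(clf(text) for clf in classifiers))
    return any(results)


async def standard_categories(text: str) -> bool:
    await asyncio.sleep(0.01)  # stand-in for a network round trip
    return "attack" in text.lower()


async def niche_categories(text: str) -> bool:
    await asyncio.sleep(0.01)  # stand-in for a self-hosted model call
    return "guaranteed returns" in text.lower()


if __name__ == "__main__":
    flagged = asyncio.run(
        screen_in_parallel("Buy now, guaranteed returns!", [standard_categories, niche_categories])
    )
    print("flagged" if flagged else "clean")
```

The same function serves both the pre-screen and the post-screen: call it once on the user input and once on the assembled model output.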

If you go fully self-hosted for moderation — Llama Guard plus open-source classifiers — you take on the responsibility for usage-policy interpretation, category-definition drift, model-version updates, and the audit trail. For a 5-engineer startup, this is usually a distraction from the core product. For a 200-engineer ML org with a dedicated trust-and-safety team, it is a reasonable build. The break-even is rarely about dollars; it is about whether trust-and-safety is core to your product strategy or a compliance overhead.

The build-vs-buy decision interacts with the **Azure OpenAI Content Filter** decision: if you are already running Azure for compliance reasons, the Azure Content Filter is included and tunable, and building a parallel custom classifier on top is overkill for most workloads. If you are running the direct OpenAI API for cost or feature-recency reasons, layering the free Moderation API plus a niche custom classifier replicates most of what Azure gives you for free, with more engineering effort and less Microsoft compliance paperwork.

The bottom line on build vs. buy: do not rebuild what OpenAI ships for free. Do build the category-specific classifiers your domain requires. And do not assume the OpenAI Moderation API alone is enough for regulated workflows — it is the foundation, not the full safety stack. For the broader vendor-comparison view, see safety features GPT vs Claude vs Gemini.


Microsoft Azure OpenAI safety overlay: what changes when you switch providers

When you switch from the direct OpenAI API to **Azure OpenAI Service**, six things change in the safety architecture. First, the **Content Filter** at https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter runs by default on every deployment with four severity tiers per category — you do not have to call it, it is in the path. Second, **ZDR** (no abuse-monitoring log retention) is available by default for approved Azure tenants, where on direct OpenAI it requires an enterprise tier upgrade and explicit ZDR contract addendum.

Third, the **compliance umbrella** changes — Azure OpenAI inherits Microsoft's HIPAA business associate agreement, FedRAMP High authorization for Azure Government, and EU Data Boundary commitment per https://learn.microsoft.com/en-us/azure/compliance/. Direct OpenAI offers an enterprise BAA but does not have FedRAMP High or the EU Data Boundary equivalent in the same packaged form. For US public sector buyers, this is the gating factor — Azure Government is the only path.

Fourth, the **jailbreak detector** and **protected material classifier** ship with Azure as separate filters beyond the core Content Filter. The jailbreak detector flags inputs that attempt to bypass system prompt instructions. The protected material classifier flags outputs that quote substantial copyrighted text or code verbatim. Neither is available as a direct OpenAI API endpoint as of June 2026 — they are Azure-only features per https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter.

Fifth, **model release timing** changes. New OpenAI models typically ship on the direct OpenAI API first and reach Azure 2 to 6 weeks later — sometimes longer for less-prioritized models. If your application depends on day-zero access to the newest model, Azure is the wrong choice. If your application depends on stable, compliance-friendly deployment, the lag is a feature, not a bug.

Sixth, the **groundedness checker** in Azure AI Studio per https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter detects when a RAG response is not supported by the retrieved source documents, which makes it a useful guardrail against hallucination in enterprise search and Q&A apps. There is no direct-OpenAI equivalent as of June 2026. For RAG-heavy enterprise workflows, this alone often justifies the Azure switch even if you do not need the compliance features.

The honest trade-off: Azure OpenAI is the right choice for regulated enterprise buyers who value compliance posture, audit logging, and an integrated Microsoft ecosystem more than they value latest-model access and the cheapest possible per-token price. Direct OpenAI is the right choice for startups, consumer apps, and feature-velocity-driven teams that need GPT-5 or o3 on release day and can absorb the lighter compliance defaults. Most enterprise buyers in 2026 land on Azure for production and direct OpenAI for prototyping — verify the current model availability gap at https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models before assuming.


The opinionated 2026 pick: what I would ship

If I were shipping a new consumer chat product on OpenAI tomorrow, I would ship **GPT-4o or GPT-5 chat completions** with the free **Moderation API** pre-screen on inputs, the default in-model refusal layer, and an asynchronous post-screen on assembled outputs that gates retention and sharing flows. Total added latency: roughly 150 to 400 milliseconds front-loaded for the pre-screen, with the post-screen running in parallel for compliance logging. Total added cost: zero dollars beyond model tokens. This is the right floor for any consumer app accepting open-ended user input.

If I were shipping a regulated enterprise app — healthcare, financial services, US public sector — I would deploy on **Azure OpenAI** with the default Content Filter at medium severity, the jailbreak detector on, the protected material classifier on, ZDR enabled for the tenant, and groundedness checking for any RAG workflow. Latency adds roughly 100 to 300 milliseconds. The compliance posture pays for the model-release lag. Verify per-deployment configuration at https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter.

If I were shipping a voice agent on **Whisper** plus GPT-4o for response generation, I would add voice-activity detection upstream, confidence-score post-filtering on the transcript, the Moderation API on the assembled user turn before it hits the chat endpoint, and human review for any transcript driving a clinical or financial decision. Whisper itself does not refuse audio inputs, so the application layer is the moderation point.
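
The confidence-score post-filter for transcripts can be sketched as below. This assumes you request the `verbose_json` response format, whose segments include `avg_logprob`, `no_speech_prob`, and `compression_ratio`. The thresholds are borrowed from the defaults in the open-source whisper package and are only starting points; the routing into a "review" bucket is an illustrative design, not an OpenAI feature.

```python
from typing import Any

# Starting-point thresholds, borrowed from open-source whisper defaults.
# Tune them against your own audio before trusting them.
MIN_AVG_LOGPROB = -1.0
MAX_NO_SPEECH_PROB = 0.6
MAX_COMPRESSION_RATIO = 2.4


def split_segments(segments: list[dict[str, Any]]) -> tuple[list, list]:
    """Return (kept, review): segments safe to pass on vs. segments to hold."""
    kept, review = [], []
    for seg in segments:
        suspicious = (
            seg.get("avg_logprob", 0.0) < MIN_AVG_LOGPROB            # low decoder confidence
            or seg.get("no_speech_prob", 0.0) > MAX_NO_SPEECH_PROB   # probably silence or noise
            or seg.get("compression_ratio", 0.0) > MAX_COMPRESSION_RATIO  # repetitive text
        )
        (review if suspicious else kept).append(seg)
    return kept, review


# Hypothetical segments, for illustration only.
segments = [
    {"text": "Take two tablets daily.", "avg_logprob": -0.21,
     "no_speech_prob": 0.02, "compression_ratio": 1.3},
    {"text": "Thank you for watching.", "avg_logprob": -1.4,
     "no_speech_prob": 0.81, "compression_ratio": 1.1},
    {"text": "la la la la la la la la", "avg_logprob": -0.3,
     "no_speech_prob": 0.1, "compression_ratio": 3.2},
]
kept, review = split_segments(segments)
print([s["text"] for s in kept], len(review))
```

Only the `kept` text would go on to the Moderation API and the chat endpoint. Anything in `review` is a candidate for a re-prompt ("sorry, could you repeat that?") or for human review.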

If I were shipping an image generation feature on **DALL-E 3** for marketing or design teams, I would surface the prompt-rewriting step transparently in the UI, block prompts naming living public figures or trademarked brands, leverage the C2PA metadata for downstream provenance verification on any image shared externally, and document the watermark behavior in the user-facing FAQ so designers understand they cannot remove it.

If I were shipping a Custom GPT for internal distribution to a 500-person company, I would define a strict URL allow-list per action, require human consent at every external boundary, run the GPT under ChatGPT Enterprise (not Plus or Team) for the ZDR and SSO guarantees, and pair every published GPT with an internal security review that explicitly evaluates the action endpoints, not just the system prompt.
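
A strict URL allow-list is easy to get subtly wrong, so here is a minimal sketch of the check. It assumes you enforce it in a gateway or backend you control that sits in front of the action endpoints; the host names and the function name are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical internal hosts that published GPT actions may call.
ALLOWED_HOSTS = {"tickets.internal.example.com", "hr-api.internal.example.com"}


def is_allowed_action_url(url: str) -> bool:
    """Allow only HTTPS URLs whose exact host is on the allow-list."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    host = (parts.hostname or "").lower()
    return (
        parts.scheme == "https"
        and parts.username is None   # reject user@host tricks
        and port in (None, 443)
        and host in ALLOWED_HOSTS    # exact match, never substring or suffix
    )


print(is_allowed_action_url("https://tickets.internal.example.com/v1/open"))
print(is_allowed_action_url("https://tickets.internal.example.com.evil.test/"))
```

Matching on the parsed host rather than the raw string is the point of the sketch: prefix or substring checks pass look-alike domains and `user@host` URLs.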

The one thing I would not do in 2026 is run an OpenAI-powered product in production without the Moderation API pre-screen on inputs. It costs zero dollars and 150 to 400 milliseconds, and it catches the abuse patterns that show up on day one of any public launch. Skipping it is a choice you will regret when the first scraped CSAM attempt or coordinated harassment campaign hits your logs in week two.

How to implement the OpenAI safety stack for your team

  1.

    Step 1: Map your application's risk tier before choosing layers

    Before you pick which safety layers to deploy, write down the answers to four questions in a single page: who are your users (consumer, enterprise, regulated industry, minors), what is the worst-case harm if the model misbehaves (reputation, regulatory fine, physical safety), what jurisdictions do you operate in (EU AI Act, UK Online Safety Act, US sector regulators), and what is your latency budget (sub-second voice, multi-second chat, batch). A consumer chat app for adults in the US can ship with just the Moderation API and in-model refusal. A clinical decision support tool in the EU needs Azure OpenAI plus groundedness checking plus human review plus EU AI Act high-risk documentation. The layers you choose follow directly from this mapping. Do not pick layers first and rationalize the risk tier after.

  2.

    Step 2: Wire up the Moderation API pre-screen as your floor

    For any application accepting open-ended user input, the first integration is the free Moderation API on the input path. Call POST https://api.openai.com/v1/moderations with omni-moderation-latest before passing the input to the chat completion endpoint. Per https://platform.openai.com/docs/guides/moderation, store the full category-score response in your logs (not just the boolean flagged result) so you can tune thresholds later without re-screening historical data. Define application-layer actions per category: hard block for sexual/minors and self-harm/instructions whenever the category is flagged, regardless of score; warn-and-allow for harassment and hate above 0.7; log-only for everything else. The thresholds are application-specific, so start strict and loosen based on false-positive review of the first 10,000 calls.

  3.

    Step 3: Decide on Azure OpenAI vs direct OpenAI based on compliance, not preference

    If you have a HIPAA workflow, FedRAMP requirement, EU Data Boundary requirement, or any sector regulator that scrutinizes your cloud provider list, deploy on Azure OpenAI. Configure the Content Filter at https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter at medium severity across all four categories as the default. Enable the jailbreak detector and protected material classifier. Enable groundedness checking if you ship any RAG workflow. Request ZDR enrollment if your tenant qualifies. If you do not have a regulated workflow, the direct OpenAI API is fine — but layer the free Moderation API, document your usage policy adherence, and budget for an enterprise tier ZDR contract addendum if your sales cycle demands it. Do not run direct OpenAI in production for a regulated workflow because the engineering team prefers it.

  4.

    Step 4: Read the system card and Preparedness Framework classification for your model

    Before you commit to a specific model in production, read the current system card for that model linked from https://openai.com/safety/ and the Preparedness Framework classification at https://openai.com/safety/preparedness/. Note the model version, the red-team category coverage, the pre-mitigation and post-mitigation risk scores, and any residual-risk caveats. File these documents in your compliance evidence repository with the date and version number, because OpenAI updates them and the version you signed against is the version your auditor will ask about. If your application has a harm profile that the system card does not cover — say, a specific category of disinformation against a specific community — design a red-team evaluation of your own and run it before launch.

  5.

    Step 5: Build an incident-response runbook before you launch

    Define the response path for the three failure modes that actually happen in production: (a) the Moderation API misclassifies a legitimate input and the user complains, (b) the model produces a harmful output that the post-screen missed, and (c) an external researcher reports a jailbreak in your specific application. For each, document who triages, how fast you respond, who notifies the user, and who notifies OpenAI per https://openai.com/policies/usage-policies/. Set up a dedicated channel for safety-incident triage with engineering, trust-and-safety, legal, and communications represented. Run a tabletop exercise within the first 30 days of launch. The teams that ship safely in 2026 are not the teams with the most safety layers — they are the teams that practiced what to do when a layer fails.
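
The pre-screen from Step 2 can be sketched in a few lines of standard-library Python. The endpoint, model name, and response fields (`results[0].categories`, `results[0].category_scores`) follow the Moderation API docs. The function names and the reading of "hard block" as "act whenever the API sets the category flag" are assumptions for illustration, and the 0.7 threshold is the article's example value rather than a recommendation.

```python
import json
import urllib.request

MODERATION_URL = "https://api.openai.com/v1/moderations"

# Application policy from Step 2 (illustrative values, tune per product).
HARD_BLOCK = {"sexual/minors", "self-harm/instructions"}
WARN_THRESHOLDS = {"harassment": 0.7, "hate": 0.7}


def moderate(text: str, api_key: str) -> dict:
    """Call the Moderation API and return the first result object."""
    body = json.dumps({"model": "omni-moderation-latest", "input": text}).encode()
    req = urllib.request.Request(
        MODERATION_URL,
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)["results"][0]


def decide(result: dict) -> str:
    """Map one moderation result to 'block', 'warn', or 'log'."""
    categories = result["categories"]
    scores = result["category_scores"]
    if any(categories.get(c) for c in HARD_BLOCK):
        return "block"
    if any(scores.get(c, 0.0) > t for c, t in WARN_THRESHOLDS.items()):
        return "warn"
    return "log"


# Offline example with a hand-written result; no network call is made here.
example = {
    "categories": {"harassment": True, "sexual/minors": False},
    "category_scores": {"harassment": 0.82, "sexual/minors": 0.001},
}
print(decide(example))
```

Keeping `decide` separate from the HTTP call means the policy can be unit-tested and re-run against logged responses without touching the API.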

Frequently Asked Questions

Is the OpenAI Moderation API actually free, or are there hidden costs?

It is genuinely free in dollar terms — per https://platform.openai.com/docs/guides/moderation, omni-moderation-latest carries no per-call charge for either text or image inputs as of June 2026. The hidden cost is latency, not dollars. Each call adds roughly 150 to 400 milliseconds for text and 400 to 800 milliseconds for images. For an interactive chat app with streaming output, this is invisible because the user is already waiting on first-token latency. For a sub-second voice agent, that 400ms can break your round-trip budget and you may need to run the pre-screen in parallel with the model call and gate retention rather than blocking the response. There are no usage quotas published as of June 2026 — verify at the docs page before you architect around free unlimited calls.
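
The parallel pattern for latency-sensitive agents might look like the following sketch. The `screen`, `generate`, `deliver`, and `retain` callables are hypothetical stand-ins for your moderation call, model call, response channel, and storage layer; here they are simulated with short sleeps so the example runs offline.

```python
import asyncio


async def respond(user_text, screen, generate, deliver, retain) -> bool:
    """Deliver the reply without waiting on moderation; gate retention on it."""
    screen_task = asyncio.create_task(screen(user_text))  # starts in parallel
    reply = await generate(user_text)
    await deliver(reply)              # the user is not blocked by the screen
    flagged = await screen_task
    if not flagged:
        await retain(user_text, reply)  # only clean turns are stored or shared
    return flagged


# Simulated dependencies, for illustration only.
async def fake_screen(text):
    await asyncio.sleep(0.02)
    return "attack" in text


async def fake_generate(text):
    await asyncio.sleep(0.01)
    return f"echo: {text}"


delivered, retained = [], []


async def deliver(reply):
    delivered.append(reply)


async def retain(text, reply):
    retained.append((text, reply))


asyncio.run(respond("hello", fake_screen, fake_generate, deliver, retain))
asyncio.run(respond("attack plan", fake_screen, fake_generate, deliver, retain))
print(delivered, retained)
```

The trade-off is explicit: a flagged turn still gets a spoken reply, so this pattern only suits products where the in-model refusal layer is an acceptable first line and moderation mainly protects logs, sharing, and follow-up actions.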

Do I still need third-party moderation if I use Azure OpenAI Content Filter?

For the four standard categories (hate, sexual, violence, self-harm) plus jailbreak detection and protected material, no — the Azure Content Filter at https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter covers the core. For niche categories Azure does not classify — investment-advice solicitation in fintech, competitor-mention attempts in customer service, developmentally inappropriate content in children's education — you will need either a custom classifier or a vertical vendor like Hive, Spectrum Labs, or Microsoft Purview on top. The right pattern is Azure Content Filter as the foundation, your niche classifier in parallel, and a human review queue for anything either layer flags as borderline. Do not assume Azure alone covers domain-specific compliance categories.

What is the difference between the OpenAI Moderation API and the o-series model's built-in refusal?

They are different layers with different purposes. The **Moderation API** is a classifier that scores text or image inputs against 13 categories per https://platform.openai.com/docs/guides/moderation and returns scores plus boolean flags — you decide what to do with the result (block, warn, log). The **o-series refusal** is behavior trained into the o1, o3, and o3-mini model weights through reinforcement learning and deliberative alignment, documented in the system cards at https://openai.com/safety/. When o-series refuses, it returns a refusal phrase in the model output — you cannot turn it off, only handle it in application logic. The right architecture uses both: Moderation API as a pre-screen filter at the input boundary, in-model refusal as the last-line defense, and a post-screen Moderation call on assembled outputs.

Where does OpenAI publish its red-team evaluations and Preparedness Framework risk scores?

The Preparedness Framework itself is at https://openai.com/safety/preparedness/. The original 2023 version defined four tracked categories (cybersecurity, CBRN, persuasion, and model autonomy) and the risk-level rubric; the April 2025 revision changed the tracked set to biological and chemical, cybersecurity, and AI self-improvement, so check the current document rather than relying on the older list. The per-model risk classifications appear in the system cards published alongside each major model release at https://openai.com/safety/. For example, the GPT-4 system card at https://cdn.openai.com/papers/gpt-4-system-card.pdf set the template, and subsequent system cards for GPT-4o, GPT-5, o1, and o3 follow the same structure. The red-team program itself is described at https://openai.com/index/red-teaming-network/. As a downstream buyer you do not get raw red-team data, but the system card summarizes attack categories, success rates, and residual risk classifications, and that summary is your evidence base for compliance documentation.

How does the DALL-E 3 invisible watermark and C2PA metadata actually work?

DALL-E 3 embeds two provenance markers per https://help.openai.com/en/articles/8912793-c2pa-in-dall-e-3. First, **C2PA content credentials** are cryptographically signed metadata in the image file — a downstream verifier reading the metadata gets a signed record that the image was generated by DALL-E 3 with a specific prompt timestamp. Open-source verifiers from https://opensource.contentauthenticity.org/ read this metadata. Second, an **invisible watermark** is embedded in the image pixels themselves — robust to compression, cropping, and most format conversions, designed to be detectable even after the C2PA metadata is stripped. You cannot disable either in the public API as of June 2026. The watermark detector itself is not public, which limits independent verification — that is a real critique of the current design.

Can I get Zero Data Retention on the direct OpenAI API or only on Azure?

Both, but through different routes. **Azure OpenAI** offers ZDR to approved tenants (no abuse-monitoring log retention and no human review of your inputs or outputs) per https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/abuse-monitoring. **Direct OpenAI API** offers ZDR through the enterprise tier with an explicit contract addendum: you need to be on an enterprise contract, request ZDR enrollment, and have your use case approved. Per https://openai.com/enterprise-privacy/, OpenAI does not train on enterprise data by default regardless of ZDR; what ZDR removes is the 30-day abuse-monitoring log retention. If ZDR is a hard requirement and you are not at enterprise spend levels, Azure is the faster path. Verify the current ZDR terms in writing before you depend on them.

What categories does the OpenAI Moderation API actually cover in 2026?

Per https://platform.openai.com/docs/guides/moderation, omni-moderation-latest covers 13 categories: sexual, sexual/minors, harassment, harassment/threatening, hate, hate/threatening, self-harm, self-harm/intent, self-harm/instructions, violence, violence/graphic, illicit, and illicit/violent. Each is returned as both a boolean flag and a continuous score between 0 and 1 — you set the threshold for each per your application's risk tolerance. Note what is not covered: financial fraud-as-a-service, copyright infringement (Azure's protected material classifier covers this), medical misinformation as a distinct category, election-related disinformation, and most jurisdiction-specific categories (UK Online Safety Act priority illegal content, EU DSA terrorist content). For those, layer a regional or domain-specific classifier on top of the Moderation API as your floor.
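
As a small illustration of working with these categories, the sketch below builds a combined text-and-image request body and checks that a per-category threshold table leaves no category unconfigured. The multi-part `input` shape follows the Moderation API docs; the helper names and the threshold values are hypothetical.

```python
# The 13 categories listed above.
CATEGORIES = (
    "sexual", "sexual/minors", "harassment", "harassment/threatening",
    "hate", "hate/threatening", "self-harm", "self-harm/intent",
    "self-harm/instructions", "violence", "violence/graphic",
    "illicit", "illicit/violent",
)


def build_payload(text=None, image_url=None) -> dict:
    """Build a moderation request body for text, an image URL, or both."""
    parts = []
    if text:
        parts.append({"type": "text", "text": text})
    if image_url:
        parts.append({"type": "image_url", "image_url": {"url": image_url}})
    if not parts:
        raise ValueError("provide text, image_url, or both")
    return {"model": "omni-moderation-latest", "input": parts}


def missing_thresholds(thresholds: dict) -> list:
    """Return categories that have no configured threshold."""
    return sorted(set(CATEGORIES) - set(thresholds))


# Hypothetical, deliberately incomplete threshold table.
thresholds = {c: 0.5 for c in CATEGORIES if not c.startswith("illicit")}
payload = build_payload("is this ok?", "https://example.com/upload.png")
print(missing_thresholds(thresholds))
```

Running the coverage check at startup catches the case where a new category appears in the API response but your policy table silently ignores it.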

How long does an OpenAI safety review take for ChatGPT Enterprise procurement?

A typical ChatGPT Enterprise security review for a 500-to-2000-seat deployment runs 5 to 10 weeks end to end. Plan on 1 to 2 weeks to assemble the request package (data flow diagrams, intended use cases, data classification), 2 to 4 weeks for OpenAI's enterprise team to respond with the standard documentation pack (SOC 2 Type II report, enterprise BAA template, DPA, ZDR addendum, subprocessor list), 1 to 2 weeks of legal review, and 1 to 2 weeks of final negotiation. Faster is possible if your security questionnaire is standard (SIG Lite, CAIQ); slower if you have novel residency requirements or sector-specific (HIPAA, FedRAMP) needs. For HIPAA workflows specifically, get the enterprise BAA executed first, because that gates most of the rest of the conversation. Document everything by version per https://openai.com/enterprise-privacy/ because OpenAI updates the standard pack and the version you signed against is the version you can defend.

What is the most common OpenAI safety implementation mistake teams make at launch?

Three tie for the top spot. First, treating the Moderation API as a yes/no gate rather than a tunable risk budget — application teams ship with default thresholds, get hammered by false positives on legitimate user content, then disable the moderation entirely instead of tuning per-category. Second, forgetting to log the full category-score response from every moderation call — when you later need to retune thresholds, you have only boolean flags in your logs and you cannot do the analysis. Third, shipping with no incident-response runbook — the first time a harmful output slips through the post-screen, the team scrambles to figure out who triages, who responds to the user, and how to notify OpenAI under https://openai.com/policies/usage-policies/. All three are preventable with a 2-week pre-launch checklist. The teams that ship safely in 2026 build the runbook before they build the launch announcement.
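
The second mistake is cheap to avoid. A minimal sketch, assuming you write one JSON line per moderation call: store the full `category_scores` map, then replay the log to see how many historical calls a candidate threshold would have caught. The record layout and function names are illustrative.

```python
import json
import time


def log_record(request_id: str, result: dict, now=None) -> str:
    """Serialize one moderation result, scores included, as a JSON line."""
    record = {
        "ts": time.time() if now is None else now,
        "request_id": request_id,
        "flagged": result["flagged"],
        "categories": result["categories"],
        "category_scores": result["category_scores"],  # the part teams forget
    }
    return json.dumps(record, sort_keys=True)


def count_over_threshold(log_lines, category: str, threshold: float) -> int:
    """Count logged calls whose score for `category` meets the threshold."""
    return sum(
        1 for line in log_lines
        if json.loads(line)["category_scores"].get(category, 0.0) >= threshold
    )


# Hand-written results standing in for real API responses.
sample_results = [
    {"flagged": False, "categories": {"harassment": False},
     "category_scores": {"harassment": 0.35}},
    {"flagged": True, "categories": {"harassment": True},
     "category_scores": {"harassment": 0.82}},
    {"flagged": True, "categories": {"harassment": True},
     "category_scores": {"harassment": 0.61}},
]
lines = [log_record(f"req-{i}", r, now=0.0) for i, r in enumerate(sample_results)]
for candidate in (0.3, 0.5, 0.7):
    print(candidate, count_over_threshold(lines, "harassment", candidate))
```

With only boolean flags in the log, this replay is impossible. With the scores, retuning becomes a query rather than a re-screening job.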
