What each OpenAI safety layer actually does (and the marketing copy to ignore)
The **Moderation API** (omni-moderation-latest as of late 2024 and still current in June 2026) is a free classifier that scores text and image inputs against 13 harm categories (sexual, sexual/minors, harassment, harassment/threatening, hate, hate/threatening, self-harm, self-harm/intent, self-harm/instructions, violence, violence/graphic, illicit, and illicit/violent), per https://platform.openai.com/docs/guides/moderation. The trap most builders fall into is treating it as a yes/no gate. It is not. It returns an overall flag, a boolean flag per category, and a score between 0 and 1 per category, and you decide which thresholds map to which actions (block, warn, log-and-allow). Treating any flagged category as auto-block produces a brittle product. Treating the category scores as a tunable risk budget is the design pattern OpenAI's own docs recommend.
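A minimal sketch of that threshold-to-action mapping, in Python. The threshold numbers, the `THRESHOLDS` table, and the `decide` helper are all hypothetical and exist only for illustration; real values have to be tuned against your own traffic. The function takes a plain dict of category scores keyed by the category names listed above.

```python
from typing import Dict, Tuple

# Hypothetical per-category thresholds as (warn_at, block_at).
# These numbers are illustrative assumptions, not OpenAI recommendations.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "sexual/minors": (0.01, 0.05),     # very low tolerance
    "self-harm/intent": (0.20, 0.60),
    "violence": (0.50, 0.90),          # more tolerant, e.g. for fiction
}
DEFAULT_THRESHOLD: Tuple[float, float] = (0.40, 0.80)  # assumed fallback


def decide(category_scores: Dict[str, float]) -> str:
    """Map category scores to 'block', 'warn', or 'allow' (log-and-allow)."""
    action = "allow"
    for category, score in category_scores.items():
        warn_at, block_at = THRESHOLDS.get(category, DEFAULT_THRESHOLD)
        if score >= block_at:
            return "block"  # any single category over its block line wins
        if score >= warn_at:
            action = "warn"  # keep scanning in case another category blocks
    return action


if __name__ == "__main__":
    print(decide({"violence": 0.62, "hate": 0.03}))
```

In a real integration the scores would come from the `category_scores` field of a moderation result; converting that object into a dict keyed by category name is left to the caller here.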
The **o-series refusal layer** is not an API. It is behavior trained into the o1, o3, and o3-mini model weights through reinforcement learning from human feedback plus deliberative alignment training. The published o1 system card at https://openai.com/safety/ documents the risk categories the model is evaluated against (CBRN uplift, model autonomy, cybersecurity, persuasion), the evaluation methodology, and the residual risk. The marketing line is that o-series models are more aligned than GPT-4o. The honest read is that o-series models think longer about whether to refuse, which both reduces jailbreak success and increases over-refusal on benign medical and legal questions. There is no single dial that tunes that trade-off.
**Azure OpenAI Content Filter** at https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter is Microsoft's per-deployment overlay. Where the OpenAI Moderation API returns continuous scores, Content Filter assigns one of four severity tiers per category (safe, low, medium, high) and adds a separate jailbreak detector, a protected-material classifier, and a groundedness checker for retrieval-augmented apps. Every Azure OpenAI deployment runs Content Filter by default at the medium severity threshold across all harm categories, meaning content rated medium or high is filtered while safe and low pass through. You do not opt in, you opt out (and the opt-out for sensitive categories requires Microsoft approval). For most enterprise buyers, this is the right default. For consumer apps where over-refusal is a UX disaster, you will spend a week tuning per-category sliders.
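To make the threshold semantics concrete, here is a small Python sketch of how a per-category severity threshold behaves. It is not Azure's implementation. The `is_filtered` and `filtered_categories` helpers and the input shape (a dict of category name to `{"severity": ...}`) are assumptions, modeled loosely on the per-category annotations Azure attaches to responses.

```python
from typing import Dict, List

# Severity tiers in ascending order, as named in the Azure docs.
SEVERITY_ORDER = ["safe", "low", "medium", "high"]


def is_filtered(severity: str, threshold: str = "medium") -> bool:
    """True when content at `severity` meets or exceeds the threshold tier."""
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(threshold)


def filtered_categories(
    results: Dict[str, Dict[str, str]],
    thresholds: Dict[str, str],
    default: str = "medium",
) -> List[str]:
    """Return the sorted category names that would be filtered.

    `results` is an assumed shape: {"hate": {"severity": "low"}, ...}.
    `thresholds` holds hypothetical per-category overrides.
    """
    return sorted(
        category
        for category, detail in results.items()
        if is_filtered(detail["severity"], thresholds.get(category, default))
    )


if __name__ == "__main__":
    sample = {"hate": {"severity": "low"}, "violence": {"severity": "high"}}
    print(filtered_categories(sample, thresholds={}))           # default threshold
    print(filtered_categories(sample, thresholds={"hate": "low"}))  # stricter hate
```

Tightening one category (here `hate` to `low`) while leaving the rest at the default is the kind of per-category tuning the paragraph above describes.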
**Custom GPT actions safety** is the narrowest layer and the most misunderstood. When a builder ships a GPT in the GPT Store or for ChatGPT Enterprise, the action layer enforces a URL allow-list per https://platform.openai.com/docs/actions, plus a consent flow before sensitive data leaves the conversation. There is no content moderation here beyond the base ChatGPT moderation. The action-layer protection is about data exfiltration and third-party API abuse, not harmful outputs. If you are reviewing a vendor's Custom GPT, that distinction is the single most important thing to probe.
**Whisper** safety is mostly about what Whisper does not do. Per OpenAI's Whisper paper and the speech-to-text docs at https://platform.openai.com/docs/guides/speech-to-text, Whisper transcribes audio and returns text. It does not refuse to transcribe specific audio. The safety considerations are upstream (disallowed audio uses per the usage policy) and downstream (Whisper is known to occasionally hallucinate text on silent or noisy segments, which matters in medical and legal transcription). The mitigation is at the application layer: confidence scoring, manual review for medical-grade transcripts, and clear disclaimers in the UI.
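One way to approach the confidence-scoring step, sketched in Python. It assumes transcripts were requested in the `verbose_json` response format, whose segments carry `avg_logprob` and `no_speech_prob` fields. The cutoffs below are borrowed from the open-source Whisper decoder defaults (`no_speech_threshold=0.6`, `logprob_threshold=-1.0`) as a starting assumption, and the `segments_needing_review` helper is hypothetical.

```python
from typing import Any, Dict, Iterable, List

# Starting cutoffs taken from open-source Whisper's decoding defaults.
# Treat them as assumptions to tune, not validated clinical thresholds.
NO_SPEECH_MAX = 0.6
AVG_LOGPROB_MIN = -1.0


def segments_needing_review(
    segments: Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Return segments that look like silence or low-confidence decoding."""
    flagged = []
    for seg in segments:
        likely_silence = seg.get("no_speech_prob", 0.0) > NO_SPEECH_MAX
        low_confidence = seg.get("avg_logprob", 0.0) < AVG_LOGPROB_MIN
        # OR is deliberately conservative: either signal alone triggers review.
        if likely_silence or low_confidence:
            flagged.append(seg)
    return flagged


if __name__ == "__main__":
    sample = [
        {"id": 0, "text": "Patient reports chest pain.",
         "avg_logprob": -0.21, "no_speech_prob": 0.02},
        {"id": 1, "text": "Thank you for watching.",
         "avg_logprob": -1.40, "no_speech_prob": 0.83},
    ]
    for seg in segments_needing_review(sample):
        print(seg["id"], seg["text"])
```

Flagged segments would then be routed to a human reviewer rather than dropped, since a low score does not prove the text is wrong.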
**DALL-E 3** ships two notable safety features beyond model-level refusal: C2PA content credentials per https://help.openai.com/en/articles/8912793-c2pa-in-dall-e-3, and an invisible watermark embedded in the image pixels. C2PA is a signed metadata standard for provenance. When a verifier checks a DALL-E 3 image, it gets a cryptographic record that the image was generated by DALL-E 3 and not, for example, captured by a camera. Because that record lives in metadata, it can be lost when an image is screenshotted or re-encoded by a platform that strips metadata, so a missing credential proves nothing. The invisible watermark is robust to compression and cropping. There is also a prompt-rewriting step that softens requests likely to violate policy, which is why your literal prompt often returns an image that interprets it loosely. The rewriting logic is opaque and frustrating, although the Images API does return the rewritten text in a `revised_prompt` field, and it reduces refusal rates on borderline inputs.
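Since the prompt-rewriting step changes what actually gets rendered, it can help to log how far each prompt drifted. For DALL-E 3, the Images API response includes a `revised_prompt` string alongside each image. The Python sketch below compares it to the original with a crude character-level similarity from the standard library. The `prompt_drift` and `log_rewrite` helpers and the 0.5 alert level are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, Optional


def prompt_drift(original: str, revised: str) -> float:
    """Return 0.0 for identical prompts, up to 1.0 for no shared characters."""
    similarity = SequenceMatcher(None, original.lower(), revised.lower()).ratio()
    return 1.0 - similarity


def log_rewrite(
    original: str,
    revised: Optional[str],
    drift_alert: float = 0.5,  # hypothetical alert level
) -> Dict[str, Any]:
    """Build a log record; a missing revised prompt counts as no rewrite."""
    effective = revised if revised else original
    drift = prompt_drift(original, effective)
    return {
        "original": original,
        "revised": effective,
        "drift": round(drift, 3),
        "heavily_rewritten": drift >= drift_alert,
    }


if __name__ == "__main__":
    record = log_rewrite(
        "a cat",
        "A detailed oil painting of a fluffy orange cat on a windowsill",
    )
    print(record["drift"], record["heavily_rewritten"])
```

Character-level similarity is a blunt instrument, because a longer rewrite will score as high drift even when it preserves the intent, so treat the flag as a prompt to inspect the record, not as a verdict.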