What each safety stack actually does (and the marketing copy to ignore)
**OpenAI GPT-5** is the model that hardened the refusal pipeline the most between generations. The published stack at https://openai.com/safety/ describes a layered approach: RLHF on human preference data, rule-based reward models trained against the OpenAI usage policies at https://openai.com/policies/usage-policies/, and a deliberative-alignment phase where the model is taught to reason explicitly about whether a request violates policy before answering. The result is a model that refuses fewer benign requests than GPT-4 while holding the line on actually-harmful ones. The Moderation API at https://platform.openai.com/docs/guides/moderation runs separately and is free — and most teams underuse it.
**Anthropic Claude Opus 4.7** is the Constitutional AI flagship. Per the system card published at https://www.anthropic.com/news/claude-4-7-system-card and the methodology at https://www.anthropic.com/responsible-scaling-policy, the training pipeline uses a written constitution (a set of principles drawn from sources like the UN Declaration of Human Rights and Anthropic's acceptable-use policy) to generate AI feedback (RLAIF) on top of human RLHF. The practical effect is a model that tends to add caveats and reasoning rather than refuse outright — Claude is famously the model most likely to explain why it cannot help and offer a constrained alternative, rather than returning a flat "I can't do that."
**Google Gemini 2.5 Pro** is the most explicitly tunable safety stack of the three. The Gemini API at https://ai.google.dev/gemini-api/docs/safety-settings exposes four safety categories — harassment, hate speech, sexually explicit content, and dangerous content — each with five threshold levels from BLOCK_NONE to BLOCK_LOW_AND_ABOVE. Vertex AI layers a second filter pass per https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/configure-safety-attributes, and Google's SynthID watermarking at https://deepmind.google/technologies/synthid/ marks generated images, audio, video, and (in supported regions) text with imperceptible signals that can be detected by Google's classifier.
Where the marketing copy diverges from reality: all three vendors describe their models as "aligned" and "safe" in ways that suggest the underlying training closed the problem. It did not. Jailbreaks work, hallucinations happen, and the published benchmarks at sites like https://jailbreakbench.github.io/ and https://huggingface.co/spaces/vectara/leaderboard show the gap between vendor claims and red-team reality. The right mental model is layered defense — model training plus moderation API plus your own input/output filters — not "the model is safe, ship it."
Where the marketing copy is fair: all three vendors publish meaningful system cards and red-team evaluations. OpenAI publishes the Preparedness Framework results at https://openai.com/safety/preparedness/. Anthropic publishes ASL-level evaluations under the Responsible Scaling Policy. Google publishes Frontier Safety Framework assessments. None of these are marketing fluff — they are real documents that your security team should read before signing a contract. Skip the blog posts. Read the system cards.
The opinionated read: GPT-5 has the most polished refusal calibration out of the box, Claude Opus 4.7 has the lowest hallucination rate on the Vectara leaderboard most months, and Gemini 2.5 Pro is the only one of the three that lets a developer explicitly dial safety thresholds per category at the API level. Which matters most depends on your use case — and that is what the rest of this guide is about.