What the Gemini safety stack actually does (and the marketing copy you should ignore)
The **Gemini API safety filters** are the front door. Every call to `generateContent` runs the prompt and the response through classifiers for four adjustable harm categories: Harassment, Hate Speech, Sexually Explicit, and Dangerous Content. Per https://ai.google.dev/gemini-api/docs/safety-settings, you can pass a `safetySettings` array that sets a blocking threshold per category: `BLOCK_NONE`, `BLOCK_ONLY_HIGH`, `BLOCK_MEDIUM_AND_ABOVE`, or `BLOCK_LOW_AND_ABOVE`. Defaults vary by model and surface; production teams should set them explicitly rather than rely on whatever the SDK picks. The filters return more than a blocked-or-allowed verdict: each response carries per-category safety ratings with a probability level, which is the part most integrations ignore and then regret.
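Setting every threshold explicitly and keeping the per-category ratings can be sketched as follows. This is a minimal illustration, not the SDK's own interface: it builds the JSON body for a `generateContent` REST call and reads ratings back out of a parsed response, without making any network call. The helper names `build_request` and `summarize_ratings` are hypothetical, and the field names (`contents`, `parts`, `safetySettings`, `candidates`, `safetyRatings`, `probability`) are assumed from the REST format described in the safety-settings documentation.

```python
import json

# The four adjustable categories and the thresholds named above, written as
# the enum strings the REST request body is assumed to use.
CATEGORIES = [
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
]
THRESHOLDS = {
    "BLOCK_NONE",
    "BLOCK_ONLY_HIGH",
    "BLOCK_MEDIUM_AND_ABOVE",
    "BLOCK_LOW_AND_ABOVE",
}


def build_request(prompt, default="BLOCK_MEDIUM_AND_ABOVE", overrides=None):
    """Return a request body that sets a threshold for every category.

    Hypothetical helper: nothing is left to SDK or model defaults.
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    settings = []
    for category in CATEGORIES:
        threshold = overrides.get(category, default)
        if threshold not in THRESHOLDS:
            raise ValueError(f"unknown threshold: {threshold}")
        settings.append({"category": category, "threshold": threshold})
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "safetySettings": settings,
    }


def summarize_ratings(response):
    """Collect category -> probability level from a parsed response dict.

    Hypothetical helper; if several candidates rate the same category,
    the last one seen wins.
    """
    ratings = {}
    for candidate in response.get("candidates", []):
        for rating in candidate.get("safetyRatings", []):
            ratings[rating["category"]] = rating["probability"]
    return ratings


if __name__ == "__main__":
    body = build_request(
        "Summarize this support ticket.",
        overrides={"HARM_CATEGORY_HARASSMENT": "BLOCK_ONLY_HIGH"},
    )
    print(json.dumps(body, indent=2))
```

Logging the output of `summarize_ratings` alongside each request is one way to see how close traffic sits to a threshold before tightening or loosening it.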
**Vertex AI Safety** is the enterprise layer. Per https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/configure-safety-attributes, Vertex exposes the same four harm categories plus a broader set of safety attributes (toxicity, violence, profanity, derogatory, sexual, insult). It also adds audit logging, regional controls, and integration with Google Cloud's IAM and VPC Service Controls. This is what you buy when your CISO needs to see who turned `BLOCK_NONE` on, when, and for which project. The native API filters are fine for prototypes; Vertex is what you ship behind real customer traffic.
**The Google Responsible AI Toolkit** at https://ai.google/responsibility/responsible-ai-toolkit/ is a bundle of tools and guidance, much of it open source: the Learning Interpretability Tool (LIT), the AI Test Kitchen, model cards, and the Responsible Generative AI Toolkit. It is not a runtime safety filter. It is the eval and debugging surface you use to figure out which thresholds to set in the first place. Most teams skip this and ship blind; the teams that do not skip it have measurable false-positive rates and a paper trail.
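What a measurable false-positive rate looks like can be sketched in a few lines. The example below is an illustration under stated assumptions, not part of any Google toolkit: it assumes each eval prompt has been run once, that you recorded the probability level the filter assigned, and that a human labeled whether the prompt was actually harmful. The `EVAL` data is invented for the example, and the helper names are hypothetical.

```python
# Probability levels in increasing order. Each threshold is assumed to block
# at its named level and everything above it.
LEVELS = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]
BLOCKS_FROM = {
    "BLOCK_LOW_AND_ABOVE": "LOW",
    "BLOCK_MEDIUM_AND_ABOVE": "MEDIUM",
    "BLOCK_ONLY_HIGH": "HIGH",
}


def is_blocked(rating, threshold):
    """Return True if a rating at this level would be blocked by the threshold."""
    if threshold == "BLOCK_NONE":
        return False
    return LEVELS.index(rating) >= LEVELS.index(BLOCKS_FROM[threshold])


def false_positive_rate(samples, threshold):
    """Fraction of benign samples the threshold would block.

    samples: list of (rating, is_actually_harmful) pairs.
    """
    benign = [rating for rating, harmful in samples if not harmful]
    if not benign:
        return 0.0
    blocked = sum(1 for rating in benign if is_blocked(rating, threshold))
    return blocked / len(benign)


# Invented eval set: (filter rating, human label "is actually harmful").
EVAL = [
    ("NEGLIGIBLE", False),
    ("LOW", False),
    ("MEDIUM", False),
    ("NEGLIGIBLE", False),
    ("HIGH", True),
    ("MEDIUM", True),
]

if __name__ == "__main__":
    for threshold in ["BLOCK_LOW_AND_ABOVE", "BLOCK_MEDIUM_AND_ABOVE",
                      "BLOCK_ONLY_HIGH", "BLOCK_NONE"]:
        print(threshold, false_positive_rate(EVAL, threshold))
```

On this toy data, two of the four benign prompts are blocked at `BLOCK_LOW_AND_ABOVE` and one at `BLOCK_MEDIUM_AND_ABOVE`. A real eval set would also track the miss rate on harmful prompts, since loosening a threshold trades one error for the other.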
**ShieldGemma** is the open-weight piece. Per https://huggingface.co/google/shieldgemma-2b, ShieldGemma is a family of safety content moderation models trained by Google to classify text against the same four core harm policies the Gemini API uses. The 2B, 9B, and 27B parameter sizes let you trade compute for accuracy. Crucially, ShieldGemma is Gemma-licensed open weights — you can run it on your own GPUs, audit it, fine-tune it for your domain, and never ship a customer prompt to Google. This matters in regulated industries where prompt content itself is sensitive.
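Because you run ShieldGemma yourself, you also choose the decision cutoff. The sketch below assumes the scoring pattern shown in the model card's usage example, where the model is asked whether the text violates a policy and the logits for the "Yes" and "No" tokens at the final position are compared with a softmax. It covers only that last arithmetic step: the logits are passed in as plain numbers, the values in the usage lines are made up, and `violation_score` and `classify` are hypothetical helper names.

```python
import math


def violation_score(yes_logit, no_logit):
    """Two-way softmax: share of probability on 'Yes' versus 'No'."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def classify(yes_logit, no_logit, cutoff=0.5):
    """Apply a cutoff you choose yourself; 0.5 is only a starting point."""
    score = violation_score(yes_logit, no_logit)
    return {"score": score, "violation": score >= cutoff}


if __name__ == "__main__":
    # Made-up logits standing in for real model output.
    print(classify(yes_logit=2.0, no_logit=0.0))
    print(classify(yes_logit=-1.0, no_logit=1.0, cutoff=0.3))
```

The cutoff plays the same role as the hosted API's thresholds, except that it is a continuous number you can tune against your own labeled data.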
**SynthID** at https://deepmind.google/technologies/synthid/ is a different category of safety entirely. It is not a filter; it is a watermark. Google embeds imperceptible signals in generated images (Imagen), audio (Lyria), video (Veo), and increasingly text from Gemini. A separate detector model identifies the watermark with high reliability. The 2026 use case is provenance: when a deepfake or a generated essay shows up in the wild, SynthID gives you evidence of whether it came from a Google model. That evidence is statistical rather than absolute, and a missing watermark does not show that content is human-made, since other generators do not embed it. It is the only one of these layers that protects downstream consumers rather than the generating application.
**Sec-Gemini-V1** is the model that put adversarial security evaluation into the Gemini model card explicitly. The published card and accompanying technical documentation are what serious security buyers ask for before approving any AI vendor. Combined with Google's **AI Principles** at https://ai.google/principles/ and the **restricted use cases policy**, this is the governance surface — what Google will not let you do with the model at all, regardless of which filters you turn off. Trying to build a weapons-design app on Gemini will hit policy enforcement, not just the Dangerous Content threshold.