The 2026 jailbreak taxonomy: six categories that actually matter
The first category, **role-play personas**, is the oldest and the most over-discussed. DAN (Do Anything Now), STAN (Strive To Avoid Norms), and their dozens of variants ask the model to inhabit a fictional character with no safety constraints. Frontier models from OpenAI, Anthropic, and Google have largely closed the obvious DAN-shaped holes through RLHF, but every new model release ships with a fresh crop of community-discovered persona attacks within hours. The JailbreakBench leaderboard at https://jailbreakbench.github.io/ tracks how quickly these get patched and re-discovered.
The second category, **indirect prompt injection**, is now the dominant breach pattern in production LLM applications. The classic example: an attacker plants instructions inside a document, web page, or tool output that the LLM later ingests. The model dutifully follows the injected instructions because it cannot reliably distinguish trusted prompt context from untrusted retrieved content. This is the category that breaks RAG pipelines, autonomous agents, and any system that pipes web search results back into the model. Most of the spend on Lakera Guard and Rebuff in 2026 is justified by this single attack class — see our prompt injection defense playbook for a longer treatment.
The third category, **multi-turn elicitation**, exploits the fact that a single-turn refusal does not carry across a conversation. The attacker softens up the model over five or ten benign turns, gradually shifting context, then asks for the harmful payload on turn eleven. The MASTERKEY paper and the broader research on multi-turn red-teaming show that even well-aligned models leak under sustained pressure. Defenses that only inspect the current message in isolation, without conversation-level state, miss this entire category.
The fourth category, **encoded payloads**, wraps the harmful request in base64, ROT13, leetspeak, hex, or a constructed cipher the attacker also explains to the model. The model decodes the payload as part of being helpful, then acts on it. Encoded-payload jailbreaks reliably bypass naive keyword and regex filters and partially bypass classifier-based filters trained on plaintext attack corpora. The 2024 wave of base64-wrapped attacks is well-documented in HarmBench at https://www.harmbench.org/ and remains effective against weaker classifiers in 2026.
The fifth category, **gradient-based attacks**, came out of the GCG (Greedy Coordinate Gradient) paper at https://arxiv.org/abs/2307.15043 published by Zou et al. in 2023. GCG and its successors compute adversarial suffixes that, when appended to a request, make the model comply with otherwise-refused queries. The headline result was that these suffixes transfer across models — an attack discovered on open-weight Llama can break closed-weight GPT-style models. By 2026, transferability is weaker against the frontier but still meaningful against mid-tier and self-hosted models. Static input filtering does not catch these because the suffix looks like noise.
The sixth category, **many-shot jailbreaking**, was published by Anthropic in April 2024 at https://www.anthropic.com/research/many-shot-jailbreaking. It exploits long context windows: an attacker packs 256 or 512 example exchanges into the prompt where the assistant complies with harmful requests, then asks the real question at the end. The in-context learning effect overrides the refusal training. The longer the context window, the more effective the attack — which means as Claude, GPT-5, and Gemini push toward million-token contexts, this category becomes more relevant, not less.