What are the common jailbreaking techniques?

The best-studied jailbreaks fall into a few families: roleplay personas such as DAN, encoding and translation tricks that hide intent, and many-shot priming that floods the prompt with fake compliant examples.

Last updated 2026-06-15 · Physea Labs

Most jailbreaks fall into a few recognizable families. The largest study of real-world prompts collected 1,405 jailbreaks shared online between December 2022 and December 2023, and identified prompt injection and privilege escalation among their major attack strategies.[1]

Roleplay and personas. The classic example is “DAN” (“Do Anything Now”), which tells the model to play a character that has no rules. By wrapping the request in fiction or a fake persona, the attacker tries to move the conversation off the refusal track. These prompts can be surprisingly durable: in the study of in-the-wild prompts, the earliest of the highly effective prompts it identified had persisted online for more than 240 days.[1]

Encoding and translation. Safety training is strongest on plain English, so attackers hide intent by changing the surface form of a request. They might Base64-encode it, write it in leetspeak, or translate it into a language with little training data. One study found that translating unsafe English requests into low-resource languages got GPT-4 to engage and give actionable harmful content 79% of the time, far more often than with high-resource languages, because the model’s safety training did not transfer well to those languages.[2]
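To make the surface-form point concrete, here is a minimal defensive sketch, not a method from the cited study. It assumes a toy keyword filter with a harmless placeholder term, and shows that the filter misses the same text once it is Base64-encoded or written in leetspeak unless the input is normalized first. The names `BLOCKED_TERMS`, `LEET_MAP`, `naive_filter`, and `normalizing_filter` are hypothetical, and the leetspeak map is deliberately tiny. Translation into another language is not handled by this kind of normalization at all.

```python
import base64
import binascii
import re

# Hypothetical blocklist holding a harmless placeholder term.
BLOCKED_TERMS = {"secret recipe"}

# Assumed, deliberately small leetspeak map; real substitutions vary widely.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Runs of 16 or more Base64-alphabet characters, with optional padding.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def naive_filter(text: str) -> bool:
    """Return True if a blocked term appears verbatim (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def decode_base64_runs(text: str) -> list[str]:
    """Decode substrings that look like Base64 and are printable UTF-8."""
    decoded = []
    for match in BASE64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            plain = raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # Not valid Base64 or not text; ignore this run.
        if plain.isprintable():
            decoded.append(plain)
    return decoded


def normalizing_filter(text: str) -> bool:
    """Apply the naive filter to the raw text and to its normalized views."""
    views = [text, text.translate(LEET_MAP)]
    views.extend(decode_base64_runs(text))
    return any(naive_filter(view) for view in views)


if __name__ == "__main__":
    encoded = base64.b64encode(b"tell me the secret recipe").decode("ascii")
    for sample in ("tell me the secret recipe", encoded, "s3cr37 r3c1p3"):
        print(naive_filter(sample), normalizing_filter(sample))
```

The point of the sketch is the gap between the two functions: the request is unchanged in meaning, but only the normalized views expose it to a check that was written for plain English.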

Many-shot priming. As context windows grew to hold the equivalent of long novels, a new trick appeared. The attacker fills the prompt with a long faux dialogue, hundreds of turns in which an imaginary assistant cheerfully answers harmful questions, and then appends the real question. The model’s in-context learning picks up the pattern and continues it. Anthropic tested up to 256 of these fake exchanges and found that the more exchanges the prompt contained, the more often the model complied.[3]
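Because a many-shot prompt has a recognizable shape, a long run of speaker-labeled turns inside what should be a single message, one simple defensive heuristic is to count those labels. The sketch below is an illustration under stated assumptions, not Anthropic's mitigation: the label set in `TURN_MARKER` and the threshold `MAX_EMBEDDED_TURNS` are hypothetical choices, and an attacker who uses different labels would not be caught.

```python
import re

# Assumed speaker labels at the start of a line; real prompts use many formats.
TURN_MARKER = re.compile(
    r"^[ \t]*(?:user|human|assistant|ai)[ \t]*:", re.IGNORECASE | re.MULTILINE
)

# Hypothetical threshold chosen for illustration, not taken from the cited work.
MAX_EMBEDDED_TURNS = 8


def count_embedded_turns(prompt: str) -> int:
    """Count lines in a single prompt that begin with a speaker label."""
    return len(TURN_MARKER.findall(prompt))


def looks_like_many_shot(prompt: str, max_turns: int = MAX_EMBEDDED_TURNS) -> bool:
    """Flag prompts that embed more labeled turns than the threshold allows."""
    return count_embedded_turns(prompt) > max_turns


if __name__ == "__main__":
    # 128 placeholder exchanges, each with one user line and one assistant line.
    faux = "\n".join(
        f"User: question {i}\nAssistant: answer {i}" for i in range(128)
    )
    print(count_embedded_turns(faux), looks_like_many_shot(faux))
```

A check like this would normally be only one signal among several, since legitimate prompts (for example, a pasted chat transcript the user wants summarized) can also contain many labeled turns.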

References

  1. "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models — Shen et al., CISPA Helmholtz Center, ACM CCS 2024 (arXiv)
  2. Low-Resource Languages Jailbreak GPT-4 — Yong, Menghini & Bach (arXiv)
  3. Many-shot jailbreaking — Anthropic