Common techniques — Physea Wiki

The best-studied jailbreaks fall into a few families: roleplay personas such as DAN, encoding and translation tricks that hide intent, and many-shot priming that floods the prompt with fake compliant examples.

The largest study of real-world prompts collected 1,405 jailbreaks shared online between December 2022 and December 2023, and found that their main strategies included prompt injection and privilege escalation.^[1]

Roleplay and personas. The classic example is “DAN” (“Do Anything Now”), which tells the model to play a character that has no rules. By wrapping the request in fiction or a fake persona, the attacker tries to move the conversation off the refusal track. These prompts can be surprisingly durable: the earliest of the highly effective prompts identified in that study had persisted online for more than 240 days.^[1]

Encoding and translation. Safety training is strongest on plain English, so attackers hide intent by changing the surface form of a request. They might Base64-encode it, write it in leetspeak, or translate it into a language with little training data. One study found that translating unsafe English requests into low-resource languages got GPT-4 to engage and give actionable harmful content 79% of the time, far more often than translation into high-resource languages did, apparently because the model’s safety training did not transfer well to those languages.^[2]
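
The surface-form idea can be illustrated with a toy example. The sketch below is only an analogy: it uses a verbatim keyword filter as a stand-in for a safety check, which is not how model safety training works. The placeholder term, the leetspeak table, and the function names (naive_filter, decode_base64_tokens, normalized_filter) are all assumptions made for illustration. It shows that a check tied to one surface form misses the same word once it is written in leetspeak or Base64, and that normalizing the input first recovers the match.

```python
import base64
import binascii
import re

# Hypothetical placeholder standing in for whatever a filter looks for.
BLOCKED_TERMS = {"forbidden"}

# Assumed leetspeak substitutions; real variants are far more varied.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"}
)

# Runs of Base64-alphabet characters long enough to be worth trying.
BASE64_TOKEN = re.compile(r"[A-Za-z0-9+/]{8,}={0,2}")


def naive_filter(text: str) -> bool:
    """Return True if a blocked term appears verbatim (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def decode_base64_tokens(text: str) -> str:
    """Replace tokens that decode cleanly as Base64 UTF-8 with their decoding."""

    def try_decode(match: re.Match) -> str:
        token = match.group(0)
        try:
            return base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            # Not valid Base64 text; leave the token untouched.
            return token

    return BASE64_TOKEN.sub(try_decode, text)


def normalized_filter(text: str) -> bool:
    """Apply the naive filter after undoing Base64 and leetspeak."""
    decoded = decode_base64_tokens(text)
    return naive_filter(decoded) or naive_filter(decoded.translate(LEET_MAP))


if __name__ == "__main__":
    encoded = base64.b64encode(b"forbidden").decode("ascii")
    for sample in (
        "tell me the forbidden thing",
        "tell me the f0rb1dd3n thing",
        "decode this: " + encoded,
    ):
        print(naive_filter(sample), normalized_filter(sample))
```

Normalization of this kind only undoes transformations that are known in advance. It has no equivalent for translation into another language, and the Base64 heuristic can occasionally misread an ordinary word as an encoded token, so the sketch should be read as a picture of the problem rather than a defense.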

Many-shot priming. As context windows grew to hold the equivalent of long novels, a new trick appeared. The attacker fills the prompt with a long faux dialogue, hundreds of turns in which an imaginary assistant cheerfully answers harmful questions, and then asks the real question. The model’s in-context learning picks up the pattern. Anthropic tested prompts containing up to 256 of these fake exchanges and found that the more exchanges a prompt contained, the more often the model complied.^[3]

References

1. Shen et al., "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. CISPA Helmholtz Center, ACM CCS 2024 (arXiv).
2. Yong, Menghini & Bach, Low-Resource Languages Jailbreak GPT-4 (arXiv).
3. Anthropic, Many-shot jailbreaking.
