Why it stays hard — Physea Wiki

Safety is learned behavior layered on a model that wants to be helpful, so a clever enough prompt can pull it off course. Patches tend to raise the cost of an attack rather than end it.

The root problem is that a model’s safety is trained behavior, not a hard switch. The base model is shaped to be helpful and to continue whatever pattern it sees; safety training adds a learned tendency to refuse certain things on top. A jailbreak is an attempt to find a context where the “be helpful, follow the pattern” pull beats the “refuse this” pull. Because both come from the same statistical process, there is no clean wall between them to defend.

This is why patches so often only buy time. When Anthropic tried to harden models against many-shot jailbreaking by fine-tuning, the technique still worked: fine-tuning increased the number of fake examples an attacker had to include, but the attack's success still grew with the number of examples in the same pattern.^[1] Raising the cost of an attack is useful, but it is not the same as closing it.
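One way to picture "raising the cost without closing the attack" is a toy model. The sketch below assumes that a model's resistance to the attack decays as a power law in the number of in-context examples, and that hardening raises the starting level without changing the decay rate. The function name, the power-law form, and every number are hypothetical illustrations, not figures from the cited work.

```python
def shots_needed(intercept: float, exponent: float, threshold: float) -> float:
    """Number of examples n at which intercept * n**(-exponent) drops to threshold.

    Toy assumption: resistance to the attack follows a power law in n.
    """
    return (intercept / threshold) ** (1.0 / exponent)


# Hypothetical values: hardening doubles the intercept, the exponent is unchanged.
baseline = shots_needed(intercept=2.0, exponent=0.5, threshold=0.5)
hardened = shots_needed(intercept=4.0, exponent=0.5, threshold=0.5)

print(f"baseline: {baseline:.0f} examples, hardened: {hardened:.0f} examples")
```

Under these made-up numbers the hardened model needs more examples before it gives way, but the count stays finite, so an attacker with enough context can still get there.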

The attack surface is also enormous and cheap to explore. Jailbreaks are just text, so anyone can try thousands of variations, and working ones spread fast on forums and can linger there: in one study, the earliest of the highly effective prompts had persisted online for over 240 days.^[2] Defenders have to anticipate every phrasing, encoding, language, and roleplay framing, while an attacker needs only one that slips through. That asymmetry, combined with new capabilities (longer context, new languages, tool use) that keep opening fresh angles, is why the field treats jailbreaking as an ongoing problem to manage rather than a bug to be fixed once.
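The asymmetry can be made concrete with a back-of-the-envelope calculation. The sketch below assumes each attempt succeeds independently with the same small probability, which is a simplification, and the per-attempt rate of 0.001 is a hypothetical value chosen only for illustration.

```python
def p_any_success(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


# Hypothetical per-attempt success rate; real rates vary by model and prompt.
P_SINGLE = 0.001

for n in (1, 100, 1_000, 10_000):
    print(f"{n:>6} attempts -> {p_any_success(P_SINGLE, n):.3f}")
```

Even when almost every individual attempt fails, the chance that at least one gets through climbs quickly as the number of cheap attempts grows, which is why a defense that blocks most prompts can still lose to volume.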

References

1. Many-shot jailbreaking. Anthropic.
2. "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. Shen et al., CISPA Helmholtz Center, ACM CCS 2024 (arXiv).
