
Jailbreaks

How do you defend against jailbreaks?

There is no single fix, so defense is layered: OWASP's controls plus a guard layer of trained classifiers that screen inputs and outputs around the model. The aim is to make jailbreaks costly and low-impact, not impossible.

Last updated 2026-06-15 · Physea Labs

Because no prompt-level fix is complete, defense is layered. OWASP’s checklist for this risk covers giving the model clear instructions about its role and limits, filtering inputs and outputs for disallowed content, restricting the model’s access to the minimum it needs, putting a human in the loop for sensitive actions, clearly marking untrusted content, and regular adversarial testing.[1] Most of these do not try to make the model unjailbreakable; they limit the damage a successful jailbreak can cause.
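Three of these controls (marking untrusted content, least privilege, and human sign-off) can be sketched in a few lines. This is a minimal illustration, not OWASP's reference code: the tool names, the `<untrusted>` delimiter, and the functions `wrap_untrusted` and `authorize` are all hypothetical.

```python
# Hypothetical tool names for illustration; not taken from OWASP or any product.
ALLOWED_TOOLS = {"search_docs", "send_email"}  # least privilege: all else is denied
NEEDS_HUMAN = {"send_email"}                   # sensitive actions need sign-off


def wrap_untrusted(text: str) -> str:
    """Mark external content so the prompt separates it from instructions."""
    # Escape '<' so the content cannot forge or close the delimiter itself.
    cleaned = text.replace("<", "&lt;")
    return f"<untrusted>\n{cleaned}\n</untrusted>"


def authorize(tool: str, human_approved: bool = False) -> str:
    """Return 'allow', 'deny', or 'needs_approval' for a requested tool call."""
    if tool not in ALLOWED_TOOLS:
        return "deny"
    if tool in NEEDS_HUMAN and not human_approved:
        return "needs_approval"
    return "allow"
```

The point of the design is that both checks run outside the model: even if a jailbreak persuades the model to request a tool, the request still has to pass `authorize`.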

A second line of defense wraps a guard layer around the model. Instead of trusting the model to refuse on its own, separate classifiers screen the incoming prompt and the outgoing answer. Anthropic’s Constitutional Classifiers are one published example: in a red-teaming exercise where 183 people spent more than 3,000 hours trying to break a guarded model, no one found a universal jailbreak that answered all the forbidden test queries. In automated testing the guarded system refused over 95% of jailbreak attempts. The cost was real but modest: the refusal rate on normal traffic rose by 0.38 percentage points, a change that was not statistically significant, and compute went up about 23.7%.[2]
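The control flow of such a guard layer can be sketched as follows. This is a toy under stated assumptions, not Anthropic's implementation: the keyword checks stand in for trained classifiers, and `guarded_generate`, `input_flagged`, `output_flagged`, and `BLOCKED_TERMS` are hypothetical names.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."

# Toy stand-ins (assumptions for illustration). A real guard layer would call
# trained classifier models here, not keyword checks.
BLOCKED_TERMS = ("nerve agent", "build a bomb")


def input_flagged(prompt: str) -> bool:
    """Placeholder input classifier: flags prompts containing blocked terms."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def output_flagged(prompt: str, answer: str) -> bool:
    """Placeholder output classifier: inspects the answer, whatever the prompt was."""
    lowered = answer.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Screen the prompt, call the model, then screen the answer."""
    if input_flagged(prompt):
        return REFUSAL  # blocked before the model runs
    answer = model(prompt)
    if output_flagged(prompt, answer):
        return REFUSAL  # the model complied, but the answer is withheld
    return answer
```

Screening the answer separately matters because a disguised prompt can slip past the input check; the output check catches the case where the model was fooled but the response is still recognizably harmful.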

The honest framing is that these measures raise the price of an attack rather than ending it. Even a prompt-screening defense that cut one many-shot attack’s success rate from 61% to 2% left a residual gap, and attackers adapt.[3] Treat jailbreak defense as ongoing red-teaming plus damage limitation, not a one-time patch.
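A back-of-the-envelope calculation shows why a small residual rate still matters. The sketch below assumes each attempt succeeds independently with the same probability, which is a simplification introduced here and not a claim from the cited study; real attackers adapt between attempts, so the true figure could be higher or lower. The function name `p_any_success` is hypothetical.

```python
def p_any_success(per_attempt: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries succeeds.

    Assumes independent attempts with a fixed success rate (a simplification).
    """
    if not 0.0 <= per_attempt <= 1.0:
        raise ValueError("per_attempt must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - per_attempt) ** attempts


if __name__ == "__main__":
    # 0.02 is the residual success rate quoted above; attempt counts are arbitrary.
    for n in (1, 10, 50, 200):
        print(n, round(p_any_success(0.02, n), 3))
```

Under that assumption, a 2% per-attempt rate gives roughly a 0.64 chance of at least one success after 50 attempts, which is why rate limiting, monitoring, and damage limitation still matter behind a strong filter.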

Standards & defenses

  • OWASP Gen AI Security Project

    The community standard that ranks prompt injection (and jailbreaking as a form of it) as LLM01 and lists the baseline mitigation controls.

  • Anthropic Constitutional Classifiers

    A research guard layer: trained input and output classifiers that screen prompts and answers around the model. Published study, not a standalone product.

References

  1. LLM01:2025 Prompt Injection — OWASP Gen AI Security Project
  2. Constitutional Classifiers: Defending against universal jailbreaks — Anthropic
  3. Many-shot jailbreaking — Anthropic