What is a jailbreak — Physea Wiki

A jailbreak is text crafted to make a model disregard its safety training and produce content it was meant to refuse. OWASP treats it as a form of prompt injection, but the two aim at different things.

A jailbreak is a piece of input crafted to make a language model ignore its own safety rules and produce something it was trained to refuse, such as instructions for a dangerous task or hateful content. The word borrows from phone jailbreaking: you are not breaking into a server, you are talking the model out of the limits its makers built in.

OWASP, the security community that maintains the standard list of LLM application risks, files jailbreaking under its top-ranked risk, prompt injection (catalogued as LLM01). OWASP's definition is plain: “Jailbreaking is a form of prompt injection where the attacker provides inputs that cause the model to disregard its safety protocols entirely.”^[1]

So the two overlap, but they point at different things. Prompt injection is the broader category: any input that alters a model’s behavior in unintended ways, which often means hijacking an application to make the model do the attacker’s bidding instead of the developer’s. A jailbreak narrows that to one specific aim: defeating the model’s safety training so it will say what it normally would not. You can inject without jailbreaking (for example, by redirecting a customer-support bot to leak its hidden prompt). In practice, though, the line between the terms is blurry enough that people often use them interchangeably.^[1]

Rule of thumb: Prompt injection asks “whose instructions does the model follow?” A jailbreak asks “can I get past the safety filter?” Most real attacks mix both.
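The two questions in the rule of thumb can be treated as independent yes/no tags on an observed attempt. The Python sketch below is illustrative only: the `Attempt` class, its field names, and the `label` function are hypothetical names made up for this example, and the labels follow the rule of thumb rather than any official OWASP taxonomy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One observed attempt, described by the two rule-of-thumb questions.

    Both field names are hypothetical and exist only for this illustration.
    """
    overrides_developer_instructions: bool  # "whose instructions does the model follow?"
    targets_safety_rules: bool              # "can I get past the safety filter?"


def label(attempt: Attempt) -> str:
    """Return a rough label for an attempt based on which questions apply."""
    if attempt.overrides_developer_instructions and attempt.targets_safety_rules:
        return "prompt injection + jailbreak"
    if attempt.targets_safety_rules:
        return "jailbreak"
    if attempt.overrides_developer_instructions:
        return "prompt injection"
    return "neither"


# The support-bot example from the text: the bot is redirected to leak its
# hidden prompt, but no safety rule is being defeated.
support_bot_leak = Attempt(
    overrides_developer_instructions=True,
    targets_safety_rules=False,
)
print(label(support_bot_leak))
```

Keeping the two tags separate, rather than forcing a single category, reflects the point that most real attacks mix both.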

References

[1] LLM01:2025 Prompt Injection, OWASP Gen AI Security Project

