Subject 06 · Cross-cutting; matters most once agents can act
Safety & Security
Prompt injection, jailbreaks, alignment, and data privacy. Where tool use turns a wrong answer into a wrong action.
23 pages across 6 topics
Prompt injection
The top LLM security risk.
- One channel Prompt injection is the top-ranked LLM application security risk because a model receives its system prompt, the user's message, and any data it reads as one undifferentiated stream, with no protected channel for instructions separate from data.
- Two forms Direct injection is the user's own input overriding the system's intent. Indirect injection, the more dangerous form, hides instructions inside data the model will later read, turning any untrusted document into a remote control for an agent.
- Why unsolved Prompt injection is effectively unsolved. OWASP says there may be no fool-proof method of prevention, and the NCSC warns it may be an inherent issue with LLM technology, because a probabilistic model can be reworded around any filter.
- How to contain Since you cannot assume prevention, the goal is to make a successful injection low-impact through layered controls like least privilege and human approval, plus by-design defenses such as Google DeepMind's CaMeL that fix the channel rather than the model's behavior.
Jailbreaks
Bypassing a model’s safety training.
- What is a jailbreak A jailbreak is text crafted to make a model disregard its safety training and produce content it was meant to refuse. OWASP treats it as a form of prompt injection, but the two aim at different things.
- Common techniques The best-studied jailbreaks fall into a few families: roleplay personas such as DAN, encoding and translation tricks that hide intent, and many-shot priming that floods the prompt with fake compliant examples.
- Why it stays hard Safety is learned behavior layered on a model that wants to be helpful, so a clever enough prompt can pull it off course. Patches tend to raise the cost of an attack rather than end it.
- Mitigations There is no single fix, so defense is layered: OWASP's controls plus a guard layer of trained classifiers that screen inputs and outputs around the model. The aim is to make jailbreaks costly and low-impact, not impossible.
Alignment basics
Getting models to do what we intend.
- What is alignment Alignment is the gap between what we tell a model to optimize and what we actually want it to do. A system can score perfectly on its training objective and still behave in ways nobody intended.
- RLHF RLHF teaches a model what people prefer by having humans compare its outputs, training a reward model on those comparisons, then optimizing the model against that reward. It is the main reason modern chat assistants feel helpful.
- Constitutional AI Constitutional AI replaces human labels for harmful content with a short written set of principles the model uses to critique and revise its own answers, then learns from AI-generated preferences instead of human ones.
- Outer vs inner Outer alignment asks whether the objective we train on matches what we want. Inner alignment asks whether the model that comes out actually pursues that objective, rather than a stand-in goal that only looked right during training.
Data privacy
What leaves your machine, and what does not.
- What leaves With a hosted API, your prompt and any files you attach travel over the network to a vendor's servers for processing. With a model running on your own hardware, the same text never leaves the device.
- Training on your data For paid business and developer APIs, the major vendors say by default they do not train on your inputs. The picture flips for some consumer and free tiers, where your data may be used unless you opt out.
- Retention and ZDR Even when a vendor does not train on your data, it may store it for a while to detect abuse, commonly on the order of 30 days. Enterprise zero-data-retention and HIPAA arrangements can remove or narrow that storage.
- Local as privacy A vendor's privacy policy is a promise about data that has already left your machine. Running the model on hardware you control means the data never travels at all, so the protection comes from how the system is built rather than from trust.
Evaluating trust
How to tell if output can be relied on.
- Confident vs correct A model's tone of certainty does not track its accuracy. Research finds models are frequently overconfident, and people tend to trust confident-sounding answers even when they are wrong, so confidence is not evidence of correctness.
- Why models guess Models invent plausible answers because the way they are scored rewards guessing over saying I don't know. Like a student on a multiple-choice test, a guess can earn points while an honest blank earns none, so the model learns to guess.
- Grounding & citations Connecting a model to real source documents and showing citations reduces errors, but does not remove them. Studies of cited, source-grounded tools still found errors in 17 to 33 percent of answers, so the citation must be checked, not just counted.
- Benchmarks vs reality A high benchmark score does not guarantee a model will work on your task. Scores can be inflated when test data leaks into training, and a benchmark rarely matches your real conditions, so test on your own examples.
Safe deployment
A checklist before you ship.
- Before you ship Before an AI feature goes live, run a short checklist: give it the least access it needs, keep a human on irreversible actions, filter inputs and outputs, log what it does, test it, and keep a way to turn it off or roll it back.
- Limit its reach Limit reach in three ways: grant the least access the task needs (least privilege), require a human to approve any irreversible action, and filter inputs and outputs because both can carry hidden instructions or leaked data.
- Watch and undo After launch, safety comes from being able to see and undo. Log every decision and tool call, test before release through a CI/CD gate, monitor the deployed system, and keep a documented way to roll it back or deactivate it.