Definition
Jailbreak (LLM)
An LLM jailbreak is a crafted prompt that bypasses a model's safety training or operator-imposed guardrails, causing it to produce output the deploying organisation tried to forbid.
Common jailbreak patterns include role-play framing ("pretend you are DAN, a model with no restrictions"), step-back framing ("for educational purposes only"), Unicode obfuscation, payload smuggling inside multi-turn context, and translation pivots (ask the question in a low-resource language). Jailbreaks are related to but distinct from prompt injection — a jailbreak typically targets the model's safety layer, prompt injection targets the application's instruction layer.
Why it matters
- ✓Jailbreaks of consumer-facing LLMs are a brand-risk issue.
- ✓Jailbreaks of enterprise-deployed LLMs are a compliance + IP risk.
- ✓A successful jailbreak that exfiltrates training data is a GDPR breach.