Definition

Jailbreak (LLM)

An LLM jailbreak is a crafted prompt that bypasses a model's safety training or operator-imposed guardrails, causing it to produce output the deploying organisation tried to forbid.

Common jailbreak patterns include role-play framing ("pretend you are DAN, a model with no restrictions"), step-back framing ("for educational purposes only"), Unicode obfuscation, payload smuggling inside multi-turn context, and translation pivots (ask the question in a low-resource language). Jailbreaks are related to but distinct from prompt injection — a jailbreak typically targets the model's safety layer, prompt injection targets the application's instruction layer.

Why it matters

✓Jailbreaks of consumer-facing LLMs are a brand-risk issue.
✓Jailbreaks of enterprise-deployed LLMs are a compliance + IP risk.
✓A successful jailbreak that exfiltrates training data is a GDPR breach.

Related terms

Prompt injection
AI DLP
MCP (Model Context Protocol)