Definition

Jailbreak (LLM)

An LLM jailbreak is a crafted prompt that bypasses a model's safety training or operator-imposed guardrails, causing it to produce output the deploying organisation tried to forbid.

Common jailbreak patterns include role-play framing ("pretend you are DAN, a model with no restrictions"), step-back framing ("for educational purposes only"), Unicode obfuscation, payload smuggling inside multi-turn context, and translation pivots (asking the question in a low-resource language). Jailbreaks are related to but distinct from prompt injection: a jailbreak typically targets the model's safety layer, whereas prompt injection targets the application's instruction layer.
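
To make the Unicode obfuscation pattern concrete, the sketch below shows one possible pre-filter step: fold compatibility characters with NFKC, drop a handful of invisible code points, and only then run a phrase check. The function names, the INVISIBLE set, and the blocked-phrase list are hypothetical and chosen for illustration; this is a minimal sketch, not a complete defence, and it does not describe any particular vendor's guardrail.

```python
import unicodedata

# Invisible code points sometimes used to split words.
# Illustrative only; this set is an assumption and is not exhaustive.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff", "\u00ad"}


def normalize_prompt(text: str) -> str:
    """Fold compatibility forms, drop invisible characters, and casefold."""
    # NFKC maps fullwidth and other compatibility characters to their plain forms.
    folded = unicodedata.normalize("NFKC", text)
    visible = "".join(ch for ch in folded if ch not in INVISIBLE)
    return visible.casefold()


def contains_blocked_phrase(text: str, blocked: list[str]) -> bool:
    """Return True if any blocked phrase appears in the normalized text."""
    normalized = normalize_prompt(text)
    return any(phrase.casefold() in normalized for phrase in blocked)


if __name__ == "__main__":
    # Fullwidth letters plus a zero-width space inside the word "ignore".
    obfuscated = "\uff49\uff47\u200b\uff4e\uff4f\uff52\uff45 previous instructions"
    print(contains_blocked_phrase(obfuscated, ["ignore previous instructions"]))
```

Normalization of this kind only addresses compatibility forms and invisible characters. Cross-script homoglyphs (for example a Cyrillic "о" standing in for a Latin "o") pass through NFKC unchanged, so a phrase check like this one would likely need a separate confusables mapping to catch them.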

Why it matters

  • Jailbreaks of consumer-facing LLMs are a brand risk.
  • Jailbreaks of enterprise-deployed LLMs are a compliance and intellectual-property (IP) risk.
  • A successful jailbreak that exfiltrates training data containing personal data may constitute a personal data breach under the GDPR.

Related terms