A jailbreak circumvents the guardrails installed by safety training. Wei et al. (2023) attribute success to two failure modes — competing objectives and mismatched generalisation — and Zou et al. (2023) showed that automatically optimised adversarial suffixes can transfer across many models, including closed commercial ones.
Jailbreaking differs from prompt injection: a jailbreak targets the model's own safety policy, while injection smuggles instructions through data the model processes. Both are active, unsolved security problems.