Jailbreak

LLM Jailbreak

Definition

A jailbreak is an adversarial prompting technique that attempts to bypass a language model's safety guidelines and content filters so that the model produces output it would normally refuse. Jailbreaks typically rely on misleading prompt framing, roleplay scenarios, or encoded input to get past safety mechanisms.

Robustness against jailbreaks is an ongoing alignment and red-teaming challenge.
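
To illustrate why encoded input can slip past a simple safety check, here is a minimal defensive sketch. It assumes a keyword-based filter, which is far cruder than the classifiers used in practice. The names BLOCKED_TERMS, naive_filter, decode_base64_tokens, and normalizing_filter are hypothetical and chosen only for illustration, and the blocked term is a harmless placeholder.

```python
import base64
import binascii
from typing import List

# Hypothetical blocklist with a placeholder term; real systems use trained classifiers.
BLOCKED_TERMS = {"forbidden topic"}


def naive_filter(prompt: str) -> bool:
    """Return True if the raw prompt text contains a blocked term."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def decode_base64_tokens(prompt: str) -> List[str]:
    """Best-effort decode of whitespace-separated tokens that are valid Base64 text."""
    decoded = []
    for token in prompt.split():
        try:
            raw = base64.b64decode(token, validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            # Not valid Base64, or not UTF-8 text once decoded: skip the token.
            continue
    return decoded


def normalizing_filter(prompt: str) -> bool:
    """Check the raw prompt and any decodable Base64 tokens it contains."""
    candidates = [prompt] + decode_base64_tokens(prompt)
    return any(naive_filter(text) for text in candidates)


# The same blocked phrase, wrapped in Base64 inside an otherwise ordinary request.
encoded = base64.b64encode(b"forbidden topic").decode("ascii")
wrapped = f"Decode this and respond: {encoded}"

print(naive_filter(wrapped))        # the raw-text check misses the encoded phrase
print(normalizing_filter(wrapped))  # decoding before checking catches it
```

Normalizing input before checking it covers only one encoding. Roleplay and framing attacks do not change the surface form of the text in this way, so they would need different defenses than the one sketched here.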

Related Terms

Red teaming

AI Red Teaming

Prompt injection

Prompt Injection Attack

Guardrails

Cloud Governance Guardrails

