Prompt hacking
Overview
Prompt hacking is an umbrella term for adversarial manipulation of language model behavior through crafted input text. The term encompasses any technique in which an attacker uses the model's input to cause it to deviate from intended behavior, whether by overriding developer instructions, bypassing safety mechanisms, or extracting confidential system information.
Prompt hacking is the superordinate category above prompt injection and jailbreaking: injection specifically targets the override of developer system prompt instructions, while jailbreaking targets the bypass of the model's safety alignment. Both are subtypes of prompt hacking.
The term was coined and popularized in the applied security community; its academic equivalent in the research literature is adversarial prompting.[1]
Taxonomy
| Technique | Target | Mechanism |
|---|---|---|
| Prompt injection | System/developer instructions | Malicious content in user or document input overrides the system prompt |
| Jailbreak | Safety alignment | Elicits policy-violating outputs by framing, roleplay, or encoding |
| Prompt leaking | Confidential system prompt | Instructs the model to repeat or reveal its system prompt |
| Goal hijacking | Task intent | Redirects the model's output toward attacker-specified goals |
| Many-shot jailbreaking | Safety alignment | Uses long context with many examples of policy-violating exchanges |
Distinguishing sub-types
The three sub-types are often conflated but are technically distinct:
- Injection exploits the model's inability to distinguish trusted (developer) from untrusted (user/document) input.
- Jailbreaking exploits the model's safety alignment training, not the system prompt architecture.
- Prompt leaking is a confidentiality attack, not a policy-bypass attack.
A single attack may combine techniques (e.g., injecting text into a document that jailbreaks the model reading the document).
Defenses
Defenses operate at different layers:
- Architectural: privileged/untrusted input channels (Anthropic's Constitutional AI, structured prompting APIs).
- Output monitoring: classifiers checking generated output for policy violations before delivery.
- Prompt hardening: explicit instructions warning the model to ignore adversarial override attempts.
- Input sanitization: filtering known jailbreak patterns from user input before model call.
No defense is complete; the attack surface is inherent to instruction-following architectures.
See also
References
- ↑ Perez, Fábio et al. "Ignore Previous Prompt: Attack Techniques For Language Models." NeurIPS ML Safety Workshop 2022. https://arxiv.org/abs/2211.09527