Prompt hacking

Prompt hacking — An umbrella term for techniques that manipulate a language model's behavior through crafted inputs, including prompt injection, jailbreaking, and related adversarial methods.

Overview

Prompt hacking is an umbrella term for adversarial manipulation of language model behavior through crafted input text. The term encompasses any technique in which an attacker uses the model's input to cause it to deviate from intended behavior, whether by overriding developer instructions, bypassing safety mechanisms, or extracting confidential system information.

Prompt hacking is the superordinate category above prompt injection and jailbreaking: injection specifically targets the override of developer system prompt instructions, while jailbreaking targets the bypass of the model's safety alignment. Both are subtypes of prompt hacking.

The term was coined and popularized in the applied security community; its academic equivalent in the research literature is adversarial prompting.^[1]

Taxonomy

Technique	Target	Mechanism
Prompt injection	System/developer instructions	Malicious content in user or document input overrides the system prompt
Jailbreak	Safety alignment	Elicits policy-violating outputs by framing, roleplay, or encoding
Prompt leaking	Confidential system prompt	Instructs the model to repeat or reveal its system prompt
Goal hijacking	Task intent	Redirects the model's output toward attacker-specified goals
Many-shot jailbreaking	Safety alignment	Uses long context with many examples of policy-violating exchanges

Distinguishing sub-types

The three sub-types are often conflated but are technically distinct:

Injection exploits the model's inability to distinguish trusted (developer) from untrusted (user/document) input.
Jailbreaking exploits the model's safety alignment training, not the system prompt architecture.
Prompt leaking is a confidentiality attack, not a policy-bypass attack.

A single attack may combine techniques (e.g., injecting text into a document that jailbreaks the model reading the document).

Defenses

Defenses operate at different layers:

Architectural: privileged/untrusted input channels (Anthropic's Constitutional AI, structured prompting APIs).
Output monitoring: classifiers checking generated output for policy violations before delivery.
Prompt hardening: explicit instructions warning the model to ignore adversarial override attempts.
Input sanitization: filtering known jailbreak patterns from user input before model call.

No defense is complete; the attack surface is inherent to instruction-following architectures.

References

↑ Perez, Fábio et al. "Ignore Previous Prompt: Attack Techniques For Language Models." NeurIPS ML Safety Workshop 2022. https://arxiv.org/abs/2211.09527

[perez-1] Perez, Fábio et al. "Ignore Previous Prompt: Attack Techniques For Language Models." NeurIPS ML Safety Workshop 2022. https://arxiv.org/abs/2211.09527

[1]

Anonymous

Search

Prompt hacking

Namespaces

More

Page actions

Contents

Overview

Taxonomy

Distinguishing sub-types

Defenses

See also

References

Navigation

Navigation

Wiki tools

Wiki tools

Anonymous

Search

Prompt hacking

Overview

Taxonomy

Distinguishing sub-types

Defenses

See also

References

Navigation

Wiki tools

Page tools

Categories