Prompt hacking

From llmref.wiki
Prompt hacking — An umbrella term for techniques that manipulate a language model's behavior through crafted inputs, including prompt injection, jailbreaking, and related adversarial methods.

Overview

Prompt hacking is an umbrella term for adversarial manipulation of language model behavior through crafted input text. The term encompasses any technique in which an attacker uses the model's input to cause it to deviate from intended behavior, whether by overriding developer instructions, bypassing safety mechanisms, or extracting confidential system information.

Prompt hacking is the superordinate category above prompt injection and jailbreaking: injection specifically targets the override of developer system prompt instructions, while jailbreaking targets the bypass of the model's safety alignment. Both are subtypes of prompt hacking.

The term was coined and popularized in the applied security community; its academic equivalent in the research literature is adversarial prompting.[1]

Taxonomy

Technique Target Mechanism
Prompt injection System/developer instructions Malicious content in user or document input overrides the system prompt
Jailbreak Safety alignment Elicits policy-violating outputs by framing, roleplay, or encoding
Prompt leaking Confidential system prompt Instructs the model to repeat or reveal its system prompt
Goal hijacking Task intent Redirects the model's output toward attacker-specified goals
Many-shot jailbreaking Safety alignment Uses long context with many examples of policy-violating exchanges

Distinguishing sub-types

The three sub-types are often conflated but are technically distinct:

  • Injection exploits the model's inability to distinguish trusted (developer) from untrusted (user/document) input.
  • Jailbreaking exploits the model's safety alignment training, not the system prompt architecture.
  • Prompt leaking is a confidentiality attack, not a policy-bypass attack.

A single attack may combine techniques (e.g., injecting text into a document that jailbreaks the model reading the document).

Defenses

Defenses operate at different layers:

  • Architectural: privileged/untrusted input channels (Anthropic's Constitutional AI, structured prompting APIs).
  • Output monitoring: classifiers checking generated output for policy violations before delivery.
  • Prompt hardening: explicit instructions warning the model to ignore adversarial override attempts.
  • Input sanitization: filtering known jailbreak patterns from user input before model call.

No defense is complete; the attack surface is inherent to instruction-following architectures.

See also

References

  1. Perez, Fábio et al. "Ignore Previous Prompt: Attack Techniques For Language Models." NeurIPS ML Safety Workshop 2022. https://arxiv.org/abs/2211.09527