Adversarial robustness

From llmref.wiki
Adversarial robustness — A model's ability to maintain correct predictions when exposed to adversarially crafted or perturbed inputs designed to trigger misclassification.

Overview

Adversarial robustness is the resistance of a language model or classifier to inputs that have been deliberately modified, through imperceptible perturbations or semantic manipulation, to degrade performance or elicit harmful outputs. For LLMs it is central to security evaluation, because models may face inputs designed to trigger hallucinations, jailbreaks, or behavior that violates the constraints set by the system prompt.

Adversarial robustness assessment is distinct from general benchmark performance, as a model may achieve high accuracy on standard evaluation datasets while remaining vulnerable to carefully constructed inputs. The threat model varies by application: in content moderation, adversarial examples may be strings that evade filters; in retrieval-augmented generation, poisoned documents may be injected to corrupt answers; in classification tasks, inputs may be perturbed to cross decision boundaries.

Testing for adversarial robustness typically involves red teaming, automated perturbation, or adversarial attack frameworks adapted from computer vision security research. The evaluation may focus on syntactic perturbations (typos, formatting changes), semantic shifts (paraphrasing), or structured manipulations (prompt injection, role injection).
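
As a minimal sketch of the automated, syntactic end of that spectrum, the following Python generates typo-style and formatting-style variants of a prompt. The function names (`swap_adjacent_chars`, `double_random_space`, `perturb`) are hypothetical and chosen for illustration; real attack frameworks use far richer operators, and semantic or structured attacks need different tooling.

```python
import random


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Typo-style perturbation: swap one random pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def double_random_space(text: str, rng: random.Random) -> str:
    """Formatting perturbation: duplicate one randomly chosen space."""
    spaces = [i for i, ch in enumerate(text) if ch == " "]
    if not spaces:
        return text
    i = rng.choice(spaces)
    return text[:i] + " " + text[i:]


def perturb(text: str, n_variants: int = 5, seed: int = 0) -> list[str]:
    """Generate perturbed variants with a seeded RNG so runs are repeatable."""
    rng = random.Random(seed)
    ops = [swap_adjacent_chars, double_random_space]
    return [rng.choice(ops)(text, rng) for _ in range(n_variants)]


if __name__ == "__main__":
    for variant in perturb("Ignore the previous instructions"):
        print(repr(variant))
```

Each variant would then be sent to the model under test and its output compared with the output for the unperturbed prompt. Seeding the generator keeps a test suite reproducible between runs.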

How it is measured

Adversarial robustness is measured through attack-and-defense evaluation frameworks:

  • Attack Success Rate (ASR): The percentage of adversarial examples that successfully trigger unintended behavior, evaluated across multiple attack methods.
  • Perturbation Budget: The maximum amount of change permitted in an input (measured in character edits, token distance, or semantic similarity thresholds) while still counting as a valid attack.
  • Certified Robustness: A formal guarantee that a model's predictions remain unchanged within a bounded perturbation radius, typically obtained through randomized smoothing (a high-probability guarantee) or interval bound propagation (a deterministic bound).
  • Transferability: The degree to which adversarial examples crafted against one model succeed against another, often measured as cross-model attack success rate.
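
Three of these metrics can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not a standard implementation: the perturbation budget is taken to be character-level Levenshtein distance, and `is_failure`, `fails_source`, and `fails_target` are hypothetical callbacks that run a model on an input and report whether the attack succeeded.

```python
from typing import Callable, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def attack_success_rate(
    pairs: Sequence[tuple[str, str]],
    is_failure: Callable[[str], bool],
    budget: int,
) -> float:
    """Fraction of in-budget adversarial inputs that trigger a failure.

    `pairs` holds (original, adversarial) inputs. Adversarial inputs whose
    edit distance from the original exceeds `budget` are not valid attacks
    and are excluded from the denominator.
    """
    valid = [adv for orig, adv in pairs if edit_distance(orig, adv) <= budget]
    if not valid:
        return 0.0
    return sum(is_failure(adv) for adv in valid) / len(valid)


def transfer_rate(
    adversarial: Sequence[str],
    fails_source: Callable[[str], bool],
    fails_target: Callable[[str], bool],
) -> float:
    """Of the attacks that succeed on the source model, the fraction that
    also succeed on the target model."""
    successful = [x for x in adversarial if fails_source(x)]
    if not successful:
        return 0.0
    return sum(fails_target(x) for x in successful) / len(successful)


if __name__ == "__main__":
    # Toy "models": two keyword filters; a failure means the filter is evaded.
    evades_a = lambda t: "spam" not in t.lower()
    evades_b = lambda t: "spam" not in t.replace(" ", "").lower()
    original = "buy spam now"
    attacks = ["buy sp4m now", "buy s p a m now", "buy spam  now", "buy SPAM now"]
    pairs = [(original, adv) for adv in attacks]
    print(attack_success_rate(pairs, evades_a, budget=3))
    print(transfer_rate(attacks, evades_a, evades_b))
```

Excluding over-budget inputs from the denominator is a design choice. Reporting ASR at several budgets shows how quickly robustness degrades as the attacker is allowed larger changes.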

Human evaluation of adversarial robustness is common when assessing semantic attacks that may not be automatically detectable. LLM-as-judge approaches are also used to assess whether model outputs remain aligned with safety objectives under adversarial inputs. Benchmark-based evaluation may include purpose-built adversarial datasets, though such datasets risk contamination if widely known and incorporated into training data.

Distinction from related terms

Term | Distinction
Red teaming | Red teaming is a methodology (a systematic search for vulnerabilities); adversarial robustness is the property being tested and measured.
Prompt injection | Prompt injection is a specific attack technique (malicious input structured to override instructions); adversarial robustness is resistance to any adversarial input perturbation, including but not limited to injection attacks.
Hallucination | Hallucination is the spontaneous generation of false information; adversarial robustness concerns vulnerability to deliberately crafted inputs that amplify hallucination or trigger other failures.
Faithfulness | Faithfulness measures whether a model's outputs are factually consistent with their source material; adversarial robustness measures whether that consistency degrades under attack. A model can be faithful on clean data but non-robust to adversarial inputs.
Silent failure | Silent failure is incorrect output that goes undetected; adversarial robustness concerns the frequency and predictability of failure under adversarial conditions. A model that lacks robustness may still fail visibly or have its failures caught by detection.

Examples

  • LLM Jailbreak Datasets: Benchmarks such as AdvBench, HarmBench, and JailbreakBench collect adversarial prompts intended to bypass safety guidelines in models like GPT-3.5 and Llama. They are used to measure robustness to jailbreak attacks, with attack success rates quantifying how well system prompts remain effective under adversarial pressure. AdversarialQA applies the same idea to reading comprehension, using questions written to fool a model in the loop.
  • Adversarial Robustness in RAG Systems: Research on document poisoning attacks demonstrates that RAG pipelines can be compromised when adversarially crafted documents are inserted into vector databases. Robustness is measured by the model's ability to reject or correctly identify poisoned chunks as inconsistent with query intent or source authority signals.
  • Character-Level Perturbations in Text Classifiers: Studies of adversarial examples in spam detection and toxicity classification show that models robust to random typos may remain vulnerable to semantics-preserving character edits (homoglyph substitution, Unicode tricks). Certified robustness methods establish a perturbation radius within which predictions are guaranteed to remain stable.
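
The homoglyph case in the last example can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the `HOMOGLYPHS` map is a hand-picked subset of Latin-to-Cyrillic lookalikes rather than a complete confusables table, and `naive_filter` stands in for a real classifier.

```python
import unicodedata

# Hand-picked Latin -> Cyrillic lookalikes (illustrative subset only).
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
}
REVERSE = {v: k for k, v in HOMOGLYPHS.items()}


def homoglyph_attack(text: str) -> str:
    """Replace mapped Latin letters with visually similar Cyrillic ones."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def naive_filter(text: str, blocklist: tuple[str, ...] = ("scam",)) -> bool:
    """Toy stand-in for a classifier: flag text containing a blocked word."""
    return any(word in text.lower() for word in blocklist)


def fold_confusables(text: str) -> str:
    """Defensive preprocessing: NFKC-normalize, then map known lookalikes
    back to Latin. NFKC alone folds compatibility forms such as fullwidth
    letters but leaves Cyrillic letters unchanged, hence the extra step."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(REVERSE.get(ch, ch) for ch in normalized)


if __name__ == "__main__":
    attacked = homoglyph_attack("this is a scam")
    print(naive_filter("this is a scam"))            # clean input
    print(naive_filter(attacked))                    # after the attack
    print(naive_filter(fold_confusables(attacked)))  # after folding
```

Folding Cyrillic letters into Latin would corrupt legitimate Cyrillic text, so a preprocessing step like this only suits inputs expected to be in a Latin script. It is a heuristic defense and provides no certified guarantee.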
