Benchmark contamination

From llmref.wiki
Benchmark contamination — The leakage of benchmark test data into a model's training set, which inflates evaluation scores.

Overview

Benchmark contamination (also data contamination or test-set leakage) occurs when the examples used to evaluate a model — a benchmark's test set — are present, in whole or in part, in the model's training data. The model can then reproduce memorized answers rather than demonstrate generalization, inflating its measured performance and undermining the benchmark's validity.[1]

Although the two terms are often used interchangeably, benchmark contamination is strictly a subset of data contamination: it concerns leaked evaluation data specifically, whereas data contamination can refer to any unintended overlap between training data and held-out data.

How it arises and is detected

Contamination arises because large training corpora are scraped from the web, where benchmark datasets and their solutions are often published. Detection methods include:

  • Searching training corpora for verbatim or near-verbatim copies of test examples, for instance by n-gram overlap.
  • Comparing model performance on original vs. perturbed or newly created variants of a benchmark.
  • Membership-inference and prompting tests that probe whether specific items were memorized.
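The first detection method above, searching a corpus for test examples, can be sketched as a word n-gram overlap check. This is a minimal illustration, not a reference implementation: the function names, the default n-gram length of 8, and the 0.5 threshold are all assumptions chosen for the example, and a real pipeline would need an index rather than a pairwise scan.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(test_set, corpus, n=8, threshold=0.5):
    """Return indices of test examples that overlap heavily with the corpus.

    An example is flagged when at least `threshold` of its n-grams appear
    in a single corpus document. Both `n` and `threshold` are illustrative
    values, not established standards.
    """
    corpus_grams = [ngrams(doc, n) for doc in corpus]
    flagged = []
    for idx, example in enumerate(test_set):
        grams = ngrams(example, n)
        if not grams:
            # Example is shorter than n tokens; nothing to compare.
            continue
        best = max(
            (len(grams & doc_grams) / len(grams) for doc_grams in corpus_grams),
            default=0.0,
        )
        if best >= threshold:
            flagged.append(idx)
    return flagged
```

Lowercasing and stripping punctuation makes the check tolerant of trivial formatting differences, but it will still miss paraphrased or translated copies, which is why the perturbation and membership-inference methods are used alongside it.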

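Mitigations include held-out and frequently refreshed benchmarks, private test sets, and canary strings (unique marker strings embedded in benchmark files so that documents containing them can be identified and filtered out of training corpora).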
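The canary-string mitigation can be sketched as follows. The marker value and both function names are hypothetical and exist only for this example; real benchmarks publish their own canary values, and this filter only helps if corpus builders actually check for them.

```python
# Hypothetical canary value, invented for this example.
CANARY = "BENCHMARK-CANARY-GUID 00000000-0000-0000-0000-000000000000"


def tag_with_canary(benchmark_text, canary=CANARY):
    """Prepend the canary string to a benchmark file before publishing it."""
    return f"{canary}\n{benchmark_text}"


def filter_corpus(documents, canaries):
    """Split documents into (kept, dropped) by presence of any canary string."""
    kept, dropped = [], []
    for doc in documents:
        if any(canary in doc for canary in canaries):
            dropped.append(doc)
        else:
            kept.append(doc)
    return kept, dropped
```

A canary only protects copies that still carry the marker: if the benchmark is re-posted with the string stripped, this filter will not catch it.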

Distinction from related terms

Term                    | Scope
Benchmark contamination | Test/eval data leaked into training
Data contamination      | Umbrella: any train/eval overlap
Overfitting             | Fitting training data too closely (not necessarily eval leakage)
Faithfulness            | Answer–source agreement, not eval integrity

Benchmark contamination is not the same as overfitting: overfitting concerns how a model fits its training distribution, whereas contamination concerns the integrity of the evaluation because test items were seen during training.

Examples

  • A model scores far higher on a public benchmark than on a freshly written equivalent of comparable difficulty, a common signature of contamination.
  • A coding benchmark's solutions appear verbatim in the scraped training corpus, so the model recites them.
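The first example, a score gap between a public benchmark and a fresh equivalent, can be quantified with a simple two-proportion z-statistic. This is a rough sketch under stated assumptions: per-item results are given as lists of booleans, the two item sets are assumed to be of comparable difficulty, and the function name is hypothetical. A large gap is suggestive rather than conclusive, since the fresh items may simply be harder.

```python
import math


def contamination_gap(original, fresh):
    """Compare accuracy on original vs. freshly written benchmark items.

    `original` and `fresh` are non-empty lists of booleans (True = correct).
    Returns (accuracy_gap, z), where z is the pooled two-proportion
    z-statistic for the difference in accuracy.
    """
    n1, n2 = len(original), len(fresh)
    p1, p2 = sum(original) / n1, sum(fresh) / n2
    pooled = (sum(original) + sum(fresh)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    # se is zero when every item is right or every item is wrong.
    z = (p1 - p2) / se if se > 0 else 0.0
    return p1 - p2, z
```

For instance, 90 of 100 correct on the original items against 60 of 100 on fresh items gives a gap of about 0.30 and a z-statistic well above the conventional 1.96 cutoff, which would justify a closer look with corpus search or memorization probes.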

References

  1. "Benchmark Data Contamination of Large Language Models: A Survey." arXiv:2406.04244. https://arxiv.org/pdf/2406.04244