Benchmark contamination
Overview
Benchmark contamination (also called test-set leakage, and often discussed under the broader label of data contamination) occurs when the examples used to evaluate a model, that is, a benchmark's test set, are present in whole or in part in the model's training data. The model can then reproduce memorized answers rather than demonstrate generalization, which inflates its measured performance and undermines the benchmark's validity.[1]
Benchmark contamination is a special case of data contamination: it concerns a benchmark's evaluation data specifically, whereas data contamination can refer to any unintended overlap between training data and held-out data.
How it arises and is detected
Contamination arises because large training corpora are scraped from the web, where benchmark datasets and their solutions are often published. Detection methods include:
- Searching training corpora for verbatim test examples.
- Comparing model performance on original vs. perturbed or newly created variants of a benchmark.
- Membership-inference and prompting tests that probe whether specific items were memorized.
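The first method above, searching for verbatim test examples, can be approximated with word n-gram overlap. The following is a minimal sketch rather than a production decontamination pipeline: the function names `ngrams` and `overlap_ratio`, the tokenization rule, and the default window of 8 words are illustrative assumptions, and a real corpus would need an index instead of an in-memory set.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in lowercased, punctuation-stripped text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_example: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the test example's n-grams that occur anywhere in the corpus."""
    example_grams = ngrams(test_example, n)
    if not example_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(example_grams & corpus_grams) / len(example_grams)


# Hypothetical toy corpus; the first document quotes a benchmark item verbatim.
corpus = [
    "Quiz archive: What is the capital of France? The answer is Paris, of course.",
    "Unrelated text about cooking pasta at home tonight.",
]
leaked = overlap_ratio("What is the capital of France? The answer is Paris.", corpus, n=5)
clean = overlap_ratio("How many moons does the planet Mars have in total?", corpus, n=5)
```

A ratio near 1 suggests the item appears almost verbatim in the corpus; a ratio near 0 does not rule out paraphrased or translated copies, which exact n-gram matching cannot see.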
Mitigations include held-out and frequently refreshed benchmarks, private test sets, and canary strings.
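A canary string is a unique marker embedded in benchmark files so that corpus builders can exclude any document containing it. The sketch below shows the filtering side only, under stated assumptions: the marker value `CANARY` and the helper `filter_canary` are hypothetical and do not correspond to any particular benchmark's actual canary.

```python
# Hypothetical marker; a real benchmark publishes its own unique string.
CANARY = "BENCHMARK-CANARY-3f2a9c1e-hypothetical"


def filter_canary(docs: list[str], canary: str = CANARY) -> tuple[list[str], int]:
    """Drop documents containing the canary; return (kept docs, number removed)."""
    kept = [doc for doc in docs if canary not in doc]
    return kept, len(docs) - len(kept)


docs = [
    "An ordinary web page about gardening.",
    f"Benchmark file. {CANARY} Question 1: ...",
]
kept, removed = filter_canary(docs)
```

This only helps when copies of the benchmark keep the marker intact; reposted or reformatted items that lose the canary pass through the filter.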
| Term | Scope |
|---|---|
| Benchmark contamination | Test/eval data leaked into training |
| Data contamination | Umbrella: any train/eval overlap |
| Overfitting | Fitting training data too closely (not necessarily eval leakage) |
| Faithfulness | Answer–source agreement, not eval integrity |
Benchmark contamination is not the same as overfitting: overfitting concerns how a model fits its training distribution, whereas contamination concerns the integrity of the evaluation because test items were seen during training.
Examples
- A model scores far higher on a public benchmark than on a freshly written equivalent of comparable difficulty, a pattern consistent with contamination.
- A coding benchmark's solutions appear verbatim in the scraped training corpus, so the model recites them.
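The comparison in the first example can be quantified as an accuracy gap between the original benchmark and a fresh variant. The following is a rough sketch, assuming per-item correctness is available as lists of booleans; the function names and the choice of a pooled two-proportion z statistic are illustrative assumptions, not a method prescribed by the source.

```python
import math


def accuracy(results: list[bool]) -> float:
    """Share of items answered correctly."""
    return sum(results) / len(results)


def contamination_gap(original: list[bool], fresh: list[bool]) -> tuple[float, float]:
    """Return (accuracy gap, pooled two-proportion z statistic)."""
    p_orig, p_fresh = accuracy(original), accuracy(fresh)
    n_orig, n_fresh = len(original), len(fresh)
    pooled = (sum(original) + sum(fresh)) / (n_orig + n_fresh)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_orig + 1 / n_fresh))
    # se is zero when every item is right (or wrong) in both sets.
    z = (p_orig - p_fresh) / se if se > 0 else 0.0
    return p_orig - p_fresh, z


# Hypothetical results: 90/100 correct on the public set, 60/100 on the fresh set.
original = [True] * 90 + [False] * 10
fresh = [True] * 60 + [False] * 40
gap, z = contamination_gap(original, fresh)
```

A large positive gap is suggestive rather than conclusive, since the fresh variant may simply be harder; the comparison is only meaningful when the two sets are matched in difficulty and format.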
References
- [1] "Benchmark Data Contamination of Large Language Models: A Survey." arXiv:2406.04244. https://arxiv.org/pdf/2406.04244