Benchmark contamination

Benchmark contamination — The leakage of benchmark test data into a model's training set, which inflates evaluation scores.

Overview

Benchmark contamination (also data contamination or test-set leakage) occurs when the examples used to evaluate a model — a benchmark's test set — are present, in whole or in part, in the model's training data. The model can then reproduce memorized answers rather than demonstrate generalization, inflating its measured performance and undermining the benchmark's validity.^[1]

Benchmark contamination is a special case of the broader phenomenon of data contamination. Benchmark contamination concerns evaluation data specifically, whereas data contamination can refer to any unintended overlap between training and held-out data.

How it arises and is detected

Contamination arises because large training corpora are scraped from the web, where benchmark datasets and their solutions are often published. Detection methods include:

Searching training corpora for verbatim test examples.
Comparing model performance on original vs. perturbed or newly created variants of a benchmark.
Membership-inference and prompting tests that probe whether specific items were memorized.
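The first method, searching a corpus for test examples, is often approximated with n-gram overlap. The following is a minimal sketch of that idea, not a reference implementation: the function names, the tokenization rule, and the default n-gram length of 8 are all assumptions chosen for illustration.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_example: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the test example's n-grams that occur in any corpus document.

    Hypothetical helper: the default n=8 is an illustrative assumption,
    not a value taken from the article.
    """
    test_grams = ngrams(normalize(test_example), n)
    if not test_grams:
        # Example is shorter than n tokens, so no n-gram can be compared.
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(normalize(doc), n)
    return len(test_grams & corpus_grams) / len(test_grams)


if __name__ == "__main__":
    example = "The quick brown fox jumps over the lazy dog"
    corpus = ["Somewhere online: the QUICK brown fox, jumps over a fence."]
    print(overlap_fraction(example, corpus, n=4))
```

A fraction near 1 suggests the example appears almost verbatim in the corpus, while a fraction near 0 does not rule out paraphrased or translated copies, which this kind of exact matching cannot see.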

Mitigations include held-out and frequently refreshed benchmarks, private test sets, and canary strings.
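A canary string is a unique marker embedded in benchmark files so that corpus builders can filter those files out and auditors can later check whether they leaked. The sketch below illustrates the idea under stated assumptions: the marker format, `make_canary`, and `find_canary` are hypothetical and do not reproduce any real benchmark's canary.

```python
import uuid


def make_canary(benchmark_name: str) -> str:
    """Build a unique marker to embed in every file of a benchmark.

    The format is an illustrative assumption; real benchmarks define their own.
    """
    return f"BENCHMARK-CANARY {benchmark_name} GUID {uuid.uuid4()}"


def find_canary(canary: str, documents: list[str]) -> list[int]:
    """Return the indices of documents that contain the canary string."""
    return [i for i, doc in enumerate(documents) if canary in doc]


if __name__ == "__main__":
    canary = make_canary("demo-benchmark")
    docs = [
        "An ordinary web page with no benchmark data.",
        f"question: ... answer: ...\n{canary}\n",
    ]
    # Documents flagged here would be dropped before training.
    print(find_canary(canary, docs))
```

The marker only helps if it travels with the data: copies of benchmark items that were reposted without the canary will not be caught this way.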

Distinction from related terms

Term	Scope
Benchmark contamination	Test/eval data leaked into training
Data contamination	Umbrella: any train/eval overlap
Overfitting	Fitting training data too closely (not necessarily eval leakage)
Faithfulness	Answer–source agreement, not eval integrity

Benchmark contamination is not the same as overfitting: overfitting concerns how a model fits its training distribution, whereas contamination concerns the integrity of the evaluation because test items were seen during training.

Examples

A model scores far higher on a public benchmark than on a freshly written equivalent of comparable difficulty, a common signature of contamination.
A coding benchmark's solutions appear verbatim in the scraped training corpus, so the model recites them.
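The score gap in the first example can be quantified with simple arithmetic. The sketch below is one possible way to do so, assuming per-item correctness is available as booleans and that the two item sets are independent samples; the function names and the sample numbers are made up for illustration.

```python
from math import sqrt


def accuracy(results: list[bool]) -> float:
    """Share of items answered correctly."""
    return sum(results) / len(results)


def accuracy_gap(public: list[bool], fresh: list[bool]) -> tuple[float, float]:
    """Return (gap, standard error) for public-minus-fresh accuracy.

    The standard error uses the usual formula for the difference of two
    independent proportions.
    """
    p_pub, p_new = accuracy(public), accuracy(fresh)
    se = sqrt(p_pub * (1 - p_pub) / len(public) + p_new * (1 - p_new) / len(fresh))
    return p_pub - p_new, se


if __name__ == "__main__":
    # Hypothetical results: 90/100 correct on the public set, 60/100 on the fresh set.
    public = [True] * 90 + [False] * 10
    fresh = [True] * 60 + [False] * 40
    gap, se = accuracy_gap(public, fresh)
    print(f"gap = {gap:.2f}, standard error = {se:.3f}")
```

A gap that is several standard errors above zero is consistent with contamination, but it can also reflect a difference in difficulty between the two item sets, so it is evidence rather than proof.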

References

[1] "Benchmark Data Contamination of Large Language Models: A Survey." arXiv:2406.04244. https://arxiv.org/pdf/2406.04244
