Principle:Princeton nlp SimPO Benchmark Decontamination
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Evaluation, NLP |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
A data filtering technique that removes benchmark evaluation content from training datasets to prevent artificially inflated evaluation scores.
Description
Benchmark Decontamination is a data quality practice that detects and removes training samples containing content from evaluation benchmarks. When training data overlaps with evaluation benchmarks (e.g., HumanEval, MBPP), the model can memorize evaluation answers during training, producing inflated scores that do not reflect genuine capability. Decontamination addresses this by scanning training text for substring matches against known benchmark content (docstrings, prompts, canonical solutions) and excluding every sample that contains a match. A whitelist of trivially simple patterns (e.g., return x + y) avoids over-filtering common idioms that appear in both benchmarks and legitimate training data.
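As an illustration of how the reference strings described above might be gathered, the following is a minimal sketch, not the project's actual code. The field names `prompt` and `canonical_solution` mirror the columns of the HumanEval dataset, but the two records are invented for the example, and `extract_docstrings`, `collect_reference_strings`, and `TRIVIAL` are hypothetical names.

```python
import re

# Hypothetical benchmark records. The keys mirror HumanEval's "prompt" and
# "canonical_solution" columns; the contents are made up for illustration.
records = [
    {
        "prompt": 'def add(x, y):\n    """Add two numbers x and y."""\n',
        "canonical_solution": "    return x + y\n",
    },
    {
        "prompt": 'def is_even(n):\n    """Return True when n is divisible by two."""\n',
        "canonical_solution": "    return n % 2 == 0\n",
    },
]

# Matches the body of a triple-double-quoted string, across line breaks.
DOCSTRING_RE = re.compile(r'"""(.*?)"""', re.DOTALL)


def extract_docstrings(prompt):
    """Return the stripped text of every triple-quoted string in a prompt."""
    return [m.group(1).strip() for m in DOCSTRING_RE.finditer(prompt) if m.group(1).strip()]


def collect_reference_strings(records, trivial):
    """Gather docstrings and solutions, skipping whitelisted trivial patterns."""
    refs = []
    for rec in records:
        candidates = extract_docstrings(rec["prompt"])
        candidates.append(rec["canonical_solution"].strip())
        for cand in candidates:
            if cand and cand not in trivial:
                refs.append(cand)
    return refs


# Assumed whitelist: idioms too common to count as evidence of contamination.
TRIVIAL = {"return x + y"}
reference_strings = collect_reference_strings(records, TRIVIAL)
```

In this sketch the solution `return x + y` is dropped from the reference list because it is whitelisted, while the docstrings and the second solution remain as strings to search for.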
Usage
Apply this principle when curating training data for supervised fine-tuning or preference optimization where the model will later be evaluated on code generation benchmarks such as HumanEval. It is essential for any training pipeline that draws from broad web-scraped or code-based corpora where benchmark content may inadvertently appear.
Theoretical Basis
The core mechanism is substring containment checking with normalization:
Pseudo-code Logic:
# Abstract decontamination algorithm
benchmark_strings = load_benchmark_docstrings() + load_benchmark_solutions()
trivial_strings = define_trivial_patterns()

for sample in training_data:
    normalized_sample = normalize(sample.lower())
    contaminated = False
    for ref_string in benchmark_strings:
        if ref_string not in trivial_strings:
            if normalize(ref_string.lower()) in normalized_sample:
                contaminated = True
                break
    if not contaminated:
        yield sample  # Keep only clean samples
Key design decisions:
- Case-insensitive matching reduces false negatives caused by capitalization differences
- Whitespace normalization collapses formatting variations into a canonical form
- Trivial pattern whitelist prevents removal of ubiquitous code idioms that happen to appear in benchmarks
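The three design decisions above can be combined into a small runnable filter. This is a sketch under stated assumptions rather than a definitive implementation: `normalize` and `decontaminate` are hypothetical names, whitespace normalization is assumed to mean collapsing every whitespace run into a single space, and the sample data is invented. Unlike the pseudo-code, the sketch compares whitelist entries after normalization and normalizes the reference strings once up front.

```python
import re


def normalize(text):
    """Lowercase the text and collapse each whitespace run into one space."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def decontaminate(samples, benchmark_strings, trivial_strings):
    """Yield only samples that contain no non-trivial benchmark string."""
    trivial = {normalize(t) for t in trivial_strings}
    # Normalize references once instead of once per training sample.
    refs = [normalize(r) for r in benchmark_strings]
    refs = [r for r in refs if r and r not in trivial]
    for sample in samples:
        normalized_sample = normalize(sample)
        if not any(ref in normalized_sample for ref in refs):
            yield sample  # keep only clean samples


# Invented example data (not taken from any real benchmark).
benchmark = ["Return True when n is divisible by two.", "return x + y"]
trivial = ["return x + y"]
training = [
    # Same docstring with different casing and line breaks: contaminated.
    'def is_even(n):\n    """Return TRUE when n is\n    divisible   by two."""\n    return n % 2 == 0',
    # Contains only a whitelisted idiom: kept.
    "def add(x, y):\n    return x + y",
    # No overlap with the benchmark: kept.
    'def greet(name):\n    return "hello " + name',
]

clean = list(decontaminate(training, benchmark, trivial))
```

The first sample is removed even though its casing and line breaks differ from the reference docstring, while the second survives because its only overlap with the benchmark is a whitelisted pattern.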