Principle: Allenai Open-Instruct Data Decontamination
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Evaluation |
| Last Updated | 2026-02-07 02:00 GMT |
Overview
The principle of detecting and removing evaluation benchmark data from training datasets so that reported model evaluation results remain valid and unbiased.
Description
Data decontamination addresses the critical problem of test set leakage in large-scale language model training. When training data overlaps with evaluation benchmarks, reported performance metrics become inflated and unreliable. The decontamination process involves two phases: (1) indexing training data into a searchable form, and (2) querying that index with evaluation test sets to identify and quantify contamination. Multiple matching strategies exist with different precision/recall tradeoffs: exact string matching (high precision, low recall), n-gram overlap (balanced), and semantic vector similarity (high recall, lower precision). The detected contamination can be used to produce cleaned training datasets or to report contamination rates alongside evaluation results.
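As an illustration, here is a minimal sketch of the exact-string-matching strategy, assuming simple whitespace-and-case normalization; the helper names `normalize` and `find_exact_contamination` are hypothetical, not taken from the open-instruct codebase:

```python
# Exact-string-match decontamination (high precision, low recall):
# flag a training example if any normalized test prompt occurs in it
# as a verbatim substring.
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not hide an exact copy.
    return " ".join(text.lower().split())

def find_exact_contamination(training_data: list[str],
                             test_set: list[str]) -> set[int]:
    norm_tests = [normalize(t) for t in test_set]
    contaminated = set()
    for i, doc in enumerate(training_data):
        norm_doc = normalize(doc)
        if any(t in norm_doc for t in norm_tests):
            contaminated.add(i)
    return contaminated
```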
Usage
Apply this principle when preparing training datasets for instruction tuning, and especially before reporting benchmark results. It is essential for maintaining evaluation integrity in the Tulu model family and in any research that claims performance improvements on standard benchmarks.
Theoretical Basis
The contamination detection process uses information retrieval techniques:
For n-gram matching, contamination is measured as token coverage, i.e., the fraction of a test instance's tokens that fall inside n-grams also found in the training data:

$$\mathrm{coverage}(t) = \frac{\left|\{\text{tokens of } t \text{ covered by a matched } n\text{-gram}\}\right|}{\left|\text{tokens of } t\right|}$$

For vector matching, semantic similarity is computed as the cosine similarity of embeddings:

$$\mathrm{sim}(t, d) = \frac{E(t) \cdot E(d)}{\lVert E(t) \rVert \, \lVert E(d) \rVert}$$

where E is an embedding model (e.g., NV-Embed-v2), t is an evaluation test instance, and d is a training document.
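Both scores are straightforward to compute; below is a sketch assuming whitespace tokenization, with `ngram_coverage` and `cosine_similarity` as hypothetical helper names and the embedding vectors assumed to come from whichever model plays the role of E:

```python
import numpy as np

def ngram_coverage(test_text: str, train_text: str, n: int = 8) -> float:
    # Fraction of the test instance's tokens covered by n-grams that
    # also appear in the training document.
    test_tokens = test_text.split()
    train_tokens = train_text.split()
    train_ngrams = {tuple(train_tokens[i:i + n])
                    for i in range(len(train_tokens) - n + 1)}
    covered = [False] * len(test_tokens)
    for i in range(len(test_tokens) - n + 1):
        if tuple(test_tokens[i:i + n]) in train_ngrams:
            covered[i:i + n] = [True] * n
    return sum(covered) / max(len(test_tokens), 1)

def cosine_similarity(e_test: np.ndarray, e_train: np.ndarray) -> float:
    # sim(t, d) for embedding vectors E(t) and E(d).
    return float(e_test @ e_train /
                 (np.linalg.norm(e_test) * np.linalg.norm(e_train)))
```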
Pseudo-code Logic:
```python
# Abstract decontamination pipeline
index = build_search_index(training_data)      # phase 1: index training data
contaminated_ids = set()
for test_instance in evaluation_benchmarks:    # phase 2: query with test sets
    # search returns (train_id, score) pairs under the chosen strategy:
    # "exact", "ngram", or "vector"
    for train_id, score in search(index, test_instance, strategy="ngram"):
        if score > threshold:
            contaminated_ids.add(train_id)
clean_data = [ex for i, ex in enumerate(training_data)
              if i not in contaminated_ids]
```
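To ground the abstract pipeline, the following self-contained sketch implements the n-gram variant with an inverted index; n = 8 and threshold = 0.5 are illustrative defaults, not values prescribed by open-instruct:

```python
from collections import defaultdict

def build_ngram_index(training_data: list[str], n: int = 8):
    # Inverted index: n-gram -> ids of training examples containing it.
    index = defaultdict(set)
    for doc_id, text in enumerate(training_data):
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(doc_id)
    return index

def decontaminate(training_data, evaluation_benchmarks,
                  n: int = 8, threshold: float = 0.5):
    index = build_ngram_index(training_data, n)
    contaminated = set()
    for test_text in evaluation_benchmarks:
        tokens = test_text.split()
        total = max(len(tokens) - n + 1, 1)
        hits = defaultdict(int)  # doc_id -> count of shared n-grams
        for i in range(len(tokens) - n + 1):
            for doc_id in index.get(tuple(tokens[i:i + n]), ()):
                hits[doc_id] += 1
        contaminated |= {d for d, c in hits.items() if c / total > threshold}
    return [ex for i, ex in enumerate(training_data) if i not in contaminated]
```

The same skeleton accommodates the exact and vector strategies by swapping the index structure and the per-instance scoring function.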