
Principle:Allenai Open instruct Data Decontamination

From Leeroopedia


Knowledge Sources
Domains: Data_Quality, Evaluation
Last Updated: 2026-02-07 02:00 GMT

Overview

Principle of detecting and removing evaluation benchmark data from training datasets to ensure valid and unbiased model evaluation results.

Description

Data decontamination addresses the critical problem of test set leakage in large-scale language model training. When training data overlaps with evaluation benchmarks, reported performance metrics become inflated and unreliable. The decontamination process involves two phases: (1) indexing training data into a searchable form, and (2) querying that index with evaluation test sets to identify and quantify contamination. Multiple matching strategies exist with different precision/recall tradeoffs: exact string matching (high precision, low recall), n-gram overlap (balanced), and semantic vector similarity (high recall, lower precision). The detected contamination can be used to produce cleaned training datasets or to report contamination rates alongside evaluation results.
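The n-gram overlap strategy described above can be sketched in a few lines. This is an illustrative implementation, not the open-instruct codebase's API: the function names, the whitespace tokenizer, and the default n-gram size of 8 are assumptions made for the example.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_coverage(query, doc, n=8):
    """Fraction of query tokens covered by some n-gram also present in doc.

    A token counts as covered if at least one n-gram containing it
    appears verbatim in the document.
    """
    q_tokens = query.lower().split()
    if not q_tokens:
        return 0.0
    d_grams = ngrams(doc.lower().split(), n)
    covered = set()
    for i in range(len(q_tokens) - n + 1):
        if tuple(q_tokens[i:i + n]) in d_grams:
            covered.update(range(i, i + n))  # mark every token in the matched n-gram
    return len(covered) / len(q_tokens)
```

A coverage score near 1.0 indicates the test question appears nearly verbatim in the training document, while exact matching would miss even a single-token paraphrase; this is the precision/recall tradeoff noted above.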

Usage

Apply this principle when preparing training datasets for instruction tuning, especially before reporting benchmark results. It is essential for maintaining evaluation integrity in the Tulu model family and any research that claims performance improvements on standard benchmarks.

Theoretical Basis

The contamination detection process uses information retrieval techniques:

\[
\text{Contamination}_{\text{exact}}(q, D) =
\begin{cases}
1 & \text{if } \exists\, d \in D : q \subseteq d \\
0 & \text{otherwise}
\end{cases}
\]

For n-gram matching, contamination is measured as token coverage:
\[
\text{Coverage}(q, d) = \frac{\bigl|\{\, t \in q : \text{an } n\text{-gram containing } t \text{ is found in } d \,\}\bigr|}{|q|}
\]

For vector matching, semantic similarity is computed via:
\[
\text{Similarity}(q, d) = \frac{E(q) \cdot E(d)}{\lVert E(q) \rVert \, \lVert E(d) \rVert}
\]

where E is an embedding model (e.g., NV-Embed-v2).
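The cosine similarity above is straightforward to compute once embeddings are available. The sketch below uses NumPy on toy vectors; in a real pipeline the vectors would come from an embedding model such as NV-Embed-v2, which is not loaded here.

```python
import numpy as np

def cosine_similarity(e_q, e_d):
    """Cosine similarity between two embedding vectors (range [-1, 1])."""
    e_q, e_d = np.asarray(e_q, dtype=float), np.asarray(e_d, dtype=float)
    return float(np.dot(e_q, e_d) / (np.linalg.norm(e_q) * np.linalg.norm(e_d)))
```

Because embeddings capture meaning rather than surface form, a paraphrased test question can still score highly against its training-set source, which is why this strategy has high recall but lower precision.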

Pseudo-code Logic:

# Abstract decontamination pipeline
index = build_search_index(training_data)
contaminated = set()
for test_instance in evaluation_benchmarks:
    matches = search(index, test_instance, strategy="ngram|exact|vector")
    for train_instance, score in matches:
        if score > threshold:
            contaminated.add(train_instance)  # flag the training example, not the test item
clean_data = remove_contaminated_instances(training_data, contaminated)
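The abstract pipeline above can be made concrete with the simplest strategy, exact string matching. This is a minimal runnable sketch under assumed names; the real open-instruct tooling builds a proper search index rather than a Python dict, and normalization here is just lowercasing and whitespace stripping.

```python
def build_search_index(training_data):
    """Map normalized training text to the indices of matching examples."""
    index = {}
    for i, example in enumerate(training_data):
        index.setdefault(example.strip().lower(), []).append(i)
    return index

def decontaminate(training_data, evaluation_benchmarks):
    """Return (clean_data, contaminated_indices) via exact-match lookup."""
    index = build_search_index(training_data)
    contaminated = set()
    for test_instance in evaluation_benchmarks:
        contaminated.update(index.get(test_instance.strip().lower(), []))
    clean_data = [ex for i, ex in enumerate(training_data) if i not in contaminated]
    return clean_data, contaminated
```

The returned index set also gives the contamination rate directly (`len(contaminated) / len(training_data)`), supporting the reporting use case mentioned in the Description.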
