Principle:Princeton nlp SimPO Benchmark Decontamination

From Leeroopedia


Knowledge Sources
Domains Data_Quality, Evaluation, NLP
Last Updated 2026-02-08 04:30 GMT

Overview

A data filtering technique that removes benchmark evaluation content from training datasets to prevent artificially inflated evaluation scores.

Description

Benchmark decontamination is a data quality practice that detects and removes training samples containing content from evaluation benchmarks. When training data overlaps with an evaluation benchmark (e.g., HumanEval, MBPP), the model can memorize the benchmark's answers, producing inflated scores that do not reflect genuine capability. Decontamination addresses this by scanning each training sample for substring matches against known benchmark content (docstrings, prompts, canonical solutions) and excluding any sample that contains a match. A whitelist of trivially simple patterns (e.g., return x + y) avoids over-filtering common idioms that appear in both benchmarks and legitimate training data.
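
As a rough illustration of the reference set described above, the sketch below collects docstrings and canonical solutions from benchmark-style records and drops whitelisted trivial strings. The record layout (prompt, canonical_solution), the toy records, the whitelist contents, and the helper name build_reference_strings are assumptions made for this example, not taken from any specific pipeline.

```python
import re

# Toy records shaped like code-benchmark problems (hypothetical data).
BENCHMARK_RECORDS = [
    {
        "prompt": 'def add(x, y):\n    """Add two numbers x and y."""\n',
        "canonical_solution": "    return x + y\n",
    },
    {
        "prompt": 'def is_palindrome(text):\n    """Check if the given string is a palindrome."""\n',
        "canonical_solution": "    return text == text[::-1]\n",
    },
]

# Idioms too common to count as evidence of contamination (assumed whitelist).
TRIVIAL_STRINGS = {"return x + y"}

# Matches the body of a triple-double-quoted docstring, across line breaks.
DOCSTRING_RE = re.compile(r'"""(.*?)"""', re.DOTALL)


def build_reference_strings(records, trivial_strings):
    """Collect stripped docstrings and solutions, skipping trivial ones."""
    references = []
    for record in records:
        candidates = DOCSTRING_RE.findall(record["prompt"])
        candidates.append(record["canonical_solution"])
        for candidate in candidates:
            candidate = candidate.strip()
            # Skip empty strings: an empty reference would match every sample.
            if candidate and candidate not in trivial_strings:
                references.append(candidate)
    return references


REFERENCES = build_reference_strings(BENCHMARK_RECORDS, TRIVIAL_STRINGS)
```

Here the first record contributes only its docstring, because its solution is on the whitelist. The second record contributes both its docstring and its solution.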

Usage

Apply this principle when curating training data for supervised fine-tuning or preference optimization where the model will later be evaluated on code generation benchmarks such as HumanEval. It is essential for any training pipeline that draws from broad web-scraped or code-based corpora where benchmark content may inadvertently appear.
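
For preference optimization data, one plausible way to apply the principle is to drop a whole pair when any of its text fields contains benchmark content. The sketch below assumes records with prompt, chosen, and rejected string fields. Those field names, the sample data, and the helper names are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace runs into single spaces."""
    return " ".join(text.lower().split())


def is_contaminated(text, references):
    """Return True if any normalized reference occurs inside the text."""
    haystack = normalize(text)
    return any(normalize(ref) in haystack for ref in references)


def filter_preference_pairs(pairs, references,
                            fields=("prompt", "chosen", "rejected")):
    """Keep a pair only if none of its text fields contains a reference."""
    return [
        pair for pair in pairs
        if not any(is_contaminated(pair[field], references) for field in fields)
    ]


# Hypothetical reference string and preference data.
references = ["Check if the given string is a palindrome."]
pairs = [
    {
        "prompt": "Write a function that reverses a list.",
        "chosen": "def rev(xs):\n    return xs[::-1]",
        "rejected": "def rev(xs):\n    return xs",
    },
    {
        "prompt": "Complete this function.",
        "chosen": 'def f(s):\n    """Check if the given   string is a PALINDROME."""\n'
                  "    return s == s[::-1]",
        "rejected": "def f(s):\n    pass",
    },
]

clean_pairs = filter_preference_pairs(pairs, references)
```

Checking every field is a conservative choice: a benchmark docstring in the rejected response still exposes the model to benchmark text during training.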

Theoretical Basis

The core mechanism is substring containment checking with normalization:

Pseudo-code Logic:

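# Abstract decontamination algorithm (sketch; the two load_* helpers and
# define_trivial_patterns are placeholders for pipeline-specific code)
def normalize(text):
    # Lowercase, then collapse every whitespace run into a single space
    return " ".join(text.lower().split())

def decontaminate(training_data, benchmark_strings, trivial_strings):
    # Normalize each non-trivial reference once, outside the sample loop
    references = [
        normalize(ref_string)
        for ref_string in benchmark_strings
        if ref_string not in trivial_strings
    ]
    for sample in training_data:
        normalized_sample = normalize(sample)
        if not any(ref in normalized_sample for ref in references):
            yield sample  # Keep only clean samples

benchmark_strings = load_benchmark_docstrings() + load_benchmark_solutions()
trivial_strings = define_trivial_patterns()
clean_data = list(decontaminate(training_data, benchmark_strings, trivial_strings))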

Key design decisions:

  • Case-insensitive matching reduces false negatives caused by capitalization differences
  • Whitespace normalization collapses differences in indentation, line breaks, and spacing into a canonical form
  • Trivial pattern whitelist prevents removal of samples that merely contain ubiquitous code idioms that also appear in benchmarks
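
The effect of each decision above can be checked on small inputs. The following is a minimal sketch with made-up strings. The normalize helper is one possible reading of "case-insensitive matching plus whitespace normalization", not a confirmed implementation.

```python
def normalize(text):
    """Lowercase and collapse whitespace runs into single spaces."""
    return " ".join(text.lower().split())


reference = "Return the sum of two numbers."

# Same benchmark text, altered only in capitalization or spacing.
sample_case = "# RETURN THE SUM OF TWO NUMBERS."
sample_spacing = "return the sum\n    of two   numbers."
samples = (sample_case, sample_spacing)

# Exact substring matching misses both variants ...
raw_hits = [reference in s for s in samples]
# ... while matching on normalized text catches them.
normalized_hits = [normalize(reference) in normalize(s) for s in samples]

# Whitelist: a trivial idiom should not flag ordinary code.
trivial = {"return x + y"}
all_references = ["return x + y", reference]
active_references = [r for r in all_references if r not in trivial]

ordinary_sample = "def add(x, y):\n    return x + y"
flagged_without_whitelist = any(
    normalize(r) in normalize(ordinary_sample) for r in all_references
)
flagged_with_whitelist = any(
    normalize(r) in normalize(ordinary_sample) for r in active_references
)
```

Without normalization both altered samples slip through. Without the whitelist, an ordinary two-line function is discarded as contaminated.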
