Principle:Huggingface Datatrove NGram Decontamination

Knowledge Sources	Huggingface_Datatrove
Domains	Data Quality, NLP
Last Updated	2026-02-14 17:00 GMT

Overview

N-gram decontamination is the process of detecting and removing training data that overlaps with evaluation benchmark examples, preventing data leakage that would inflate benchmark scores.

Description

Benchmark contamination occurs when a language model's training data contains text from evaluation benchmarks. If the model has memorized answers from the benchmark during training, its evaluation scores will be artificially inflated and no longer reflect genuine generalization ability. This is a well-documented problem in modern NLP, particularly as training corpora grow to encompass large portions of the internet.

N-gram decontamination addresses this by identifying shared n-gram sequences between evaluation tasks and training documents. An n-gram of sufficient length (typically 12 or more words) appearing in both the benchmark answer and a training document is strong evidence of contamination, as such long sequences are unlikely to occur by coincidence. The approach uses text normalization (lowercasing, punctuation removal, number normalization) before comparison to catch near-matches that differ only in surface formatting.

The method works in two phases: an indexing phase that extracts and hashes n-grams from all evaluation tasks, and a filtering phase that checks each training document's n-grams against the hash index. Storing fixed-size hashes instead of strings reduces memory usage and allows expected constant-time lookups, making the approach practical for billions of training documents.
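
The two phases can be sketched in a few lines of Python. This is a minimal illustration and not the Datatrove implementation: the function names (ngram_hashes, build_index, filter_corpus), the use of BLAKE2b as the 64-bit hash, and the tiny n=3 are all assumptions chosen to keep the example short.

```python
import hashlib

N = 3  # tiny n for illustration only; the text suggests 12 or more in practice


def ngram_hashes(text, n=N):
    """Return the set of 64-bit hashes of all word n-grams in text."""
    words = text.lower().split()
    hashes = set()
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        # Assumption: BLAKE2b truncated to 8 bytes stands in for any 64-bit hash.
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def build_index(eval_texts, n=N):
    """Indexing phase: collect the n-gram hashes of every evaluation text."""
    index = set()
    for text in eval_texts:
        index |= ngram_hashes(text, n)
    return index


def filter_corpus(documents, index, n=N):
    """Filtering phase: keep only documents that share no n-gram hash with the index."""
    return [doc for doc in documents if ngram_hashes(doc, n).isdisjoint(index)]


eval_texts = ["the capital of france is paris"]
corpus = [
    "Trivia night: the capital of France is Paris, of course.",
    "Bread recipes usually start with flour, water, and yeast.",
]

index = build_index(eval_texts)
clean = filter_corpus(corpus, index)
```

Here the first corpus document shares the 3-gram "the capital of" with the evaluation text and is dropped, while the second is kept. The index is built once and reused for every document, which is what lets the filtering phase scale with corpus size.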

Usage

Apply n-gram decontamination whenever preparing training data for a language model that will be evaluated on standard benchmarks. Run the indexing phase on all evaluation task data, then filter the training corpus before training begins.

Theoretical Basis

N-gram Overlap Detection: Given a training document D and an evaluation example E, the document is considered contaminated if any contiguous sequence of n tokens from E also appears in D after normalization. The choice of n involves a precision-recall tradeoff: smaller n catches more contamination but increases false positives; larger n is more precise but may miss paraphrased contamination. A value of n=12 is commonly used as a good balance.
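
The effect of n can be seen with a small sketch that compares word n-grams directly (no hashing). The helper names and the example sentences are made up for illustration, and the texts are assumed to be already normalized.

```python
def word_ngrams(text, n):
    """Return the set of contiguous word n-grams in text, as tuples."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(document, eval_example, n):
    """True if any n-gram of the evaluation example also occurs in the document."""
    return not word_ngrams(eval_example, n).isdisjoint(word_ngrams(document, n))


E = "water boils at one hundred degrees celsius at sea level"

# A verbatim copy of E embedded in other text.
D_copy = "fun fact water boils at one hundred degrees celsius at sea level indeed"
# A paraphrase that reorders the same facts.
D_para = "at sea level water reaches its boiling point at one hundred degrees celsius"
# An unrelated sentence that happens to share a short common phrase.
D_unrelated = "the harbor sits at sea level near the old town"

for n in (3, 10):
    print(n,
          is_contaminated(D_copy, E, n),
          is_contaminated(D_para, E, n),
          is_contaminated(D_unrelated, E, n))
```

With n=3 all three documents are flagged, including the unrelated one (a false positive caused by the phrase "at sea level"). With n=10 only the verbatim copy is flagged, and the paraphrase is missed.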

Hash-Based Indexing: Rather than storing full n-gram strings, each n-gram is hashed to a fixed-size integer (typically 64-bit). This reduces storage from roughly n * k bytes per n-gram (where k is the average token length in bytes) to a constant 8 bytes per n-gram. Hash collisions are possible in theory, but with 64-bit hashes their probability is negligible for typical evaluation set sizes.
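
As a rough illustration of the storage argument, the sketch below hashes one 12-word n-gram to a 64-bit integer and compares sizes. The hash function (BLAKE2b truncated to 8 bytes) and the helper names are assumptions for the example; the false-match estimate additionally assumes the hash behaves like a uniform random function.

```python
import hashlib
import struct


def hash64(ngram):
    """Hash an n-gram string to an unsigned 64-bit integer."""
    digest = hashlib.blake2b(ngram.encode("utf-8"), digest_size=8).digest()
    return struct.unpack("<Q", digest)[0]


def per_lookup_false_match(index_size, bits=64):
    """Chance that one unrelated n-gram hash hits some entry of the index,
    assuming uniformly distributed hash values."""
    return index_size / 2 ** bits


ngram = "the quick brown fox jumps over the lazy dog and runs away"  # 12 words

raw_bytes = len(ngram.encode("utf-8"))  # size of the n-gram stored as a string
hashed_bytes = 8                        # size of the same n-gram stored as a 64-bit hash

print(raw_bytes, hashed_bytes, hash64(ngram))
print(per_lookup_false_match(10 ** 6))  # index holding one million n-gram hashes
```

The string form grows with n and with token length, while the hashed form stays at 8 bytes regardless of either. Packed storage (for example a NumPy uint64 array or a binary file) is needed to actually reach 8 bytes per entry; a Python set of ints carries extra per-object overhead.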

Query-Label Overlap: Evaluation examples have two parts: a query/prompt and a label/answer. Three types of n-grams can be checked:

Label n-grams: Sequences entirely within the answer (always checked)
Query n-grams: Sequences entirely within the prompt (optional, usually disabled)
Overlap n-grams: Sequences spanning the boundary between prompt and answer (usually enabled)

Overlap n-grams are important because they catch cases where the training data contains the full benchmark example (prompt followed by answer) rather than just the answer in isolation.
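
The three n-gram types can be made concrete with a short sketch. The function task_ngrams and its flags include_query and include_overlap are hypothetical names for this example (with defaults mirroring the "usually disabled" and "usually enabled" settings described above), and n=3 is used only to keep the output readable.

```python
def task_ngrams(query, label, n, include_query=False, include_overlap=True):
    """Collect the word n-grams of one evaluation example, by type."""
    q = query.split()
    lab = label.split()
    grams = set()

    # Label n-grams: windows entirely inside the answer (always collected).
    for i in range(len(lab) - n + 1):
        grams.add(tuple(lab[i:i + n]))

    # Query n-grams: windows entirely inside the prompt (optional).
    if include_query:
        for i in range(len(q) - n + 1):
            grams.add(tuple(q[i:i + n]))

    # Overlap n-grams: windows that start in the prompt and end in the answer.
    if include_overlap:
        combined = q + lab
        for start in range(max(0, len(q) - n + 1), len(q)):
            window = combined[start:start + n]
            if len(window) == n:
                grams.add(tuple(window))

    return grams


query = "what is the capital of france"
label = "paris is the capital"

default_grams = task_ngrams(query, label, n=3)
all_grams = task_ngrams(query, label, n=3, include_query=True)
```

With the default flags, default_grams contains the label windows such as ("paris", "is", "the") and the boundary windows ("of", "france", "paris") and ("france", "paris", "is"), but not prompt-only windows such as ("capital", "of", "france"). A training document that reproduces the prompt immediately followed by the answer would match one of the boundary windows even if the answer alone is too short to yield a label n-gram.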

Text Normalization for Recall: Normalization increases recall by ensuring that superficial differences (capitalization, extra spaces, different number formatting) do not cause missed matches. The tradeoff is a slight increase in false positives, which is generally acceptable since over-filtering is preferred to under-filtering for contamination.
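
A minimal normalization sketch along these lines is shown below. The exact rules are assumptions for illustration (in particular, mapping every digit to 0 is just one possible form of number normalization) and may differ from what a given library does.

```python
import re
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, normalize digits, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d", "0", text)            # assumption: every digit becomes 0
    text = text.translate(_PUNCT_TABLE)        # drop punctuation such as , . ! ?
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text


a = normalize("The answer is 1,024!")
b = normalize("the  ANSWER is 2048")
print(a)
print(b)
```

Both inputs normalize to the same string, so differences in capitalization, spacing, and number formatting no longer block a match. The same example also shows the cost: 1,024 and 2048 are different numbers, yet they become indistinguishable, which is the kind of false positive the paragraph above accepts in exchange for higher recall.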

Related Pages

Implementation:Huggingface_Datatrove_NGramDecontamination
