Principle:Huggingface Datatrove Contamination Statistics
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Statistics |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Contamination Statistics is the principle of measuring the presence and frequency of specific indicator words within documents to detect data contamination or benchmark leakage.
Description
In the context of training large language models, data contamination refers to the inadvertent inclusion of evaluation benchmark data (or close paraphrases thereof) in training corpora. When a model is trained on data that overlaps with its evaluation sets, benchmark scores become inflated and unreliable. Contamination statistics provide a quantitative measure of this risk by scanning documents for the presence of known contamination indicator words.
The approach involves tokenizing document text with a language-appropriate word tokenizer, optionally normalizing the text first (e.g., lowercasing, removing accents) to increase recall, and then computing the ratio of matching tokens to total tokens. This produces a per-document contamination score that can be aggregated across groups (by domain, by TLD suffix) or analyzed as a distribution (via histograms) to identify contamination hotspots in the corpus.
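The scoring and aggregation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the datatrove implementation: the function names (`normalize`, `tokenize`, `contamination_ratio`, `mean_ratio_by_group`) are hypothetical, the regex tokenizer stands in for a language-appropriate word tokenizer, and the normalizer only lowercases and strips accents.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and strip accents (a simplified, assumed normalizer)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Naive word tokenizer; a real pipeline would use a language-aware one."""
    return re.findall(r"\w+", text)


def contamination_ratio(text, indicator_words, normalize_text=True) -> float:
    """Ratio of tokens that match an indicator word to total tokens."""
    words = set(indicator_words)
    if normalize_text:
        text = normalize(text)
        words = {normalize(w) for w in words}
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # avoid division by zero on empty documents
    matches = sum(1 for token in tokens if token in words)
    return matches / len(tokens)


def mean_ratio_by_group(docs, indicator_words) -> dict:
    """Average the per-document score per group key (e.g. a domain).

    `docs` is an iterable of (group_key, text) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of scores, count]
    for group, text in docs:
        totals[group][0] += contamination_ratio(text, indicator_words)
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}
```

For example, `contamination_ratio("The Café is open", {"cafe"})` matches one of four tokens when normalization is on, and matches none when `normalize_text=False`, which shows how normalization raises recall.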
Usage
Apply this principle when you are curating training datasets for language models and need to check whether benchmark or evaluation data has leaked into the training corpus. It also applies to measuring the prevalence of any targeted vocabulary across a dataset.
Theoretical Basis
Key concepts in contamination statistics include:
- Word-level matching: Contamination is detected at the word level rather than the character or n-gram level. This balances precision (an indicator word never matches inside a longer word) against recall (case and accent variants still match once the text is normalized).
- Text normalization: Applying normalization before matching (lowercasing, accent removal, punctuation stripping) ensures that trivial text variations do not cause contamination to go undetected.
- Frequency ratio: The contamination score is expressed as the ratio of matching words to total words, providing a normalized metric that is comparable across documents of different lengths.
- Indicator word selection: The choice of contamination words is application-dependent. Common choices include words specific to benchmark datasets (e.g., unique named entities or task-specific vocabulary) that would be unlikely to appear in natural web text.
- Language-aware tokenization: Using language-specific word tokenizers ensures correct word boundary detection across languages with different whitespace and punctuation conventions.
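To illustrate why word-level matching avoids partial matches, the following sketch compares a raw substring count with a token-level count for a single indicator word. The helper names `substring_hits` and `word_hits` are hypothetical, and the regex tokenizer is an assumption standing in for a language-specific tokenizer.

```python
import re


def substring_hits(text: str, indicator: str) -> int:
    """Count occurrences of the indicator anywhere, including inside longer words."""
    return len(re.findall(re.escape(indicator), text.lower()))


def word_hits(text: str, indicator: str) -> int:
    """Count only whole tokens that equal the indicator."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token == indicator)


sample = "Concatenate the category list; the cat sat."
substring_count = substring_hits(sample, "cat")  # also hits "concatenate" and "category"
word_count = word_hits(sample, "cat")            # hits only the standalone word "cat"
```

Here the substring count is inflated by "Concatenate" and "category", while the word-level count reflects only the standalone occurrence, which is the behavior the principle relies on.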