Implementation:Huggingface Datatrove ContaminationStats
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Statistics |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
WordsContaminationStats is a statistics pipeline step that measures the frequency of specified contamination words within documents.
Description
WordsContaminationStats extends BaseStats to detect and quantify word-level contamination in documents. Given a list of target words, it tokenizes each document using a language-specific word tokenizer and computes the fraction of tokens that match any of the contamination words. This is useful for identifying documents that contain benchmark-specific terms, test set leakage indicators, or other undesirable vocabulary.
The class applies text normalization via a configurable TextNormConfig before tokenization, which can include lowercasing, accent removal, and other normalization steps. This ensures that contamination detection is robust to superficial text variations. The word tokenizer is loaded dynamically based on the specified language parameter.
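To illustrate why normalization matters for matching, here is a minimal sketch using only the Python standard library. The helper `normalize_text` and its parameters are hypothetical stand-ins for the normalization step, not datatrove's own implementation; the actual behavior depends on the fields set in TextNormConfig.

```python
import unicodedata


def normalize_text(text: str, lowercase: bool = True, strip_accents: bool = True) -> str:
    """Hypothetical stand-in for the normalization step (not datatrove code)."""
    if lowercase:
        text = text.lower()
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return text


print(normalize_text("Évaluation BENCHMARK"))  # evaluation benchmark
```

Without such a step, "BENCHMARK" and "benchmark" would be treated as different tokens and the contamination ratio would be underestimated.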
The output statistic is named words_contamination_{first_word}, where first_word is the first entry of the contamination list. That entry serves only as a label; every word in the list counts toward the value. The value is the ratio of contamination word occurrences to the total word count, ranging from 0.0 (no contamination) to 1.0 (every word is a contamination word).
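The computation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the function name `words_contamination` is hypothetical, lowercasing plus whitespace splitting stands in for TextNormConfig and the language-specific tokenizer, and returning 0.0 for an empty document is a choice made for this sketch only.

```python
def words_contamination(text: str, words: list[str]) -> dict[str, float]:
    """Fraction of tokens in `text` that appear in `words` (illustrative sketch)."""
    if not words:
        raise ValueError("At least one word must be provided")
    key = f"words_contamination_{words[0]}"  # label uses the first word only
    # Assumption: lowercase + whitespace split replaces the real normalizer/tokenizer.
    tokens = text.lower().split()
    if not tokens:
        return {key: 0.0}  # assumption of this sketch: empty documents score 0.0
    targets = set(words)
    hits = sum(1 for token in tokens if token in targets)
    return {key: hits / len(tokens)}


print(words_contamination("The benchmark test passed", ["benchmark", "test"]))
# {'words_contamination_benchmark': 0.5}
```

Two of the four tokens are in the list, so the ratio is 0.5, and the key is labeled with "benchmark" even though "test" also contributes to the count.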
Usage
Use WordsContaminationStats when you need to measure how prevalent specific words are in your dataset, particularly for detecting benchmark contamination, test set leakage, or the spread of particular vocabulary items across a corpus. Because matching is performed token by token, each entry in the list should be a single word rather than a multi-word phrase.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/stats/contamination_stats.py
- Lines: 1-50
Signature
class WordsContaminationStats(BaseStats):
    name = "😷 Words contamination"

    def __init__(
        self,
        output_folder: DataFolderLike,
        words: list[str],
        norm_config: TextNormConfig = TextNormConfig(),
        language: str = Languages.english,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        ...

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...
Import
from datatrove.pipeline.stats.contamination_stats import WordsContaminationStats
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Folder where computed statistics will be saved |
| words | list[str] | Yes | List of contamination words to search for; must contain at least one word |
| norm_config | TextNormConfig | No | Text normalization configuration applied before tokenization (default: TextNormConfig()) |
| language | str | No | Language for word tokenization (default: Languages.english) |
| groups_to_compute | list[GROUP] | No | Grouping strategies for statistics (default: all groups) |
| histogram_round_digits | int | No | Decimal digits for histogram rounding (default: 3) |
| top_k_config | TopKConfig | No | Top-K configuration for high-cardinality groups (default: DEFAULT_TOP_K_CONFIG) |
Outputs
| Name | Type | Description |
|---|---|---|
| words_contamination_{first_word} | float | Ratio of contamination word occurrences to total words in the document (0.0 to 1.0); first_word is the first entry of words |
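As a rough illustration of how histogram_round_digits affects the histogram group, the sketch below buckets per-document ratios by their rounded value. It assumes the histogram simply counts documents per rounded value; the helper `ratio_histogram` is hypothetical and does not reproduce the BaseStats implementation.

```python
from collections import Counter


def ratio_histogram(ratios: list[float], round_digits: int = 3) -> Counter:
    """Count documents per rounded contamination ratio (illustrative sketch)."""
    # Rounding merges nearly identical ratios into one bucket,
    # which keeps the number of distinct histogram keys small.
    return Counter(round(ratio, round_digits) for ratio in ratios)


hist = ratio_histogram([0.0, 0.0004, 0.25, 0.2501, 0.5])
print(hist[0.0], hist[0.25], hist[0.5])  # 2 2 1
```

A smaller round_digits value yields coarser buckets and a more compact histogram, at the cost of resolution for very low contamination ratios.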
Usage Examples
Basic Usage
from datatrove.pipeline.stats.contamination_stats import WordsContaminationStats

# Detect benchmark contamination words.
# The statistic is keyed by the first word: words_contamination_benchmark
stats = WordsContaminationStats(
    output_folder="output/stats/",
    words=["benchmark", "test", "evaluation"],
)
With Custom Normalization
from datatrove.pipeline.stats.contamination_stats import WordsContaminationStats
from datatrove.utils.text import TextNormConfig
from datatrove.utils.typeshelper import Languages

# Normalize the document text before tokenization and
# compute only the summary and histogram groups.
stats = WordsContaminationStats(
    output_folder="output/stats/",
    words=["lorem", "ipsum"],
    norm_config=TextNormConfig(lowercase=True),
    language=Languages.english,
    groups_to_compute=["summary", "histogram"],
)