Implementation:Huggingface Datatrove ContaminationStats

Knowledge Sources	Huggingface_Datatrove
Domains	Data Quality, Statistics
Last Updated	2026-02-14 17:00 GMT

Overview

WordsContaminationStats is a statistics pipeline step that measures what fraction of each document's words belong to a user-specified list of contamination words.

Description

WordsContaminationStats extends BaseStats to detect and quantify word-level contamination in documents. Given a list of target words, it tokenizes each document using a language-specific word tokenizer and computes the fraction of tokens that match any of the contamination words. This is useful for identifying documents that contain benchmark-specific terms, test set leakage indicators, or other undesirable vocabulary.

The class applies text normalization via a configurable TextNormConfig before tokenization, which can include lowercasing, accent removal, and other normalization steps. This ensures that contamination detection is robust to superficial text variations. The word tokenizer is loaded dynamically based on the specified language parameter.
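The effect of normalization on matching can be illustrated with a minimal standard-library sketch. The normalize helper below is hypothetical: it only approximates lowercasing and accent removal, and it is not datatrove's implementation and does not use TextNormConfig.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical stand-in for normalization: lowercase and strip accents."""
    # Decompose accented characters into base character + combining mark.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Surface variants of the same word collapse to one form after normalization,
# so a single entry in the contamination list can match all of them.
variants = ["Résumé", "resume", "RESUME"]
normalized = {normalize(v) for v in variants}
```

Because matching happens after normalization, the entries in the contamination list should be written in the same normalized form (for example, lowercase when lowercasing is enabled).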

The step emits a single statistic named words_contamination_{first_word}, where {first_word} is the first entry in the contamination list. That entry serves only as a label: the value counts matches against every word in the list. The value is the ratio of contamination word occurrences to total word count, ranging from 0.0 (no contamination) to 1.0 (all words are contamination words).
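The computation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: contamination_stat is a hypothetical function, lowercasing plus whitespace splitting stands in for the configured normalization and the language-specific tokenizer, and the empty-document guard is a choice made for the sketch.

```python
def contamination_stat(text: str, words: list[str]) -> dict[str, float]:
    """Sketch of the statistic: fraction of tokens found in `words`."""
    # Assumption: lowercase + whitespace split stands in for the configured
    # normalization and the language-specific word tokenizer.
    tokens = text.lower().split()
    targets = set(words)
    hits = sum(1 for token in tokens if token in targets)
    # Assumption: an empty document is reported as 0.0 here; this guard is a
    # choice made for the sketch, not documented behavior.
    ratio = hits / len(tokens) if tokens else 0.0
    # The first word in the list labels the statistic.
    return {f"words_contamination_{words[0]}": ratio}


result = contamination_stat("Lorem ipsum dolor sit amet", ["lorem", "ipsum"])
# 2 of the 5 tokens are in the list, so the ratio is 2 / 5 = 0.4
```

Note that the key mentions only "lorem" even though "ipsum" also contributes to the ratio, so the order of the words list determines the name under which the statistic is stored.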

Usage

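Use WordsContaminationStats when you need to measure how often specific words appear in your dataset, particularly for detecting benchmark contamination, test set leakage, or the prevalence of specific vocabulary items across a corpus. Because matching is performed per token, each entry in the list should be a single word; a multi-word phrase is not matched as a unit.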

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/stats/contamination_stats.py
Lines: 1-50

Signature

class WordsContaminationStats(BaseStats):
    name = "😷 Words contamination"

    def __init__(
        self,
        output_folder: DataFolderLike,
        words: list[str],
        norm_config: TextNormConfig = TextNormConfig(),
        language: str = Languages.english,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

Import

from datatrove.pipeline.stats.contamination_stats import WordsContaminationStats

I/O Contract

Inputs

Name	Type	Required	Description
output_folder	DataFolderLike	Yes	Folder where computed statistics will be saved
words	list[str]	Yes	List of contamination words to search for; must contain at least one word
norm_config	TextNormConfig	No	Text normalization configuration applied before tokenization (default: TextNormConfig())
language	str	No	Language for word tokenization (default: Languages.english)
groups_to_compute	list[GROUP]	No	Grouping strategies for statistics (default: all groups)
histogram_round_digits	int	No	Decimal digits for histogram rounding (default: 3)
top_k_config	TopKConfig	No	Top-K configuration for high-cardinality groups (default: DEFAULT_TOP_K_CONFIG)
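The requirement that words contain at least one entry follows from the way the output statistic is labeled by the first word. A minimal sketch of such a check is shown here; check_words is a hypothetical helper and the error message is illustrative, not taken from the library.

```python
def check_words(words: list[str]) -> str:
    """Hypothetical helper: reject an empty list and return the stat label."""
    if len(words) == 0:
        # Without a first word there is nothing to build the label from.
        raise ValueError("At least one word must be provided")
    return f"words_contamination_{words[0]}"


label = check_words(["benchmark", "test", "evaluation"])

# An empty list is rejected up front rather than failing later.
try:
    check_words([])
    rejected = False
except ValueError:
    rejected = True
```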

Outputs

Name	Type	Description
words_contamination_{first_word}	float	Ratio of contamination word occurrences to total words in the document (0.0 to 1.0); {first_word} is the first entry of words

Usage Examples

Basic Usage

from datatrove.pipeline.stats.contamination_stats import WordsContaminationStats

# Detect benchmark contamination words
stats = WordsContaminationStats(
    output_folder="output/stats/",
    words=["benchmark", "test", "evaluation"],
)

With Custom Normalization

from datatrove.pipeline.stats.contamination_stats import WordsContaminationStats
from datatrove.utils.text import TextNormConfig
from datatrove.utils.typeshelper import Languages

stats = WordsContaminationStats(
    output_folder="output/stats/",
    words=["lorem", "ipsum"],
    norm_config=TextNormConfig(lowercase=True),
    language=Languages.english,
    groups_to_compute=["summary", "histogram"],
)

Related Pages

Principle:Huggingface_Datatrove_Contamination_Statistics
