Implementation:Huggingface Datatrove ContaminationStats

From Leeroopedia
Knowledge Sources
Domains: Data Quality, Statistics
Last Updated: 2026-02-14 17:00 GMT

Overview

WordsContaminationStats is a statistics pipeline step that measures the frequency of specified contamination words within documents.

Description

WordsContaminationStats extends BaseStats to detect and quantify word-level contamination in documents. Given a list of target words, it tokenizes each document using a language-specific word tokenizer and computes the fraction of tokens that match any of the contamination words. This is useful for identifying documents that contain benchmark-specific terms, test set leakage indicators, or other undesirable vocabulary.
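
The fraction described above can be sketched in plain Python. This is a minimal illustration, not the library's code: `contamination_ratio` is a hypothetical helper, whitespace splitting stands in for the language-specific word tokenizer, and returning 0.0 for an empty document is an assumption made here to avoid dividing by zero.

```python
def contamination_ratio(tokens: list[str], words: list[str]) -> float:
    """Fraction of tokens that match any contamination word."""
    if not tokens:
        # Assumption for this sketch: an empty document counts as uncontaminated.
        return 0.0
    targets = set(words)  # set lookup keeps each membership test O(1)
    hits = sum(1 for token in tokens if token in targets)
    return hits / len(tokens)


# Whitespace splitting is a stand-in for the real word tokenizer.
tokens = "the benchmark test set leaked into the corpus".split()
ratio = contamination_ratio(tokens, ["benchmark", "test", "evaluation"])
print(ratio)  # 2 of 8 tokens match, so 0.25
```

Every matching occurrence counts, so a word repeated three times contributes three hits to the numerator.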

The class applies text normalization via a configurable TextNormConfig before tokenization, which can include lowercasing, accent removal, and other normalization steps. This ensures that contamination detection is robust to superficial text variations. The word tokenizer is loaded dynamically based on the specified language parameter.
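
To see why normalization matters, the sketch below compares match counts with and without it. The `normalize` function is a hypothetical stand-in built on the standard library (lowercasing plus accent stripping); it is not the TextNormConfig implementation and may differ from it in detail.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase the text and strip accents (illustrative stand-in only)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop combining marks left behind by the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = {"cafe"}
text = "Café CAFE cafe tea"

# Without normalization only the exact spelling matches.
raw_hits = sum(1 for token in text.split() if token in words)

# After normalization all three variants collapse to "cafe".
norm_hits = sum(1 for token in normalize(text).split() if token in words)

print(raw_hits, norm_hits)  # 1 3
```

For matches to line up, the contamination words themselves should be written in the same normalized form as the text they are compared against.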

The output statistic is named words_contamination_{first_word} where the first word in the contamination list is used as a label identifier. The value is the ratio of contamination word occurrences to total word count, ranging from 0.0 (no contamination) to 1.0 (all words are contamination words).
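
One consequence of this naming rule is that two word lists sharing the same first word map to the same statistic name. The sketch below illustrates that with a hypothetical `stat_name` helper; the empty-list check mirrors the documented requirement of at least one word, but the exception type used here is an assumption.

```python
def stat_name(words: list[str]) -> str:
    """Build the statistic name from the first contamination word."""
    if not words:
        # Assumption: the exact exception raised by the library may differ.
        raise ValueError("At least one word must be provided")
    return f"words_contamination_{words[0]}"


name_a = stat_name(["benchmark", "test", "evaluation"])
name_b = stat_name(["benchmark", "leaderboard"])

print(name_a)            # words_contamination_benchmark
print(name_a == name_b)  # True: both lists start with "benchmark"
```

When tracking several word lists side by side, giving each list a distinct first word keeps the resulting statistic names distinguishable.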

Usage

Use WordsContaminationStats when you need to measure how often specific words appear in your dataset, for example to detect benchmark contamination or test set leakage, or to track the prevalence of particular vocabulary across a corpus. Matching is done token by token, so each entry in the word list should be a single word rather than a multi-word phrase.

Code Reference

Source Location

Signature

class WordsContaminationStats(BaseStats):
    name = "😷 Words contamination"

    def __init__(
        self,
        output_folder: DataFolderLike,
        words: list[str],
        norm_config: TextNormConfig = TextNormConfig(),
        language: str = Languages.english,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None: ...

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

Import

from datatrove.pipeline.stats.contamination_stats import WordsContaminationStats

I/O Contract

Inputs

Name | Type | Required | Description
output_folder | DataFolderLike | Yes | Folder where computed statistics will be saved
words | list[str] | Yes | List of contamination words to search for; must contain at least one word
norm_config | TextNormConfig | No | Text normalization configuration applied before tokenization (default: TextNormConfig())
language | str | No | Language for word tokenization (default: Languages.english)
groups_to_compute | list[GROUP] | No | Grouping strategies for statistics (default: all groups)
histogram_round_digits | int | No | Decimal digits for histogram rounding (default: 3)
top_k_config | TopKConfig | No | Top-K configuration for high-cardinality groups

Outputs

Name | Type | Description
words_contamination_{first_word} | float | Ratio of contamination word occurrences to total words in the document (0.0 to 1.0); {first_word} is the first entry of the words list

Usage Examples

Basic Usage

from datatrove.pipeline.stats.contamination_stats import WordsContaminationStats

# Detect benchmark contamination words
stats = WordsContaminationStats(
    output_folder="output/stats/",
    words=["benchmark", "test", "evaluation"],
)

With Custom Normalization

from datatrove.pipeline.stats.contamination_stats import WordsContaminationStats
from datatrove.utils.text import TextNormConfig
from datatrove.utils.typeshelper import Languages

stats = WordsContaminationStats(
    output_folder="output/stats/",
    words=["lorem", "ipsum"],
    norm_config=TextNormConfig(lowercase=True),
    language=Languages.english,
    groups_to_compute=["summary", "histogram"],
)
