Implementation:Huggingface Datatrove NGramDecontamination

Knowledge Sources	Huggingface_Datatrove
Domains	Data Quality, NLP
Last Updated	2026-02-14 17:00 GMT

Overview

Implements n-gram-based decontamination to remove training documents that contain text overlapping with evaluation benchmark tasks, preventing benchmark data leakage.

Description

The NGramDecontamination module provides a two-stage pipeline for detecting and removing benchmark contamination from training data. The first stage, NGramsDecontIndexer, builds a hash index of n-grams (default 12-grams) extracted from evaluation task data. The second stage, NGramsDecontFilter, loads the index and filters training documents that contain any matching n-grams.

The indexer accepts evaluation data from two sources: documents from a previous pipeline step (the document text is used as the label and the "query" metadata field as the query) and lighteval-defined benchmark tasks loaded via the lighteval library. For each evaluation example, it tokenizes and normalizes the label (answer) and, when present, the query (prompt). It then computes n-grams from the label, from the query (if find_query_ngrams is enabled), and from the span that straddles the boundary between the end of the query and the start of the label (if find_overlap_ngrams is enabled). Each n-gram is hashed, and the hashes for each task are stored under that task's name as a binary numpy array of uint64 values.
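The n-gram extraction described above can be sketched in plain Python. This is a simplified illustration, not the library's code: whitespace splitting and lowercasing stand in for the real tokenizer and TextNormConfig, a truncated SHA-1 digest stands in for the hash selected by HashConfig, and the helper names (ngrams, overlap_ngrams, hash_ngram) are hypothetical.

```python
import hashlib


def ngrams(tokens, n):
    """All contiguous windows of n tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def overlap_ngrams(query_tokens, label_tokens, n):
    """n-grams that start in the query and end in the label."""
    out = []
    for i in range(n - 1):
        from_query = n - 1 - i  # tokens taken from the end of the query
        from_label = i + 1      # tokens taken from the start of the label
        if len(query_tokens) >= from_query and len(label_tokens) >= from_label:
            out.append(query_tokens[-from_query:] + label_tokens[:from_label])
    return out


def hash_ngram(tokens):
    """Assumed stand-in hash: first 8 bytes of SHA-1 as an unsigned 64-bit int."""
    digest = hashlib.sha1(" ".join(tokens).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def compute_hashes(label, query=None, n=12,
                   find_query_ngrams=False, find_overlap_ngrams=True):
    # Assumed normalization: lowercase + whitespace split.
    label_tokens = label.lower().split()
    grams = ngrams(label_tokens, n)
    if query is not None:
        query_tokens = query.lower().split()
        if find_query_ngrams:
            grams += ngrams(query_tokens, n)
        if find_overlap_ngrams:
            grams += overlap_ngrams(query_tokens, label_tokens, n)
    return [hash_ngram(g) for g in grams]


# With n=3: one label n-gram ("d e f") plus two boundary n-grams
# ("b c d" and "c d e").
example = compute_hashes("d e f", query="a b c", n=3)
```

Boundary n-grams matter for benchmarks whose answers are shorter than n tokens: the label alone yields no n-grams, but a training document that reproduces the prompt followed by the answer can still be caught.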

The filter loads all per-task hash files into a single dictionary mapping hashes to task names. For each training document, it tokenizes and normalizes the text, computes n-gram hashes, and checks each hash against the index. If any match is found, the document is removed and annotated with the contaminated n-gram text and the source task name in its metadata.
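The lookup step can be sketched as follows. This is a minimal illustration under assumptions: the index is built in memory rather than loaded from hash files, lowercasing and whitespace splitting stand in for the real normalization and tokenizer, and build_lookup, find_contamination, and hash_ngram are hypothetical helper names.

```python
import hashlib


def hash_ngram(text):
    """Assumed stand-in hash: first 8 bytes of SHA-1 as an unsigned 64-bit int."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def build_lookup(per_task_hashes):
    """Merge per-task hash lists into one dict mapping hash -> task name."""
    lookup = {}
    for task_name, hashes in per_task_hashes.items():
        for h in hashes:
            lookup[h] = task_name
    return lookup


def find_contamination(text, lookup, n):
    """Return (ngram_text, task_name) for the first matching n-gram, else None."""
    tokens = text.lower().split()
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        task_name = lookup.get(hash_ngram(gram))
        if task_name is not None:
            return gram, task_name
    return None


lookup = build_lookup({"demo_task": [hash_ngram("the quick brown")]})
hit = find_contamination("See the quick brown fox", lookup, n=3)
miss = find_contamination("nothing to see here at all", lookup, n=3)
```

A single dictionary keeps the per-document cost proportional to the number of n-grams in the document, independent of how many tasks were indexed. In the real filter, a hit leads to the document being dropped and the matched n-gram and task name being recorded in its metadata.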

The NGramsDecontConfig dataclass controls the n-gram size, which n-gram types to compute, and the text normalization and hash configuration.

Usage

Use this module when preparing training data for language models to ensure that evaluation benchmark answers do not appear in the training set. Run the indexer first to build the hash index from your evaluation tasks, then run the filter on your training data.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/decont/n_grams.py
Lines: 1-227

Signature

@dataclass
class NGramsDecontConfig:
    n_grams: int = 12
    find_query_ngrams: bool = False
    find_overlap_ngrams: bool = True
    norm_config: TextNormConfig = field(default_factory=TextNormConfig)
    hash_config: HashConfig = field(default_factory=HashConfig)

class NGramsDecontIndexer(PipelineStep):
    def __init__(
        self,
        output_folder: DataFolderLike,
        lighteval_tasks: str | list[str] | None = None,
        custom_lighteval_tasks: str | None = None,
        config: NGramsDecontConfig = None,
        language: str = Languages.english,
    ): ...
    def compute_hashes(self, label: str, query: str | None = None) -> list[int]: ...
    def run(self, data: DocumentsPipeline = None, rank: int = 0, world_size: int = 1): ...

class NGramsDecontFilter(BaseFilter):
    def __init__(
        self,
        index_folder: DataFolderLike,
        config: NGramsDecontConfig = None,
        exclusion_writer: DiskWriter = None,
        language: str = Languages.english,
    ): ...
    def filter(self, doc: Document) -> bool | Tuple[bool, str]: ...

Import

from datatrove.pipeline.decont.n_grams import NGramsDecontIndexer, NGramsDecontFilter, NGramsDecontConfig

I/O Contract

Inputs

Name	Type	Required	Description
output_folder	DataFolderLike	Yes (Indexer)	Where to save the hash index files
index_folder	DataFolderLike	Yes (Filter)	Where to load the hash index files from
lighteval_tasks	str or list[str]	No	Lighteval tasks to index, given as task strings (for example "leaderboard|hellaswag|5|1") or as a path to a task list file
config	NGramsDecontConfig	No	Configuration for n-gram size, normalization, and hashing
language	str	No	Language for word tokenization (default: English)
exclusion_writer	DiskWriter	No	Writer for saving removed contaminated documents

Outputs

Name	Type	Description
Hash index files	binary files	Per-task .index.hashes files containing uint64 hash arrays (Indexer)
Filtered documents	DocumentsPipeline	Documents with contaminated entries removed (Filter)
Metadata annotations	dict	"contaminated_ngram" and "contaminated_task" added to removed document metadata
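To make the index file format concrete, the sketch below writes and reads a flat array of unsigned 64-bit integers using only the standard library. It assumes the files are raw, headerless, and in native byte order, which this page does not state; the array module stands in for numpy, and write_hashes and read_hashes are hypothetical helpers rather than Datatrove functions.

```python
import os
import tempfile
from array import array


def write_hashes(path, hashes):
    """Write hashes as consecutive unsigned 64-bit integers (assumed layout)."""
    with open(path, "wb") as f:
        array("Q", hashes).tofile(f)


def read_hashes(path):
    """Read back every unsigned 64-bit integer stored in the file."""
    values = array("Q")
    count = os.path.getsize(path) // values.itemsize
    with open(path, "rb") as f:
        values.fromfile(f, count)
    return list(values)


with tempfile.TemporaryDirectory() as tmp:
    # File name follows the per-task ".index.hashes" pattern from the table above.
    path = os.path.join(tmp, "demo_task.index.hashes")
    write_hashes(path, [1, 2**63, 2**64 - 1])
    restored = read_hashes(path)
```

Because the file carries no metadata under this assumption, the reader has to know the hash width in advance, which is one reason to use the same NGramsDecontConfig for the indexer and the filter.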

Usage Examples

Basic Usage

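from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.decont.n_grams import (
    NGramsDecontConfig,
    NGramsDecontFilter,
    NGramsDecontIndexer,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

# Share one config so n-gram size, normalization, and hashing
# match between the index and the filter.
config = NGramsDecontConfig(n_grams=12)

# Step 1: Build the decontamination index
indexer_executor = LocalPipelineExecutor(
    pipeline=[
        NGramsDecontIndexer(
            output_folder="decontamination_index/",
            lighteval_tasks=["leaderboard|hellaswag|5|1"],
            config=config,
        )
    ],
    tasks=1,
    logging_dir="logs/decontamination_index",
)

# Step 2: Filter training data.
# The filter needs documents from an upstream step; the reader and writer
# paths below are placeholders, so substitute your own data locations.
filter_executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("training_data/"),
        NGramsDecontFilter(
            index_folder="decontamination_index/",
            config=config,
        ),
        JsonlWriter("decontaminated_data/"),
    ],
    tasks=8,
    depends=indexer_executor,  # the indexer runs before the filter
    logging_dir="logs/decontamination_filter",
)
filter_executor.run()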

Related Pages

Principle:Huggingface_Datatrove_NGram_Decontamination
