Implementation:Huggingface Datatrove NGramDecontamination

From Leeroopedia
Revision as of 13:02, 16 February 2026 by Admin (Auto-imported from implementations/Huggingface_Datatrove_NGramDecontamination.md)
Knowledge Sources
Domains: Data Quality, NLP
Last Updated: 2026-02-14 17:00 GMT

Overview

Implements n-gram-based decontamination to remove training documents that contain text overlapping with evaluation benchmark tasks, preventing benchmark data leakage.

Description

The NGramDecontamination module provides a two-stage pipeline for detecting and removing benchmark contamination from training data. The first stage, NGramsDecontIndexer, builds a hash index of n-grams (12-grams by default) extracted from evaluation task data. The second stage, NGramsDecontFilter, loads that index and removes every training document that contains at least one matching n-gram.

The indexer supports multiple sources of evaluation data: documents from a previous pipeline step (with "text" as the label and "query" in metadata), or lighteval-defined benchmark tasks loaded via the lighteval library. For each evaluation example, it tokenizes and normalizes both the label (answer) and optionally the query (prompt), then computes n-grams from the label, from the query (if find_query_ngrams is enabled), and from the overlapping span between query and label (if find_overlap_ngrams is enabled). Each n-gram is hashed and stored per task name as a binary numpy array of uint64 values.

The filter loads all per-task hash files into a single dictionary mapping hashes to task names. For each training document, it tokenizes and normalizes the text, computes n-gram hashes, and checks each hash against the index. If any match is found, the document is removed and annotated with the contaminated n-gram text and the source task name in its metadata.
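
The lookup step can be illustrated with a minimal, self-contained sketch. It assumes the same stand-ins as before (whitespace tokenization and a truncated SHA-1 hash), and `build_index` and `keep_document` are hypothetical names rather than the library's API; only the metadata keys "contaminated_ngram" and "contaminated_task" are taken from this page.

```python
import hashlib


def hash_ngram(ngram: str) -> int:
    # Stand-in 64-bit hash (assumption): first 8 bytes of SHA-1.
    return int.from_bytes(hashlib.sha1(ngram.encode("utf-8")).digest()[:8], "big")


def build_index(task_ngrams: dict[str, list[str]]) -> dict[int, str]:
    # Merge per-task n-grams into one mapping from hash to task name.
    index: dict[int, str] = {}
    for task, grams in task_ngrams.items():
        for gram in grams:
            index[hash_ngram(gram)] = task
    return index


def keep_document(text: str, metadata: dict, index: dict[int, str], n: int) -> bool:
    # Return False (drop the document) on the first n-gram found in the index,
    # recording which n-gram matched and which task it came from.
    tokens = text.lower().split()
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i : i + n])
        task = index.get(hash_ngram(gram))
        if task is not None:
            metadata["contaminated_ngram"] = gram
            metadata["contaminated_task"] = task
            return False
    return True


# Toy example with 3-grams and a single hypothetical task.
index = build_index({"toy_task": ["the quick brown"]})
dirty_meta: dict = {}
clean_meta: dict = {}
dirty_kept = keep_document("See the quick brown fox", dirty_meta, index, n=3)
clean_kept = keep_document("An unrelated training sentence", clean_meta, index, n=3)
```

Because the check stops at the first hit, a removed document records only one contaminated n-gram even if several of its n-grams are in the index.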

The NGramsDecontConfig dataclass controls the n-gram size, which n-gram types to compute, and the text normalization and hash configuration.

Usage

Use this module when preparing training data for language models to ensure that evaluation benchmark answers do not appear in the training set. Run the indexer first to build the hash index from your evaluation tasks, then run the filter on your training data.

Code Reference

Source Location

Signature

@dataclass
class NGramsDecontConfig:
    n_grams: int = 12
    find_query_ngrams: bool = False
    find_overlap_ngrams: bool = True
    norm_config: TextNormConfig = field(default_factory=TextNormConfig)
    hash_config: HashConfig = field(default_factory=HashConfig)

class NGramsDecontIndexer(PipelineStep):
    def __init__(
        self,
        output_folder: DataFolderLike,
        lighteval_tasks: str | list[str] | None = None,
        custom_lighteval_tasks: str | None = None,
        config: NGramsDecontConfig = None,
        language: str = Languages.english,
    ): ...
    def compute_hashes(self, label: str, query: str | None = None) -> list[int]: ...
    def run(self, data: DocumentsPipeline = None, rank: int = 0, world_size: int = 1): ...

class NGramsDecontFilter(BaseFilter):
    def __init__(
        self,
        index_folder: DataFolderLike,
        config: NGramsDecontConfig = None,
        exclusion_writer: DiskWriter = None,
        language: str = Languages.english,
    ): ...
    def filter(self, doc: Document) -> bool | Tuple[bool, str]: ...

Import

from datatrove.pipeline.decont.n_grams import NGramsDecontIndexer, NGramsDecontFilter, NGramsDecontConfig

I/O Contract

Inputs

Name | Type | Required | Description
output_folder | DataFolderLike | Yes (Indexer) | Where to save the hash index files
index_folder | DataFolderLike | Yes (Filter) | Where to load the hash index files from
lighteval_tasks | str or list[str] | No | Lighteval tasks to index, given as task strings (e.g. "leaderboard|hellaswag|5|1") or as a path to a task list file
config | NGramsDecontConfig | No | Configuration for n-gram size, normalization, and hashing
language | str | No | Language for word tokenization (default: English)
exclusion_writer | DiskWriter | No | Writer for saving removed contaminated documents

Outputs

Name | Type | Description
Hash index files | binary files | Per-task .index.hashes files containing uint64 hash arrays (Indexer)
Filtered documents | DocumentsPipeline | Documents with contaminated entries removed (Filter)
Metadata annotations | dict | "contaminated_ngram" and "contaminated_task" added to removed document metadata
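
As a rough picture of the on-disk format, the sketch below writes one file of raw 64-bit unsigned integers per task and merges all files back into a single hash-to-task dictionary. It is an illustration under assumptions: the standard library's `array` module (type code "Q") stands in for a numpy uint64 array, the file name pattern `<task>.index.hashes` is inferred from the table above, and `write_index` and `load_index` are hypothetical helpers.

```python
import tempfile
from array import array
from pathlib import Path

SUFFIX = ".index.hashes"


def write_index(folder: Path, task: str, hashes: set[int]) -> Path:
    # One binary file per task, holding raw unsigned 64-bit integers.
    path = folder / f"{task}{SUFFIX}"
    with path.open("wb") as f:
        array("Q", sorted(hashes)).tofile(f)
    return path


def load_index(folder: Path) -> dict[int, str]:
    # Merge every per-task file into one mapping from hash to task name.
    index: dict[int, str] = {}
    for path in sorted(folder.glob(f"*{SUFFIX}")):
        task = path.name[: -len(SUFFIX)]
        values = array("Q")
        values.frombytes(path.read_bytes())
        for value in values:
            index[value] = task
    return index


# Round trip through a temporary folder with two hypothetical tasks.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    write_index(folder, "task_a", {1, 2**64 - 1})
    write_index(folder, "task_b", {42})
    index = load_index(folder)
```

In this sketch, a hash that appears in more than one task file ends up attributed to whichever file is loaded last, since a dictionary keeps a single task name per hash.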

Usage Examples

Basic Usage

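from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.decont.n_grams import (
    NGramsDecontIndexer,
    NGramsDecontFilter,
    NGramsDecontConfig,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

# Use the same config in both stages so the filter computes
# hashes the same way the indexer did.
config = NGramsDecontConfig(n_grams=12)

# Step 1: Build the decontamination index
indexer_executor = LocalPipelineExecutor(
    pipeline=[
        NGramsDecontIndexer(
            output_folder="decontamination_index/",
            lighteval_tasks=["leaderboard|hellaswag|5|1"],
            config=config,
        )
    ],
    tasks=1,
    logging_dir="logs/decontamination_index",
)

# Step 2: Filter training data
# The reader and writer paths below are placeholders; point them at your own data.
filter_executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/train/"),  # placeholder input folder
        NGramsDecontFilter(
            index_folder="decontamination_index/",
            config=config,
        ),
        JsonlWriter("data/train_decontaminated/"),  # placeholder output folder
    ],
    tasks=8,
    depends=indexer_executor,  # the indexer must finish before filtering starts
    logging_dir="logs/decontamination_filter",
)

# Running the filter executor launches its dependency (the indexer) first.
filter_executor.run()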
