Implementation:Huggingface Datatrove NGramDecontamination
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, NLP |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Implements n-gram-based decontamination to remove training documents that contain text overlapping with evaluation benchmark tasks, preventing benchmark data leakage.
Description
The NGramDecontamination module provides a two-stage pipeline for detecting and removing benchmark contamination from training data. The first stage, NGramsDecontIndexer, builds a hash index of n-grams (default 12-grams) extracted from evaluation task data. The second stage, NGramsDecontFilter, loads the index and filters training documents that contain any matching n-grams.
The indexer supports two sources of evaluation data: documents from a previous pipeline step (the document text is used as the label and metadata["query"], if present, as the query), or benchmark tasks defined in and loaded through the lighteval library. For each evaluation example, it tokenizes and normalizes the label (answer) and, when available, the query (prompt). It then computes n-grams from the label, from the query (if find_query_ngrams is enabled), and from the spans that straddle the boundary between the end of the query and the start of the label (if find_overlap_ngrams is enabled). The n-gram hashes are grouped by task name, and each task's hashes are stored as a binary numpy array of uint64 values.
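The extraction step described above can be sketched in plain Python. This is a minimal illustration, not the library's code: `tokenize` and `hash_ngram` are hypothetical stand-ins for datatrove's text normalization and configurable hashing, and the toy n-gram size of 3 replaces the default of 12 so the example stays readable.

```python
import hashlib


def tokenize(text: str) -> list[str]:
    # Hypothetical stand-in for normalization + word tokenization:
    # lowercase the text and split on whitespace.
    return text.lower().split()


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    # All contiguous windows of n tokens; empty if the text is too short.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hash_ngram(ngram) -> int:
    # Assumption: any stable 64-bit hash works for the illustration.
    digest = hashlib.sha1(" ".join(ngram).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def compute_hashes(label, query=None, n=3,
                   find_query_ngrams=False, find_overlap_ngrams=True):
    label_tokens = tokenize(label)
    found = ngrams(label_tokens, n)
    if query is not None:
        query_tokens = tokenize(query)
        if find_query_ngrams:
            found += ngrams(query_tokens, n)
        if find_overlap_ngrams and n > 1:
            # Join the last n-1 query tokens with the first n-1 label tokens.
            # Every n-token window over this sequence contains at least one
            # token from each side, so it straddles the query/label boundary.
            boundary = query_tokens[-(n - 1):] + label_tokens[:n - 1]
            found += ngrams(boundary, n)
    # Deduplicate and return a sorted list of 64-bit integers.
    return sorted({hash_ngram(g) for g in found})


if __name__ == "__main__":
    hashes = compute_hashes("Paris is lovely", "The capital of France", n=3)
    print(len(hashes), "hashes for the toy example")
```

With overlap n-grams enabled, a training document that reproduces the end of a prompt followed by the start of its answer is caught even when neither part alone is n tokens long.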
The filter loads all per-task hash files into a single dictionary mapping hashes to task names. For each training document, it tokenizes and normalizes the text, computes n-gram hashes, and checks each hash against the index. If any match is found, the document is removed and annotated with the contaminated n-gram text and the source task name in its metadata.
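The lookup performed by the filter can be sketched as follows. This is an illustrative sketch under stated assumptions: `tokenize`, `hash_ngram`, `build_lookup`, and `check_document` are hypothetical helpers (not datatrove functions), the hash is an arbitrary stable 64-bit hash, and n is set to 3 for readability. Only the metadata keys "contaminated_ngram" and "contaminated_task" are taken from this page.

```python
import hashlib


def tokenize(text: str) -> list[str]:
    # Hypothetical stand-in for normalization + word tokenization.
    return text.lower().split()


def hash_ngram(ngram) -> int:
    # Assumption: any stable 64-bit hash works for the illustration.
    digest = hashlib.sha1(" ".join(ngram).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def build_lookup(task_hashes: dict[str, list[int]]) -> dict[int, str]:
    # Merge the per-task hash lists into one hash -> task name dictionary.
    lookup = {}
    for task, hashes in task_hashes.items():
        for h in hashes:
            lookup[h] = task
    return lookup


def check_document(text: str, lookup: dict[int, str], n: int = 3):
    # Returns (keep, metadata). The first matching n-gram marks the
    # document as contaminated and records what matched and from where.
    tokens = tokenize(text)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        task = lookup.get(hash_ngram(window))
        if task is not None:
            return False, {
                "contaminated_ngram": " ".join(window),
                "contaminated_task": task,
            }
    return True, {}


if __name__ == "__main__":
    lookup = build_lookup({"toy_task": [hash_ngram(tokenize("Paris is lovely"))]})
    print(check_document("Everyone agrees Paris is lovely in spring", lookup))
    print(check_document("An unrelated training document", lookup))
```

Because the index is a dictionary keyed by hash, each n-gram check is a constant-time lookup, so the cost per document grows linearly with its token count.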
The NGramsDecontConfig dataclass controls the n-gram size, which n-gram types to compute, and the text normalization and hash configuration.
Usage
Use this module when preparing training data for language models to ensure that evaluation benchmark answers do not appear in the training set. Run the indexer first to build the hash index from your evaluation tasks, then run the filter on your training data.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/decont/n_grams.py
- Lines: 1-227
Signature
    @dataclass
    class NGramsDecontConfig:
        n_grams: int = 12
        find_query_ngrams: bool = False
        find_overlap_ngrams: bool = True
        norm_config: TextNormConfig = field(default_factory=TextNormConfig)
        hash_config: HashConfig = field(default_factory=HashConfig)

    class NGramsDecontIndexer(PipelineStep):
        def __init__(
            self,
            output_folder: DataFolderLike,
            lighteval_tasks: str | list[str] | None = None,
            custom_lighteval_tasks: str | None = None,
            config: NGramsDecontConfig = None,
            language: str = Languages.english,
        ): ...
        def compute_hashes(self, label: str, query: str | None = None) -> list[int]: ...
        def run(self, data: DocumentsPipeline = None, rank: int = 0, world_size: int = 1): ...

    class NGramsDecontFilter(BaseFilter):
        def __init__(
            self,
            index_folder: DataFolderLike,
            config: NGramsDecontConfig = None,
            exclusion_writer: DiskWriter = None,
            language: str = Languages.english,
        ): ...
        def filter(self, doc: Document) -> bool | Tuple[bool, str]: ...
Import
from datatrove.pipeline.decont.n_grams import NGramsDecontIndexer, NGramsDecontFilter, NGramsDecontConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes (Indexer) | Where to save the hash index files |
| index_folder | DataFolderLike | Yes (Filter) | Where to load the hash index files from |
| lighteval_tasks | str or list[str] | No | Lighteval tasks to index, as strings in "suite\|task" format, or path to a task list file |
| config | NGramsDecontConfig | No | Configuration for n-gram size, normalization, and hashing |
| language | str | No | Language for word tokenization (default: English) |
| exclusion_writer | DiskWriter | No | Writer for saving removed contaminated documents |
Outputs
| Name | Type | Description |
|---|---|---|
| Hash index files | binary files | Per-task .index.hashes files containing uint64 hash arrays (Indexer) |
| Filtered documents | DocumentsPipeline | Documents with contaminated entries removed (Filter) |
| Metadata annotations | dict | "contaminated_ngram" and "contaminated_task" added to removed document metadata |
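The per-task index files can be inspected with a short round-trip sketch. It assumes each file holds raw, headerless uint64 values in native byte order; the page only states that the hashes are stored as a binary array of uint64, so treat this layout as an assumption. The helpers `write_index` and `read_index` are hypothetical and use only the standard library. Under the same assumption, `np.fromfile(path, dtype=np.uint64)` would read the same data.

```python
import os
import tempfile
from array import array


def write_index(path: str, hashes) -> None:
    # Store unique hashes as consecutive unsigned 64-bit integers.
    arr = array("Q", sorted(set(hashes)))
    assert arr.itemsize == 8  # "Q" is 8 bytes on common platforms
    with open(path, "wb") as f:
        arr.tofile(f)


def read_index(path: str) -> list[int]:
    # Read the whole file back as unsigned 64-bit integers.
    arr = array("Q")
    with open(path, "rb") as f:
        arr.frombytes(f.read())
    return list(arr)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The file name follows the <task>.index.hashes pattern named above.
        path = os.path.join(tmp, "toy_task.index.hashes")
        write_index(path, [42, 7, 7])
        print(read_index(path), os.path.getsize(path), "bytes")
```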
Usage Examples
Basic Usage
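    from datatrove.pipeline.decont.n_grams import (
        NGramsDecontIndexer,
        NGramsDecontFilter,
        NGramsDecontConfig,
    )
    from datatrove.executor.local import LocalPipelineExecutor
    from datatrove.pipeline.readers import JsonlReader

    # Step 1: Build the decontamination index from the evaluation tasks
    indexer_executor = LocalPipelineExecutor(
        pipeline=[
            NGramsDecontIndexer(
                output_folder="decontamination_index/",
                lighteval_tasks=["leaderboard|hellaswag|5|1"],
                config=NGramsDecontConfig(n_grams=12),
            )
        ],
        tasks=1,
        logging_dir="logs/decontamination_index",
    )

    # Step 2: Filter training data against the index
    filter_executor = LocalPipelineExecutor(
        pipeline=[
            # The filter needs a reader to supply documents.
            # "training_data/" is a placeholder path; point it at your own data.
            JsonlReader("training_data/"),
            NGramsDecontFilter(
                index_folder="decontamination_index/",
                # Use the same config as the indexer so the hashes are comparable
                config=NGramsDecontConfig(n_grams=12),
            ),
            # Add a writer step here to persist the filtered documents.
        ],
        tasks=8,
        depends=indexer_executor,  # the indexer runs first
        logging_dir="logs/decontamination_filter",
    )

    # Running the filter executor also runs the executor it depends on
    filter_executor.run()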