Implementation:Huggingface Datatrove ExactDedupSignature
| Knowledge Sources | |
|---|---|
| Domains | Data Deduplication, Data Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
The exact deduplication module implements a three-stage distributed pipeline (ExactDedupSignature, ExactFindDedups, ExactDedupFilter) that removes documents with identical content, detected by hash matching. It also provides ExactDedupConfig for configuration and ExactDedupBuildIndex for cross-dataset deduplication.
Description
This module provides exact content deduplication through a multi-stage approach. ExactDedupConfig is a dataclass that holds the configuration, including a content_getter callable that extracts the content to hash from each document, an optional document_priority callable for determining which duplicate to keep, a HashConfig for hash function settings, and an only_dedup_in_index flag.
ExactDedupSignature (Stage 1) is a PipelineStep that hashes each document's content using the configured hash function and stores the results as sorted binary signature files. Signatures are partitioned by hash range into finder_workers buckets, so that each Stage 2 worker can process one bucket in parallel. Each signature contains the hash value, a priority value (1-65535), and the document ID.
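The Stage 1 idea can be illustrated with a small standalone sketch. It is not the datatrove implementation: the 64-bit truncated SHA-1 hash, the helper names (content_hash, bucket_for, build_signatures, pack_signature), the clamping of priorities, and the "<QHI" record layout are all assumptions chosen for illustration.

```python
import hashlib
import struct

HASH_BITS = 64  # assumed hash width for this sketch
HASH_MAX = 2**HASH_BITS


def content_hash(content):
    """Hash str or bytes content to an unsigned 64-bit int (truncated SHA-1, an assumption)."""
    data = content.encode("utf-8") if isinstance(content, str) else content
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def bucket_for(hash_value, finder_workers):
    """Map a hash to one of finder_workers contiguous hash ranges."""
    range_size = -(-HASH_MAX // finder_workers)  # ceiling division
    return hash_value // range_size


def build_signatures(docs, finder_workers, priority=lambda content: 1):
    """docs: iterable of (doc_id, content). Returns one sorted signature list per bucket."""
    buckets = [[] for _ in range(finder_workers)]
    for doc_id, content in docs:
        h = content_hash(content)
        # Clamp to the 1-65535 range so the priority fits an unsigned 16-bit field.
        prio = min(max(int(priority(content)), 1), 65535)
        buckets[bucket_for(h, finder_workers)].append((h, prio, doc_id))
    for bucket in buckets:
        # Sort by hash, then by descending priority within equal hashes.
        bucket.sort(key=lambda sig: (sig[0], -sig[1]))
    return buckets


def pack_signature(sig):
    """Pack (hash, priority, doc_id) into a fixed 14-byte little-endian record."""
    return struct.pack("<QHI", *sig)
```

Because identical content always yields the same hash, all copies of a document land in the same bucket, which is what lets each Stage 2 worker handle its bucket independently.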
ExactFindDedups (Stage 2) merges all sorted signature files using a heap-based priority queue, identifies documents that share the same hash (exact duplicates), and writes out the IDs of the duplicate documents to remove. Within each group of duplicates, the highest-priority document is retained. It optionally deduplicates against a pre-built index for cross-dataset deduplication.
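A minimal sketch of the merge-and-group step, using plain in-memory lists in place of binary signature files. The function name find_duplicates and the tuple layout (hash, priority, doc_id) are illustrative assumptions, and index lookups are left out.

```python
import heapq
from itertools import groupby


def find_duplicates(sorted_files):
    """Return the doc IDs to remove, keeping the highest-priority doc per hash.

    sorted_files: iterables of (hash, priority, doc_id), each sorted by hash.
    """
    # heapq.merge performs a lazy k-way merge backed by a heap.
    merged = heapq.merge(*sorted_files, key=lambda sig: sig[0])
    to_remove = []
    for _, group in groupby(merged, key=lambda sig: sig[0]):
        # Stable sort: highest priority first, ties keep merge order.
        cluster = sorted(group, key=lambda sig: -sig[1])
        to_remove.extend(doc_id for _, _, doc_id in cluster[1:])
    return to_remove


file_a = [(10, 1, "a0"), (20, 5, "a1")]
file_b = [(10, 3, "b0"), (30, 1, "b1")]
file_c = [(20, 2, "c0")]
duplicates = find_duplicates([file_a, file_b, file_c])
```

Because every input is already sorted by hash, the merge only needs to hold one pending record per file at a time, which is the reason Stage 1 writes sorted files.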
ExactDedupFilter (Stage 3) loads the duplicate document IDs from Stage 2, iterates through the document pipeline, drops flagged duplicates (optionally saving them via an exclusion writer), and annotates surviving documents with a duplicate_count metadata field.
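The filtering step can be sketched as a generator over documents. The Doc dataclass below is a stand-in for the library's Document type, the excluded list stands in for an exclusion writer, and the cluster_sizes mapping (kept document ID to cluster size) is an assumption about where duplicate_count comes from.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a pipeline document."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def filter_duplicates(docs, duplicate_ids, excluded=None, cluster_sizes=None):
    """Yield documents not flagged as duplicates; optionally collect the dropped ones."""
    cluster_sizes = cluster_sizes or {}
    for doc in docs:
        if doc.id in duplicate_ids:
            if excluded is not None:
                excluded.append(doc)  # plays the role of the exclusion writer
            continue
        if doc.id in cluster_sizes:
            # Assumed meaning: size of the duplicate cluster this document represents.
            doc.metadata["duplicate_count"] = cluster_sizes[doc.id]
        yield doc


docs = [Doc("a", "x"), Doc("b", "x"), Doc("c", "y")]
excluded = []
kept = list(filter_duplicates(docs, {"b"}, excluded, cluster_sizes={"a": 2}))
```

Holding duplicate_ids in a set keeps each membership check constant time, so the filter makes a single pass over the document stream.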
ExactDedupBuildIndex creates a hash-only index from signature files for cross-dataset deduplication.
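As a rough illustration of a hash-only index, the sketch below keeps just the sorted, unique hash values from a set of signature files and answers membership queries by binary search. The names build_index and in_index and the in-memory list representation are assumptions; the on-disk format used by the library is not shown here.

```python
from bisect import bisect_left


def build_index(signature_files):
    """Collect unique hashes from (hash, priority, doc_id) records into a sorted list."""
    hashes = {sig[0] for sig_file in signature_files for sig in sig_file}
    return sorted(hashes)


def in_index(index, hash_value):
    """Binary-search the sorted index for a hash."""
    pos = bisect_left(index, hash_value)
    return pos < len(index) and index[pos] == hash_value


index = build_index([[(10, 1, "a"), (20, 1, "b")], [(10, 2, "c")]])
```

Dropping priorities and document IDs keeps the index compact: it only needs to answer whether a given hash has been seen in another dataset.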
Usage
Use this module when you need to remove documents with identical content from a large dataset. The three stages must be run in sequence, typically as separate pipeline executor tasks to allow distributed processing at each stage.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/dedup/exact_dedup.py
- Lines: 1-479
Signature
@dataclass
class ExactDedupConfig:
    content_getter: Callable[[Document], bytes | str]
    document_priority: Callable[[Document], int] | None = None
    hash_config: HashConfig = field(default_factory=HashConfig)
    only_dedup_in_index: bool = True


class ExactDedupSignature(PipelineStep):
    def __init__(
        self,
        output_folder: DataFolderLike,
        config: ExactDedupConfig,
        finder_workers: int = 1,
    ):
        ...


class ExactFindDedups(PipelineStep):
    def __init__(
        self,
        data_folder: DataFolderLike,
        output_folder: DataFolderLike,
        config: ExactDedupConfig,
        save_cluster_size: bool = False,
        index_folder: DataFolderLike | None = None,
        lines_to_buffer: int = 5,
    ):
        ...


class ExactDedupFilter(PipelineStep):
    def __init__(
        self,
        data_folder: DataFolderLike,
        config: ExactDedupConfig,
        exclusion_writer: DiskWriter | None = None,
    ):
        ...
Import
from datatrove.pipeline.dedup.exact_dedup import (
    ExactDedupConfig,
    ExactDedupSignature,
    ExactFindDedups,
    ExactDedupFilter,
    ExactDedupBuildIndex,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes (Stages 1, 2) | Folder where hash signatures (Stage 1) or duplicate document IDs (Stage 2) are saved |
| config | ExactDedupConfig | Yes | Configuration with content_getter, priority function, and hash settings |
| finder_workers | int | No | Number of hash-range buckets written by Stage 1 for parallel processing in Stage 2 (default: 1) |
| data_folder | DataFolderLike | Yes (Stages 2, 3) | Folder containing signature or duplicate files from previous stage |
| index_folder | DataFolderLike | No | Folder with pre-built index for cross-dataset dedup |
| exclusion_writer | DiskWriter | No | Writer to save excluded duplicate documents |
Outputs
| Name | Type | Description |
|---|---|---|
| Signature files | Binary | Sorted (hash, priority, doc_id) tuples partitioned by hash range |
| Duplicate files | Binary | Document IDs flagged as duplicates |
| Filtered documents | DocumentsPipeline | Documents with duplicates removed and duplicate_count metadata |
Usage Examples
Basic Usage
from datatrove.pipeline.dedup.exact_dedup import (
    ExactDedupConfig,
    ExactDedupSignature,
    ExactFindDedups,
    ExactDedupFilter,
)

config = ExactDedupConfig(
    content_getter=lambda doc: doc.text,
    document_priority=lambda doc: int(doc.metadata.get("quality_score", 1)),
)

# Stage 1: Generate signatures
stage1 = ExactDedupSignature(
    output_folder="s3://bucket/dedup/sigs",
    config=config,
    finder_workers=4,
)

# Stage 2: Find duplicates
stage2 = ExactFindDedups(
    data_folder="s3://bucket/dedup/sigs",
    output_folder="s3://bucket/dedup/dups",
    config=config,
)

# Stage 3: Filter documents
stage3 = ExactDedupFilter(
    data_folder="s3://bucket/dedup/dups",
    config=config,
)