Principle:Huggingface Datatrove Exact Deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data Deduplication, Data Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Exact Deduplication is the principle of removing documents with identical content from a dataset by computing a deterministic hash of each document's content and treating documents whose hashes match as duplicates.
Description
Exact deduplication is a fundamental data quality technique that identifies and removes documents whose content is byte-for-byte identical. Unlike fuzzy or approximate deduplication (such as MinHash or SimHash), it matches only documents whose content hashes are equal, so a false positive can arise only from a hash collision, which is vanishingly rare with a sufficiently wide hash. This is accomplished by computing a cryptographic or non-cryptographic hash of each document's content and grouping documents by hash value. Within each group of identical documents, a priority function determines which single document to retain.
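A minimal sketch of the hash-and-group idea, in plain Python rather than the Datatrove API. The function names and the choice of SHA-1 are assumptions made for illustration; any deterministic hash would serve the same purpose.

```python
import hashlib
from collections import defaultdict


def content_hash(text: str) -> str:
    """Return a deterministic digest of the document's UTF-8 bytes."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def group_by_hash(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each content hash to the IDs of the documents that produced it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for doc_id, text in docs.items():
        groups[content_hash(text)].append(doc_id)
    return dict(groups)


docs = {"a": "hello world", "b": "hello world", "c": "Hello world"}

# Only groups with more than one member contain duplicates.
duplicate_groups = [ids for ids in group_by_hash(docs).values() if len(ids) > 1]
```

Note that "c" differs from "a" and "b" only in capitalization and is therefore not matched: exact deduplication does not catch near-duplicates.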
The approach is especially important for training data preparation in machine learning, where duplicate documents can bias model training by overrepresenting certain texts. A distributed, multi-stage design allows exact deduplication to scale to billions of documents by partitioning the hash space across workers and processing each partition independently.
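Partitioning the hash space can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the 64-bit hash width, the use of BLAKE2b, and the names `signature`, `bucket_for`, and `partition` are all hypothetical.

```python
import hashlib

HASH_BITS = 64  # assumed hash width for this sketch


def signature(text: str) -> int:
    """Hash the document text to an unsigned 64-bit integer."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bucket_for(hash_value: int, num_buckets: int) -> int:
    """Split [0, 2**HASH_BITS) into equal contiguous ranges and return the range index."""
    return (hash_value * num_buckets) >> HASH_BITS


def partition(texts: list[str], num_buckets: int) -> list[list[int]]:
    """Route each document's hash to the bucket that owns its hash range."""
    buckets: list[list[int]] = [[] for _ in range(num_buckets)]
    for text in texts:
        h = signature(text)
        buckets[bucket_for(h, num_buckets)].append(h)
    return buckets


buckets = partition(["a", "b", "a"], num_buckets=4)
```

Because identical content always yields the same hash, all copies of a document land in the same bucket, so each bucket can be checked for duplicates without consulting any other bucket.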
Usage
Apply exact deduplication as an early stage in data processing pipelines when you need to remove identical documents from large corpora. It is most effective for catching verbatim copies and should be combined with other deduplication methods (sentence-level, MinHash) for comprehensive duplicate removal.
Theoretical Basis
The exact deduplication algorithm operates in three stages:
- Stage 1 -- Signature Generation: Each document's content is passed through a hash function (configurable via HashConfig) to produce a fixed-size hash. Signatures are stored as sorted (hash, priority, doc_id) tuples and partitioned into buckets by hash range, enabling embarrassingly parallel processing in Stage 2.
- Stage 2 -- Duplicate Detection: Sorted signature files from all workers are merged using a k-way merge with a min-heap (priority queue). As the merge proceeds, consecutive entries with the same hash value are identified as duplicates. The document with the highest priority in each duplicate group is retained; all others are marked for removal.
- Stage 3 -- Filtering: The original document stream is replayed and documents whose IDs appear in the duplicate list are dropped (or saved via an exclusion writer for auditing). Surviving documents are annotated with the size of their duplicate cluster.
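The three stages above can be sketched end to end in plain Python. This is a simplified, in-memory illustration and not the Datatrove implementation: the function names are hypothetical, the priority is stored negated so that the highest-priority document sorts first within each hash (a choice made for this sketch), and `heapq.merge` stands in for the k-way heap merge over signature files.

```python
import hashlib
import heapq
from itertools import groupby


def signature(text: str) -> int:
    """64-bit integer taken from the first 8 bytes of a SHA-1 digest."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def stage1_signatures(worker_docs, priority_fn):
    """Stage 1: one sorted run of (hash, -priority, doc_id) tuples per worker."""
    return sorted(
        (signature(text), -priority_fn(doc_id, text), doc_id)
        for doc_id, text in worker_docs
    )


def stage2_find_duplicates(signature_runs):
    """Stage 2: k-way merge of the sorted runs, then scan equal-hash neighbours."""
    to_drop, cluster_size = set(), {}
    merged = heapq.merge(*signature_runs)  # heap-based merge of sorted inputs
    for _, group in groupby(merged, key=lambda sig: sig[0]):
        group = list(group)
        keeper = group[0][2]  # first entry has the highest priority
        cluster_size[keeper] = len(group)
        to_drop.update(doc_id for _, _, doc_id in group[1:])
    return to_drop, cluster_size


def stage3_filter(docs, to_drop, cluster_size):
    """Stage 3: replay the documents, skip duplicates, annotate survivors."""
    for doc_id, text in docs:
        if doc_id not in to_drop:
            yield doc_id, text, cluster_size.get(doc_id, 1)


worker0 = [("w0/0", "spam"), ("w0/1", "eggs")]
worker1 = [("w1/0", "spam"), ("w1/1", "ham")]


def prefer_worker1(doc_id, text):
    # Hypothetical policy: documents from worker 1 outrank those from worker 0.
    return 2 if doc_id.startswith("w1") else 1


runs = [stage1_signatures(w, prefer_worker1) for w in (worker0, worker1)]
to_drop, cluster_size = stage2_find_duplicates(runs)
survivors = list(stage3_filter(worker0 + worker1, to_drop, cluster_size))
```

Here "spam" appears on both workers; the copy on worker 1 has the higher priority and is kept with a cluster size of 2, while the copy on worker 0 is dropped.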
Priority-Based Retention: When multiple documents share the same hash, a configurable priority function, which returns an integer in the range [1, 65535], determines which one to keep: the document with the highest priority is retained. This enables domain-specific retention policies such as preferring newer documents, higher-quality sources, or specific datasets.
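As an illustration of a "prefer newer documents" policy, the sketch below maps a document date to a priority and clamps it into the stated range. The document layout (a dict with `id` and `date` keys), the reference date, and the function names are assumptions made for this example.

```python
from datetime import date

MIN_PRIORITY, MAX_PRIORITY = 1, 65535
REFERENCE_DATE = date(2000, 1, 1)  # hypothetical origin for this sketch


def clamp_priority(value: int) -> int:
    """Force a raw score into the allowed priority range."""
    return max(MIN_PRIORITY, min(MAX_PRIORITY, value))


def newer_first(doc: dict) -> int:
    """Later dates get higher priority; dates before the origin clamp to 1."""
    return clamp_priority((doc["date"] - REFERENCE_DATE).days + 1)


def pick_keeper(group: list[dict], priority_fn) -> dict:
    """Return the document with the highest priority in a duplicate group."""
    return max(group, key=priority_fn)


group = [
    {"id": "old", "date": date(2015, 6, 1)},
    {"id": "new", "date": date(2023, 1, 15)},
    {"id": "ancient", "date": date(1990, 1, 1)},
]
keeper = pick_keeper(group, newer_first)
```

Clamping keeps the score valid even for out-of-range inputs, at the cost of treating every document older than the reference date as equally low priority.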
Cross-Dataset Deduplication: A pre-built index of hashes from a reference dataset can be used in Stage 2 to deduplicate new data against previously processed corpora without reprocessing the reference data.
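A minimal sketch of deduplicating against a pre-built index, assuming the index is simply a sorted list of hashes held in memory. The on-disk index format used by the library is not described here, and the function names below are hypothetical.

```python
import bisect
import hashlib


def signature(text: str) -> int:
    """64-bit integer taken from the first 8 bytes of a SHA-1 digest."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def build_index(reference_texts) -> list[int]:
    """Build a sorted, duplicate-free list of hashes from the reference dataset."""
    return sorted({signature(text) for text in reference_texts})


def in_index(index: list[int], hash_value: int) -> bool:
    """Binary-search the sorted index for a hash."""
    pos = bisect.bisect_left(index, hash_value)
    return pos < len(index) and index[pos] == hash_value


def dedup_against_index(new_docs, index):
    """Yield new documents found neither in the index nor earlier in the stream."""
    seen = set()
    for doc_id, text in new_docs:
        h = signature(text)
        if in_index(index, h) or h in seen:
            continue
        seen.add(h)
        yield doc_id, text


index = build_index(["already processed", "also processed"])
new_docs = [("n0", "fresh text"), ("n1", "already processed"), ("n2", "fresh text")]
kept = list(dedup_against_index(new_docs, index))
```

Only the hashes of the reference dataset are needed at this point, which is why the reference documents themselves do not have to be reprocessed.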