Workflow:Huggingface Datatrove Minhash Deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Deduplication, NLP |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Four-stage distributed pipeline for fuzzy near-duplicate document detection and removal using MinHash Locality-Sensitive Hashing (LSH).
Description
This workflow implements large-scale fuzzy deduplication of text documents using MinHash signatures and LSH banding. Unlike exact deduplication, MinHash identifies documents that are similar but not identical, catching near-duplicates such as boilerplate pages with minor variations. The process is split into four dependent stages that must run sequentially: signature computation, bucket matching, cluster formation via union-find, and final filtering. Each stage communicates through intermediate files, enabling distributed execution across thousands of tasks on a Slurm cluster.
Usage
Execute this workflow after initial text extraction and quality filtering (e.g., after the Common Crawl Processing workflow) when you need to remove near-duplicate documents from a large corpus. This is essential for LLM pretraining data to avoid training on repetitive content, which degrades model quality.
Execution Steps
Step 1: Compute MinHash Signatures
For each document, generate a MinHash signature by shingling the text into n-grams (default 5-grams), hashing each shingle, and selecting minimum hash values across multiple hash permutations. The signatures are organized into bands (buckets) for the LSH scheme. Each task processes its shard of input files and writes signature files to an intermediate output folder.
Key considerations:
- The number of bands and hashes per band control the similarity threshold
- Hash function choice (SHA1 vs xxHash) and precision (32-bit vs 64-bit) affect false positive rates
- Language-specific word tokenization is available for non-English corpora
Step 2: Find Bucket Matches
Read all signatures for each LSH bucket and identify document pairs that share at least one bucket (i.e., have identical band signatures). This stage can be parallelized per bucket, with multiple workers processing sub-ranges within each bucket. The output is a set of candidate duplicate pairs per bucket.
Key considerations:
- Parallelism is per-bucket: total tasks = num_buckets x workers_per_bucket
- Memory requirements scale with the number of signatures per bucket
- This stage reads signatures from Stage 1 and writes bucket match files
Step 3: Cluster Duplicates
Merge all bucket match results and build connected components using a union-find data structure. Each connected component represents a cluster of near-duplicate documents. For each cluster, one document is designated as the "keeper" (typically the first encountered), and all others are marked for removal. This stage must run on a single task due to the global nature of the union-find operation.
Key considerations:
- Runs on a single task with high memory (the entire duplicate graph must fit in RAM)
- A Rust implementation (fast_mh3) is available as a high-performance drop-in replacement
- Output is a set of document IDs to remove
Step 4: Filter Duplicates
Re-read the original input data and remove all documents whose IDs appear in the removal set from Stage 3. The filtered data is written to the final output location. Optionally count tokens before and after deduplication for statistics, and apply additional processing (such as PII removal) on the deduplicated output.
Key considerations:
- Input reader must match Stage 1 exactly (same tasks, same source)
- Token counting provides before/after dedup statistics
- Excluded documents can be written to a separate output for auditing