Workflow:Huggingface Datatrove Minhash Deduplication

Knowledge Sources	Datatrove MinHash LSH Datatrove README
Domains	Data_Engineering, Deduplication, NLP
Last Updated	2026-02-14 17:00 GMT

Overview

Four-stage distributed pipeline for fuzzy near-duplicate document detection and removal using MinHash Locality-Sensitive Hashing (LSH).

Description

This workflow implements large-scale fuzzy deduplication of text documents using MinHash signatures and LSH banding. Unlike exact deduplication, MinHash identifies documents that are similar but not identical, catching near-duplicates such as boilerplate pages with minor variations. The process is split into four dependent stages that must run sequentially: signature computation, bucket matching, cluster formation via union-find, and final filtering. Each stage communicates through intermediate files, enabling distributed execution across thousands of tasks on a Slurm cluster.

Usage

Execute this workflow after initial text extraction and quality filtering (e.g., after the Common Crawl Processing workflow) when you need to remove near-duplicate documents from a large corpus. This is essential for LLM pretraining data to avoid training on repetitive content, which degrades model quality.

Execution Steps

Step 1: Compute MinHash Signatures

For each document, generate a MinHash signature by shingling the text into n-grams (default 5-grams), hashing each shingle, and selecting minimum hash values across multiple hash permutations. The signatures are organized into bands (buckets) for the LSH scheme. Each task processes its shard of input files and writes signature files to an intermediate output folder.

Key considerations:

The number of bands and hashes per band control the similarity threshold
Hash function choice (SHA1 vs xxHash) and precision (32-bit vs 64-bit) affect false positive rates
Language-specific word tokenization is available for non-English corpora

Step 2: Find Bucket Matches

Read all signatures for each LSH bucket and identify document pairs that share at least one bucket (i.e., have identical band signatures). This stage can be parallelized per bucket, with multiple workers processing sub-ranges within each bucket. The output is a set of candidate duplicate pairs per bucket.

Key considerations:

Parallelism is per-bucket: total tasks = num_buckets x workers_per_bucket
Memory requirements scale with the number of signatures per bucket
This stage reads signatures from Stage 1 and writes bucket match files

Step 3: Cluster Duplicates

Merge all bucket match results and build connected components using a union-find data structure. Each connected component represents a cluster of near-duplicate documents. For each cluster, one document is designated as the "keeper" (typically the first encountered), and all others are marked for removal. This stage must run on a single task due to the global nature of the union-find operation.

Key considerations:

Runs on a single task with high memory (the entire duplicate graph must fit in RAM)
A Rust implementation (fast_mh3) is available as a high-performance drop-in replacement
Output is a set of document IDs to remove

Step 4: Filter Duplicates

Re-read the original input data and remove all documents whose IDs appear in the removal set from Stage 3. The filtered data is written to the final output location. Optionally count tokens before and after deduplication for statistics, and apply additional processing (such as PII removal) on the deduplicated output.

Key considerations:

Input reader must match Stage 1 exactly (same tasks, same source)
Token counting provides before/after dedup statistics
Excluded documents can be written to a separate output for auditing

Execution Diagram

GitHub URL

Workflow Repository