Principle: Huggingface Datatrove MinHash Bucket Matching
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Finding candidate duplicate pairs by matching documents with identical hash signatures within LSH buckets. This stage performs a distributed merge-sort across all signature files within each bucket to identify document pairs that share the same band signature, indicating they are potential near-duplicates.
Description
Bucket matching is the second stage of the MinHash deduplication pipeline. It takes the sorted binary signature files produced by the signature computation stage and identifies all document pairs that share an identical band signature within the same bucket.
The matching process works as follows:
- K-way merge: A heap-based priority queue is initialized with the first signature from each sorted file within a given bucket. Signatures are popped in sorted order and compared to the previously seen signature.
- Pair emission: When two consecutive signatures in the sorted stream have identical band hashes (all `hashes_per_bucket` values match), a duplicate pair is emitted as a 4-tuple: `(file_id1, doc_id1, file_id2, doc_id2)`.
- Distributed processing: The work is distributed across workers, with `world_size` required to be divisible by `num_buckets`. Multiple workers can process a single bucket by partitioning the hash range based on boundary values from the first signature file.
- Index support: The stage supports incremental deduplication via index files. Previously processed datasets can be stored as index files, and new documents can be deduplicated against both the current batch and the historical index. New unique signatures can optionally be saved as a new index for future use.
A sentinel value (2^32 - 1, 2^32 - 1) is used to represent matches against index entries (which lack file/document IDs).
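The merge-and-compare loop described above can be sketched in a few lines of Python. This is an illustrative toy, not Datatrove's actual implementation: each "file" is modeled as a sorted in-memory list of `(band_signature, file_id, doc_id)` tuples, and pairs are emitted whenever two consecutive records in the merged stream share a signature.

```python
import heapq

# Sentinel standing in for index entries, which carry no file/doc IDs
SENTINEL = (2**32 - 1, 2**32 - 1)

def match_bucket(sorted_files):
    """K-way merge of K sorted signature streams for one bucket.

    Emits candidate pairs (file_id1, doc_id1, file_id2, doc_id2) whenever
    two consecutive records in the merged order share a band signature.
    """
    # heapq.merge keeps a K-element heap internally: O(N log K) overall
    merged = heapq.merge(*sorted_files)
    pairs = []
    prev = None
    for sig, file_id, doc_id in merged:
        if prev is not None and prev[0] == sig:
            pairs.append((prev[1], prev[2], file_id, doc_id))
        prev = (sig, file_id, doc_id)
    return pairs

file_a = [((1, 7), 0, 0), ((3, 9), 0, 1)]
file_b = [((3, 9), 1, 0), ((5, 5), 1, 1)]
print(match_bucket([file_a, file_b]))  # [(0, 1, 1, 0)]
```

Note that a run of k identical signatures yields k-1 chained pairs rather than all k(k-1)/2 combinations; the later clustering stage unions these chains into a single duplicate group.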
Usage
This is the second stage of the 4-stage MinHash dedup pipeline, run after signature computation. Key properties:
- Requires `world_size` to be divisible by `num_buckets` (default 14)
- Supports partitioning a single bucket across multiple workers for parallelism
- Can operate in index-only mode (`only_dedup_in_index=True`) to only find duplicates relative to a reference index
- Produces binary `.dups` files containing matched pairs
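The divisibility constraint exists so that ranks map cleanly onto buckets. A minimal sketch of that mapping (the helper name is hypothetical, not a Datatrove API):

```python
def assign_worker(rank, world_size, num_buckets=14):
    """Map a worker rank to (bucket, slot-within-bucket).

    Hypothetical helper illustrating why world_size must be divisible
    by num_buckets: each bucket gets an equal number of worker slots.
    """
    if world_size % num_buckets != 0:
        raise ValueError("world_size must be divisible by num_buckets")
    workers_per_bucket = world_size // num_buckets
    return rank // workers_per_bucket, rank % workers_per_bucket

# With 28 workers over 14 buckets, each bucket is split between 2 workers:
print(assign_worker(rank=17, world_size=28, num_buckets=14))  # (8, 1)
```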
Theoretical Basis
The bucket matching stage leverages two key algorithmic principles:
LSH Bucket Collision: Two documents become a candidate pair in a given bucket when all r hashes in the band match. Given that each hash independently has probability s of matching (where s is the true Jaccard similarity), the probability of a collision in a single band is:
Pr[band match] = s^r
The probability of being identified as a candidate pair in at least one of b bands is:
Pr[candidate] = 1 - (1 - s^r)^b
This S-curve ensures that highly similar documents are almost certainly identified while dissimilar documents are rarely flagged.
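The S-curve is easy to evaluate numerically. The sketch below uses r = 8 hashes per bucket and b = 14 buckets (the default `num_buckets` mentioned above; `r = 8` is an assumed value for illustration):

```python
def candidate_probability(s, r, b):
    """Pr[candidate] = 1 - (1 - s^r)^b for true Jaccard similarity s,
    r hashes per band, and b bands (buckets)."""
    return 1 - (1 - s**r) ** b

# The S-curve: dissimilar docs rarely collide, near-duplicates almost always do
for s in (0.5, 0.8, 0.95):
    print(f"s={s}: Pr[candidate]={candidate_probability(s, r=8, b=14):.4f}")
```

With these parameters, a pair at similarity 0.5 is flagged only about 5% of the time, while a pair at 0.95 is flagged with probability above 0.99.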
K-way Merge Complexity: The heap-based merge of K sorted files containing a total of N records runs in O(N log K) time. Each pop/push operation on the priority queue costs O(log K), and every record is processed exactly once. This is optimal for external merge of sorted streams and is well-suited to the distributed setting where each worker (rank) produces one sorted file per bucket.
Hash Range Partitioning: When multiple workers process the same bucket, the hash space is partitioned into contiguous ranges by reading boundary values from the first signature file. This ensures:
- Each worker processes a disjoint range of signatures
- No duplicate pairs are missed or double-counted
- Load is approximately balanced (assuming uniform hash distribution)
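A minimal sketch of how boundary values could be sampled from the first sorted signature file to split a bucket among workers (hypothetical helper, assuming signatures fit in a list for illustration):

```python
def partition_boundaries(first_file_sigs, workers_per_bucket):
    """Pick workers_per_bucket - 1 boundary signatures from a sorted file.

    Worker i then processes the contiguous signature range
    [boundary[i-1], boundary[i]) across ALL files in the bucket, so ranges
    are disjoint and no pair is missed or double-counted.
    """
    n = len(first_file_sigs)
    return [first_file_sigs[(i * n) // workers_per_bucket]
            for i in range(1, workers_per_bucket)]

sigs = sorted([3, 9, 14, 20, 27, 31, 40, 55])
print(partition_boundaries(sigs, 4))  # [14, 27, 40]
```

Because the boundaries come from one file only, balance is approximate: it relies on the hash values being roughly uniformly distributed across files.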