Principle: Huggingface Datatrove MinHash Bucket Matching
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Finding candidate duplicate pairs by matching documents with identical hash signatures within LSH buckets. This stage performs a distributed merge-sort across all signature files within each bucket to identify document pairs that share the same band signature, indicating they are potential near-duplicates.
Description
Bucket matching is the second stage of the MinHash deduplication pipeline. It takes the sorted binary signature files produced by the signature computation stage and identifies all document pairs that share an identical band signature within the same bucket.
The matching process works as follows:
- K-way merge: A heap-based priority queue is initialized with the first signature from each sorted file within a given bucket. Signatures are popped in sorted order and compared to the previously seen signature.
- Pair emission: When two consecutive signatures in the sorted stream have identical band hashes (all `hashes_per_bucket` values match), a duplicate pair is emitted as a 4-tuple: `(file_id1, doc_id1, file_id2, doc_id2)`.
- Distributed processing: The work is distributed across workers, with `world_size` required to be divisible by `num_buckets`. Multiple workers can process a single bucket by partitioning the hash range based on boundary values from the first signature file.
- Index support: The stage supports incremental deduplication via index files. Previously processed datasets can be stored as index files, and new documents can be deduplicated against both the current batch and the historical index. New unique signatures can optionally be saved as a new index for future use.
A sentinel value (2^32 - 1, 2^32 - 1) is used to represent matches against index entries (which lack file/document IDs).
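The merge-and-compare loop described above can be sketched in a few lines of Python. This is an illustrative toy, not Datatrove's actual implementation: each "file" is modeled as a sorted in-memory list of `(band_signature, file_id, doc_id)` tuples, and pairs are emitted whenever two consecutive records in the merged stream share a signature.

```python
import heapq

# Sentinel standing in for index entries, which carry no file/doc IDs
SENTINEL = (2**32 - 1, 2**32 - 1)

def match_bucket(sorted_files):
    """K-way merge of K sorted signature streams for one bucket.

    Emits candidate pairs (file_id1, doc_id1, file_id2, doc_id2) whenever
    two consecutive records in the merged order share a band signature.
    """
    # heapq.merge keeps a K-element heap internally: O(N log K) overall
    merged = heapq.merge(*sorted_files)
    pairs = []
    prev = None
    for sig, file_id, doc_id in merged:
        if prev is not None and prev[0] == sig:
            pairs.append((prev[1], prev[2], file_id, doc_id))
        prev = (sig, file_id, doc_id)
    return pairs

file_a = [((1, 7), 0, 0), ((3, 9), 0, 1)]
file_b = [((3, 9), 1, 0), ((5, 5), 1, 1)]
print(match_bucket([file_a, file_b]))  # [(0, 1, 1, 0)]
```

Note that a run of k identical signatures yields k-1 chained pairs rather than all k(k-1)/2 combinations; the later clustering stage unions these chains into a single duplicate group.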
Usage
This is the second stage of the 4-stage MinHash dedup pipeline, run after signature computation. Key properties:
- Requires `world_size` to be divisible by `num_buckets` (default 14)
- Supports partitioning a single bucket across multiple workers for parallelism
- Can operate in index-only mode (`only_dedup_in_index=True`) to only find duplicates relative to a reference index
- Produces binary `.dups` files containing matched pairs
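The divisibility constraint exists so that ranks map cleanly onto buckets. A minimal sketch of that mapping (the helper name is hypothetical, not a Datatrove API):

```python
def assign_worker(rank, world_size, num_buckets=14):
    """Map a worker rank to (bucket, slot-within-bucket).

    Hypothetical helper illustrating why world_size must be divisible
    by num_buckets: each bucket gets an equal number of worker slots.
    """
    if world_size % num_buckets != 0:
        raise ValueError("world_size must be divisible by num_buckets")
    workers_per_bucket = world_size // num_buckets
    return rank // workers_per_bucket, rank % workers_per_bucket

# With 28 workers over 14 buckets, each bucket is split between 2 workers:
print(assign_worker(rank=17, world_size=28, num_buckets=14))  # (8, 1)
```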
Theoretical Basis
The bucket matching stage leverages two key algorithmic principles:
LSH Bucket Collision: Two documents become a candidate pair in a given bucket when all r hashes in the band match. Given that each hash independently has probability s of matching (where s is the true Jaccard similarity), the probability of a collision in a single band is:
Pr[band match] = s^r
The probability of being identified as a candidate pair in at least one of b bands is:
Pr[candidate] = 1 - (1 - s^r)^b
This S-curve ensures that highly similar documents are almost certainly identified while dissimilar documents are rarely flagged.
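The S-curve is easy to evaluate numerically. The sketch below uses r = 8 hashes per bucket and b = 14 buckets (the default `num_buckets` mentioned above; `r = 8` is an assumed value for illustration):

```python
def candidate_probability(s, r, b):
    """Pr[candidate] = 1 - (1 - s^r)^b for true Jaccard similarity s,
    r hashes per band, and b bands (buckets)."""
    return 1 - (1 - s**r) ** b

# The S-curve: dissimilar docs rarely collide, near-duplicates almost always do
for s in (0.5, 0.8, 0.95):
    print(f"s={s}: Pr[candidate]={candidate_probability(s, r=8, b=14):.4f}")
```

With these parameters, a pair at similarity 0.5 is flagged only about 5% of the time, while a pair at 0.95 is flagged with probability above 0.99.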
K-way Merge Complexity: The heap-based merge of K sorted files containing a total of N records runs in O(N log K) time. Each pop/push operation on the priority queue costs O(log K), and every record is processed exactly once. This is optimal for external merge of sorted streams and is well-suited to the distributed setting where each worker (rank) produces one sorted file per bucket.
Hash Range Partitioning: When multiple workers process the same bucket, the hash space is partitioned into contiguous ranges by reading boundary values from the first signature file. This ensures:
- Each worker processes a disjoint range of signatures
- No duplicate pairs are missed or double-counted
- Load is approximately balanced (assuming uniform hash distribution)
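A minimal sketch of how boundary values could be sampled from the first sorted signature file to split a bucket among workers (hypothetical helper, assuming signatures fit in a list for illustration):

```python
def partition_boundaries(first_file_sigs, workers_per_bucket):
    """Pick workers_per_bucket - 1 boundary signatures from a sorted file.

    Worker i then processes the contiguous signature range
    [boundary[i-1], boundary[i]) across ALL files in the bucket, so ranges
    are disjoint and no pair is missed or double-counted.
    """
    n = len(first_file_sigs)
    return [first_file_sigs[(i * n) // workers_per_bucket]
            for i in range(1, workers_per_bucket)]

sigs = sorted([3, 9, 14, 20, 27, 31, 40, 55])
print(partition_boundaries(sigs, 4))  # [14, 27, 40]
```

Because the boundaries come from one file only, balance is approximate: it relies on the hash values being roughly uniformly distributed across files.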