Principle: Huggingface Datatrove MinHash Signature Computation
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Computing compact MinHash signatures from document text for efficient approximate similarity detection. This is the first stage of the 4-stage MinHash deduplication pipeline and converts each document into a fixed-size hash signature that can be used to rapidly estimate pairwise Jaccard similarity without comparing full document contents.
Description
MinHash signature computation converts documents into fixed-size hash signatures that approximate Jaccard similarity. The process consists of four steps:
- Tokenize the document text into word n-grams (shingles). Text is first normalized (lowercased, whitespace-normalized, punctuation removed) and then tokenized using a language-specific word tokenizer. Consecutive words are grouped into n-grams of a configurable size (default: 5-grams).
- Hash each shingle using a hash function (configurable between 32-bit and 64-bit precision) to produce a set of integer hash values.
- Apply multiple permutation functions to each hashed shingle. For each of the `num_buckets * hashes_per_bucket` total hash functions (default: 14 * 8 = 112), compute `(a * h + b) mod p`, where `p` is the Mersenne prime 2^61 - 1 and `a`, `b` are randomly generated parameters. Take the minimum hash value across all shingles for each function to form the signature.
- Organize the 112 hash values into 14 bands (buckets) of 8 hashes each. Each band serves as an independent LSH (Locality-Sensitive Hashing) probe: documents that share identical band signatures in any bucket are candidate duplicates.
The resulting signatures are written as sorted binary files per bucket per processing rank. Sorting by signature enables efficient k-way merge in the subsequent bucket matching stage. The implementation verifies data integrity using Blake2b checksums after writing sorted records.
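The four steps above can be sketched in a minimal, self-contained form. This is an illustration, not the Datatrove source: the normalization regex, the choice of Blake2b as the base hash, and the seeded `random` parameters are stand-ins for the library's configurable equivalents.

```python
import hashlib
import random
import re

# Defaults described above (illustrative constants, not Datatrove's own names).
NUM_BUCKETS = 14
HASHES_PER_BUCKET = 8
NUM_HASHES = NUM_BUCKETS * HASHES_PER_BUCKET  # 112
NGRAM_SIZE = 5
MERSENNE_PRIME = (1 << 61) - 1

rng = random.Random(42)
# One (a, b) pair per hash function for the universal hash (a*h + b) mod p.
PERMUTATIONS = [(rng.randint(1, MERSENNE_PRIME - 1), rng.randint(0, MERSENNE_PRIME - 1))
                for _ in range(NUM_HASHES)]

def shingles(text, n=NGRAM_SIZE):
    # Step 1: normalize (lowercase, strip punctuation) and build word n-grams.
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def base_hash(shingle):
    # Step 2: 64-bit hash of each shingle (Blake2b used here as a stand-in).
    return int.from_bytes(hashlib.blake2b(shingle.encode(), digest_size=8).digest(), "big")

def signature(text):
    hashes = [base_hash(s) for s in shingles(text)]
    # Step 3: minimum permuted hash across all shingles, per hash function.
    sig = [min((a * h + b) % MERSENNE_PRIME for h in hashes) for a, b in PERMUTATIONS]
    # Step 4: group the 112 values into 14 bands of 8.
    return [tuple(sig[i:i + HASHES_PER_BUCKET]) for i in range(0, NUM_HASHES, HASHES_PER_BUCKET)]
```

Each returned band tuple is what the later bucket-matching stage compares for exact equality.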
Usage
This is the first stage of the 4-stage MinHash dedup pipeline and runs in parallel across shards. Each worker:
- Reads documents from the input pipeline
- Computes MinHash signatures for each document
- Writes one sorted binary `.minhash.sig` file per bucket
- Supports skipping signature computation if valid files already exist (`skip_existing_sigs` mode)
Default configuration: 5-grams, 14 buckets, 8 hashes per bucket = 112 total hash functions. The similarity threshold is approximately (1/14)^(1/8) ≈ 0.72 Jaccard similarity.
Theoretical Basis
The MinHash signature computation is grounded in two foundational results:
MinHash Probability Theorem (Broder, 1997): For two sets A and B, the probability that a random min-wise independent permutation produces the same minimum hash is exactly equal to the Jaccard similarity:
Pr[min(h(A)) = min(h(B))] = J(A,B) = |A ∩ B| / |A ∪ B|
By using multiple independent hash functions, the fraction of matching minimums in the signature converges to the true Jaccard similarity.
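This convergence is easy to verify empirically. The toy check below (set contents, seed, and the use of 1000 hash functions are arbitrary choices for illustration) estimates Jaccard similarity as the fraction of matching minimums:

```python
import random

MERSENNE_PRIME = (1 << 61) - 1
rng = random.Random(0)
# 1000 independent universal hash functions h'(x) = (a*x + b) mod p.
PERMS = [(rng.randint(1, MERSENNE_PRIME - 1), rng.randint(0, MERSENNE_PRIME - 1))
         for _ in range(1000)]

def minhash(items):
    # Minimum hash per function, taken over all elements of the set.
    return [min((a * x + b) % MERSENNE_PRIME for x in items) for a, b in PERMS]

A = set(range(0, 100))
B = set(range(50, 150))
true_jaccard = len(A & B) / len(A | B)  # 50 / 150 ≈ 0.333
matches = sum(ma == mb for ma, mb in zip(minhash(A), minhash(B)))
estimated_jaccard = matches / len(PERMS)  # converges toward true_jaccard
```

With 1000 hash functions the standard error of the estimate is roughly sqrt(J(1-J)/1000) ≈ 0.015 here, which is why the production default of 112 functions gives only an approximate similarity estimate.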
Locality-Sensitive Hashing (LSH): With b bands of r rows each, the probability that two documents become a candidate pair is:
Pr[candidate] = 1 - (1 - s^r)^b
where s is the Jaccard similarity. With the default configuration of b=14 bands and r=8 hashes per band:
- Threshold: (1/b)^(1/r) = (1/14)^(1/8) ≈ 0.72
- At s=0.8: 1 - (1 - 0.8^8)^14 ≈ 0.924 (92.4% probability of being flagged as a candidate)
- At s=0.5: 1 - (1 - 0.5^8)^14 ≈ 0.053 (5.3% probability -- most dissimilar pairs are correctly ignored)
The S-curve behavior provides a sharp transition around the threshold, balancing recall of true duplicates against false positive rate.
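The threshold and the two probabilities above follow directly from the candidate-pair formula; a quick sanity check:

```python
def candidate_prob(s: float, b: int = 14, r: int = 8) -> float:
    """Probability that two documents with Jaccard similarity s share an
    identical signature in at least one of b bands of r rows each."""
    return 1 - (1 - s ** r) ** b

# Similarity at which the candidate probability crosses roughly 0.5.
threshold = (1 / 14) ** (1 / 8)  # ≈ 0.72
```

Plotting `candidate_prob` over s in [0, 1] reproduces the S-curve: near-zero below the threshold, near-one above it.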
Permutation Hashing: Instead of truly independent hash functions, the implementation uses universal hashing with the formula h'(x) = (a * x + b) mod p, where p = 2^61 - 1 is a Mersenne prime. This provides near-optimal min-wise independence with efficient computation via modular arithmetic.
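One practical benefit of a Mersenne prime modulus is that reduction mod 2^61 - 1 needs no division: because 2^61 ≡ 1 (mod p), the high bits can simply be folded onto the low bits. A sketch of this standard trick (not taken from the Datatrove source, which may organize the computation differently):

```python
MERSENNE_PRIME = (1 << 61) - 1

def mod_mersenne(x: int) -> int:
    # Fold high bits onto low bits twice (enough for x < 2^122,
    # i.e. any product of two 61-bit values), then a final correction.
    x = (x & MERSENNE_PRIME) + (x >> 61)
    x = (x & MERSENNE_PRIME) + (x >> 61)
    return x - MERSENNE_PRIME if x >= MERSENNE_PRIME else x

def universal_hash(x: int, a: int, b: int) -> int:
    # h'(x) = (a*x + b) mod p, computed without an explicit division.
    return mod_mersenne(a * x + b)
```

In compiled implementations this folding replaces a slow modulo instruction with shifts, masks, and adds; in pure Python the payoff is smaller, but the arithmetic is identical.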