Heuristic:Huggingface Datatrove MinHash Parameter Tuning
| Knowledge Sources | |
|---|---|
| Domains | Deduplication, Data_Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Mathematical rationale for MinHash LSH parameter choices (5-grams, 14 buckets, 8 hashes per bucket) that produce a Jaccard similarity threshold of ~0.72 with 92.4% recall at similarity 0.8.
Description
MinHash Locality-Sensitive Hashing (LSH) for near-duplicate detection has three key parameters: n-gram size (shingle length), number of buckets, and hashes per bucket. These parameters control the trade-off between sensitivity (catching duplicates), specificity (avoiding false positives), and computational cost (memory and time). Datatrove's defaults are mathematically tuned using the LSH probability formula to achieve a practical balance for web-scale deduplication.
Usage
Use this heuristic when configuring the MinHash deduplication pipeline (MinhashDedupSignature, MinhashDedupBuckets, MinhashDedupCluster, MinhashDedupFilter). The default parameters work well for web-crawled text datasets at the billion-document scale. Adjust parameters only if you have specific requirements for different similarity thresholds or if you need to trade memory for precision.
The Insight (Rule of Thumb)
- Action: Configure `MinhashConfig` with `n_grams=5`, `num_buckets=14`, `hashes_per_bucket=8`.
- Value: These defaults produce:
- Total signatures per document: 14 x 8 = 112
- Similarity threshold: `(1/14)^(1/8) ≈ 0.72` (documents above this Jaccard similarity are likely flagged)
- At Jaccard similarity 0.8: `1 - (1 - 0.8^8)^14 = 0.924` (92.4% probability of being detected as duplicate)
- At Jaccard similarity 0.5: very low detection probability (few false positives)
- Trade-off: More buckets or more hashes per bucket increase precision but require more memory and computation. 5-gram shingles balance between catching fuzzy matches (too small catches everything) and being too specific (too large misses paraphrases).
Scaling parallelism:
- Action: Set bucket processing tasks to `num_buckets * N` where N is workers per bucket.
- Value: Production FineWeb uses `14 * 50 = 700` tasks for bucket processing.
- Trade-off: More workers per bucket reduces wall-clock time but increases SLURM job overhead.
Mersenne prime for hashing:
- Action: Use `2^61 - 1` as the hash modulus (Mersenne prime).
- Value: Provides uniform hash distribution for 64-bit hash values.
Reasoning
The LSH probability formula governs how parameter choices map to detection probability:
Threshold formula: `threshold = (1/num_buckets)^(1/hashes_per_bucket)`
With 14 buckets and 8 hashes per bucket: `(1/14)^(1/8) ≈ 0.72`. This means documents with Jaccard similarity above 0.72 have a high probability of being placed in the same bucket.
Detection probability at similarity s: `P(detection) = 1 - (1 - s^hashes_per_bucket)^num_buckets`
At s=0.8: `1 - (1 - 0.8^8)^14 = 1 - (1 - 0.1678)^14 = 1 - 0.0756 = 0.924`
The 5-gram choice for shingle size is a practical balance: smaller n-grams (2-3) produce shingles that are too common across unrelated documents, inflating false positive rates. Larger n-grams (7+) are too specific and miss documents with minor edits or reorderings.
Code evidence from `src/datatrove/pipeline/dedup/minhash.py:27-35`:
# http://en.wikipedia.org/wiki/Mersenne_prime
_mersenne_prime = np.uint64((1 << 61) - 1)
"""
n_grams -> roughly nr of words (this should be small enough to catch fuzzy matches
but big enough to not have each shingle be too common)
threshold is (1/14)^(1/8)~0.72
threshold is real minhash similarity cutoff for high probability inclusion by LSH minhash
probability of inclusion for s=0.8: 1-(1-0.8^8)^14=0.924
"""
Code evidence from `src/datatrove/pipeline/dedup/minhash.py:40-54`:
@dataclass
class MinhashConfig:
n_grams: int = 5
num_buckets: int = 14
hashes_per_bucket: int = 8
seed: int = 1
Production scaling from `examples/fineweb.py:129`:
tasks=minhash_config.num_buckets * 50 # 14 * 50 = 700 tasks
Related Pages
- Implementation:Huggingface_Datatrove_MinhashDedupSignature
- Implementation:Huggingface_Datatrove_MinhashDedupBuckets
- Implementation:Huggingface_Datatrove_MinhashDedupCluster
- Implementation:Huggingface_Datatrove_MinhashDedupFilter
- Principle:Huggingface_Datatrove_MinHash_Signature_Computation
- Principle:Huggingface_Datatrove_MinHash_Bucket_Matching