
Heuristic:Huggingface Datatrove MinHash Parameter Tuning

From Leeroopedia
Knowledge Sources
Domains Deduplication, Data_Processing
Last Updated 2026-02-14 17:00 GMT

Overview

Mathematical rationale for MinHash LSH parameter choices (5-grams, 14 buckets, 8 hashes per bucket) that produce a Jaccard similarity threshold of ~0.72 with 92.4% recall at similarity 0.8.

Description

MinHash Locality-Sensitive Hashing (LSH) for near-duplicate detection has three key parameters: n-gram size (shingle length), number of buckets, and hashes per bucket. These parameters control the trade-off between sensitivity (catching duplicates), specificity (avoiding false positives), and computational cost (memory and time). Datatrove's defaults are mathematically tuned using the LSH probability formula to achieve a practical balance for web-scale deduplication.

Usage

Use this heuristic when configuring the MinHash deduplication pipeline (MinhashDedupSignature, MinhashDedupBuckets, MinhashDedupCluster, MinhashDedupFilter). The default parameters work well for web-crawled text datasets at the billion-document scale. Adjust parameters only if you have specific requirements for different similarity thresholds or if you need to trade memory for precision.

The Insight (Rule of Thumb)

  • Action: Configure `MinhashConfig` with `n_grams=5`, `num_buckets=14`, `hashes_per_bucket=8`.
  • Value: These defaults produce:
    • Total signatures per document: 14 x 8 = 112
    • Similarity threshold: `(1/14)^(1/8) ≈ 0.72` (documents above this Jaccard similarity are likely flagged)
    • At Jaccard similarity 0.8: `1 - (1 - 0.8^8)^14 = 0.924` (92.4% probability of being detected as duplicate)
    • At Jaccard similarity 0.5: very low detection probability (few false positives)
  • Trade-off: More buckets or more hashes per bucket increase precision but require more memory and computation. 5-gram shingles strike a balance: smaller shingles are common across unrelated documents and inflate false positives, while larger ones are so specific that paraphrases and lightly edited duplicates are missed.
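The quoted numbers follow directly from the LSH S-curve formulas and can be checked with a few lines of arithmetic (a minimal sketch using only the standard library; `p_detect` is an illustrative helper name, not a Datatrove function):

```python
# Verify the default-parameter numbers: threshold ~0.72, P(detect|s=0.8) ~0.924.
num_buckets = 14
hashes_per_bucket = 8

# Approximate Jaccard similarity at which detection probability rises steeply.
threshold = (1 / num_buckets) ** (1 / hashes_per_bucket)

def p_detect(s: float, b: int = num_buckets, r: int = hashes_per_bucket) -> float:
    """Probability two docs with Jaccard similarity s collide in >= 1 bucket."""
    return 1 - (1 - s ** r) ** b

print(f"threshold            ~ {threshold:.2f}")      # ~0.72
print(f"P(detect | s = 0.8)  = {p_detect(0.8):.3f}")  # ~0.924
print(f"P(detect | s = 0.5)  = {p_detect(0.5):.3f}")  # ~0.053
```

The s=0.5 case shows why these defaults are safe for web data: dissimilar documents collide in some bucket only about 5% of the time.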

Scaling parallelism:

  • Action: Set bucket processing tasks to `num_buckets * N` where N is workers per bucket.
  • Value: Production FineWeb uses `14 * 50 = 700` tasks for bucket processing.
  • Trade-off: More workers per bucket reduce wall-clock time but increase SLURM job overhead.

Mersenne prime for hashing:

  • Action: Use `2^61 - 1` as the hash modulus (Mersenne prime).
  • Value: A Mersenne prime modulus keeps hash values uniformly distributed while allowing fast modular reduction (shifts and adds instead of division) within 64-bit arithmetic.
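The reason `2^61 - 1` is attractive is that reduction modulo a Mersenne prime needs no division: since `2^61 ≡ 1 (mod p)`, the high and low 61-bit halves of a value can simply be added. A minimal sketch (illustrative only, not Datatrove's exact code; `mod_mersenne` is a hypothetical helper):

```python
# Fast reduction modulo the Mersenne prime p = 2^61 - 1 via shift-and-add.
MERSENNE_PRIME = (1 << 61) - 1

def mod_mersenne(x: int) -> int:
    """Compute x mod (2^61 - 1) without division, valid for 0 <= x < 2^122."""
    # Fold the high bits onto the low bits twice: 2^61 ≡ 1 (mod p).
    x = (x & MERSENNE_PRIME) + (x >> 61)
    x = (x & MERSENNE_PRIME) + (x >> 61)
    # One conditional subtraction brings the result into [0, p).
    if x >= MERSENNE_PRIME:
        x -= MERSENNE_PRIME
    return x

# Matches the ordinary % operator:
x = 123456789123456789123456789
assert mod_mersenne(x) == x % MERSENNE_PRIME
```

The bound `x < 2^122` covers products of two 61-bit values, which is what universal hashing of the form `(a*x + b) mod p` produces.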

Reasoning

The LSH probability formula governs how parameter choices map to detection probability:

Threshold formula: `threshold = (1/num_buckets)^(1/hashes_per_bucket)`

With 14 buckets and 8 hashes per bucket: `(1/14)^(1/8) ≈ 0.72`. This means documents with Jaccard similarity above 0.72 have a high probability of colliding in at least one bucket and being flagged as duplicates.

Detection probability at similarity s: `P(detection) = 1 - (1 - s^hashes_per_bucket)^num_buckets`

At s=0.8: `1 - (1 - 0.8^8)^14 = 1 - (0.8322)^14 ≈ 1 - 0.0765 ≈ 0.924`

The 5-gram choice for shingle size is a practical balance: smaller n-grams (2-3) produce shingles that are too common across unrelated documents, inflating false positive rates. Larger n-grams (7+) are too specific and miss documents with minor edits or reorderings.
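To make the shingle-size intuition concrete, here is word-level 5-gram shingling together with the exact Jaccard similarity that MinHash approximates (a sketch with hypothetical helper names, not Datatrove's implementation, which also normalizes text before shingling):

```python
# Word-level n-gram shingling and exact Jaccard similarity.
def shingles(text: str, n: int = 5) -> set:
    """Return the set of word n-grams (shingles) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |a & b| / |a | b|."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the sleepy dog near the river bank"

# A one-word edit destroys every 5-gram overlapping it (5 of 9 here),
# so similarity drops sharply even for nearly identical documents.
print(jaccard(shingles(doc_a), shingles(doc_b)))  # 4/14 ~ 0.29
```

This illustrates the sensitivity trade-off: with 2-grams the same pair would score far higher, while with 7-grams a single edit would wipe out 7 shingles, making the measure even harsher on paraphrases.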

Code evidence from `src/datatrove/pipeline/dedup/minhash.py:27-35`:

# http://en.wikipedia.org/wiki/Mersenne_prime
_mersenne_prime = np.uint64((1 << 61) - 1)

"""
n_grams -> roughly nr of words (this should be small enough to catch fuzzy matches
           but big enough to not have each shingle be too common)
threshold is (1/14)^(1/8)~0.72
threshold is real minhash similarity cutoff for high probability inclusion by LSH minhash
probability of inclusion for s=0.8: 1-(1-0.8^8)^14=0.924
"""

Code evidence from `src/datatrove/pipeline/dedup/minhash.py:40-54`:

@dataclass
class MinhashConfig:
    n_grams: int = 5
    num_buckets: int = 14
    hashes_per_bucket: int = 8
    seed: int = 1

Production scaling from `examples/fineweb.py:129`:

tasks=minhash_config.num_buckets * 50  # 14 * 50 = 700 tasks
