Heuristic: ChenghaoMou text-dedup Mersenne Prime Backward Compatibility
| Knowledge Sources | |
|---|---|
| Domains | Optimization, MinHash, Hashing |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
The 64-bit MinHash configuration retains Mersenne prime modular arithmetic and byteswap operations solely for backward compatibility; testing shows neither provides a performance benefit.
Description
Legacy MinHash implementations (e.g., datasketch) used Mersenne primes (specifically 2^61 - 1) for the modulo operation in universal hashing because reduction modulo a Mersenne prime can be implemented with shifts, masks, and additions instead of a division instruction. The text-dedup codebase tested this assumption and found no measurable benefit in modern NumPy. Similarly, the `.byteswap()` call on hash band values was originally added for speed, but empirical testing shows negligible impact. Both are retained only to maintain hash compatibility with existing data.
Usage
Apply this knowledge when choosing hash_bits for MinHash. The 64-bit mode uses legacy Mersenne prime arithmetic for backward compatibility with datasketch-compatible hashes. The 32-bit and 16-bit modes use simpler `2^n - c` formulas instead. If backward compatibility with existing MinHash signatures is not needed, 32-bit hashes are recommended for better memory efficiency with equivalent deduplication quality.
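The memory claim is easy to quantify. A minimal sketch (array shapes and counts here are hypothetical, not taken from the codebase) comparing per-signature storage for the 64-bit and 32-bit dtypes:

```python
import numpy as np

# Hypothetical workload: 250 permutations per signature, 1M documents.
num_perm, num_docs = 250, 1_000_000

sig64 = np.zeros(num_perm, dtype=np.uint64)  # hash_bits=64 path
sig32 = np.zeros(num_perm, dtype=np.uint32)  # hash_bits=32 path

bytes64 = sig64.nbytes * num_docs
bytes32 = sig32.nbytes * num_docs
print(bytes64 // 2**20, bytes32 // 2**20)  # ~1907 MiB vs ~953 MiB of signatures
```

Since the 64-bit mode stores almost entirely 32-bit data in 64-bit slots, halving the dtype width halves signature memory with no loss of information.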
The Insight (Rule of Thumb)
- Action: Use `hash_bits=32` for new projects; only use `hash_bits=64` if backward compatibility with datasketch is required.
- Value: 64-bit uses Mersenne prime 2^61-1; 32-bit uses 2^32-5; 16-bit uses 2^16-15.
- Trade-off: 64-bit doubles memory usage per hash with no deduplication quality benefit. Lower precision causes more hash collisions, biasing toward finding more duplicates.
- Compatibility: The byteswap operation in band value serialization is kept for data compatibility despite negligible performance impact.
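All three moduli must be prime for the universal hashing guarantees to hold; the `2^n - c` constants are simply the largest primes below each power of two. A quick standalone check (not part of the codebase): trial division for the small moduli, and the Lucas-Lehmer test for the Mersenne number 2^61 - 1.

```python
def is_prime(n: int) -> bool:
    # Simple trial division; fine up to ~2^32.
    if n < 2:
        return False
    f = 2
    while f * f <= n:
        if n % f == 0:
            return False
        f += 1
    return True

def lucas_lehmer(p: int) -> bool:
    # Primality test specific to Mersenne numbers 2^p - 1 (p an odd prime).
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0
```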
Reasoning
The universal hashing formula is `new_hash = ((a * x + b) mod prime) & max_hash`, where the bitwise AND with `max_hash = 2^32 - 1` acts as the final reduction into the output range. Mersenne primes were historically preferred because division by them can be replaced by bit shifts and additions. Modern CPUs and NumPy's optimized integer operations make this optimization irrelevant. Testing confirms:
- Mersenne prime modulo: No speed benefit on modern hardware.
- Byteswap: No speed benefit; kept for backward compatibility per datasketch issue #114.
- Lower-bit hashes: More collisions lead to more documents flagged as duplicates. The 16-bit mode "assumes a perfect hash" which is challenged at lower precision, so the optimal band/row calculation may be less accurate.
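The permutation step can be sketched in a few lines. This is an illustration of the universal hashing scheme described above, not the codebase's API; `permute`, the stand-in shingle hashes, and the RNG seed are all assumptions.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1  # AND with 2^32 - 1 is reduction mod 2^32

def permute(x: int, a: int, b: int) -> int:
    # One universal-hash "permutation": ((a*x + b) mod prime) & max_hash
    return ((a * x + b) % MERSENNE_PRIME) & MAX_HASH

rng = random.Random(42)
a = rng.randrange(1, MERSENNE_PRIME)
b = rng.randrange(0, MERSENNE_PRIME)

# Stand-in shingle hashes; a real pipeline hashes tokenized n-grams.
shingle_hashes = [0xDEADBEEF, 0xCAFEBABE, 0x8BADF00D]
# The MinHash signature entry for this permutation is the minimum over shingles.
sig_entry = min(permute(x, a, b) for x in shingle_hashes)
```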
Code Evidence
Hash configuration with Mersenne prime rationale from `src/text_dedup/config/algorithms/minhash.py:96-108`:
```python
# 64bit config is backwards compatibility mode.
# it uses 64bit types but almost entirely 32bit data, except for one mersenne prime 2^61
# why legacy implementations used mersenne primes for modulo:
# https://en.wikipedia.org/wiki/Universal_hashing#Hashing_strings
# (dtype, max_hash, modulo_prime)
_hash_config: dict[int, tuple[type, Any, Any]] = {
    64: (np.uint64, np.uint32((1 << 32) - 1), np.uint64((1 << 61) - 1)),
    # 32, 16bit configs do not use a mersenne prime number.
    # The original reason for using mersenne prime was speed.
    # Testing reveals there is no benefit to using a 2^61 mersenne prime number for division
    32: (np.uint32, np.uint32((1 << 32) - 1), np.uint32((1 << 32) - 5)),
    16: (np.uint16, np.uint16((1 << 16) - 1), np.uint16((1 << 16) - 15)),
}
```
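For context, the historical "speed" argument rests on a bit trick: for a Mersenne modulus `p = 2^k - 1`, `x mod p` can be computed by repeatedly folding the high bits into the low bits, with no division. A minimal sketch of that trick (the function name is illustrative, not from the codebase):

```python
K = 61
P = (1 << K) - 1  # Mersenne prime 2^61 - 1

def mod_mersenne(x: int) -> int:
    # x mod (2^k - 1): fold high bits onto low bits until the value fits,
    # then map the edge case x == P to 0.
    while x > P:
        x = (x & P) + (x >> K)
    return 0 if x == P else x
```

On modern hardware with NumPy's vectorized `%`, this fold saves nothing measurable, which is exactly the finding quoted in the comments above.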
Byteswap backward compatibility from `src/text_dedup/config/algorithms/minhash.py:229-234`:
```python
# Originally, byteswap was done for speed. Testing shows it has a negligible impact
# keeping for backward compatibility, even though theoretically and empirically
# it doesn't matter if it is there or not. github.com/ekzhu/datasketch/issues/114
return {
    "__band_idx__": list(range(len(hash_ranges))),
    "__band_val__": [bytes(hashvalues[start:end].byteswap().data) for (start, end) in hash_ranges],
    self.internal_index_column: [idx for _ in hash_ranges],
}
```
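Why the byteswap is semantically irrelevant: LSH only compares band keys for equality, and equal band values serialize to equal bytes under either byte order. A small standalone illustration using `struct` (the `band_key` helper and the sample values are hypothetical, standing in for the NumPy `.byteswap().data` serialization):

```python
import struct

def band_key(values: tuple[int, ...], byteswap: bool = True) -> bytes:
    # byteswap=True mimics big-endian output of .byteswap().data on a
    # little-endian machine; byteswap=False is the native layout.
    fmt = (">" if byteswap else "<") + "%dI" % len(values)
    return struct.pack(fmt, *values)

band = (0x11223344, 0xAABBCCDD)        # hypothetical 32-bit band hash values
same_band = (0x11223344, 0xAABBCCDD)   # another doc, identical band
other_band = (0x55667788, 0x99AABBCC)  # a differing band
```

Either convention buckets documents identically; only the stored bytes differ, which is why the swap must be kept once existing indexes contain the swapped form.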
Lower precision warning from `src/text_dedup/config/algorithms/minhash.py:117-119`:
```python
# The following assumes a "perfect hash". using 16bit hashes might challenge this assumption
# lower precision dtype will cause more collisions, so higher false_positives and fewer false negatives.
# Both effects move the result towards more documents being considered duplicates.
```
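The collision concern can be made concrete with the birthday-bound approximation `P(collision) ≈ 1 - exp(-n(n-1) / 2·2^b)` for `n` uniform `b`-bit hashes. A short sketch (the document count of 1000 shingles is an illustrative assumption):

```python
import math

def collision_prob(n: int, bits: int) -> float:
    # Birthday-bound approximation: probability of at least one collision
    # among n uniformly random b-bit hashes.
    space = 2 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))

# e.g. ~1000 distinct shingles in a long document:
p16 = collision_prob(1000, 16)  # near-certain collisions in a 65536-slot space
p32 = collision_prob(1000, 32)  # collisions remain rare
```

At 16 bits, collisions within a single document's shingle set are essentially guaranteed, which is why the "perfect hash" assumption breaks down and results skew toward over-detecting duplicates.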