
Heuristic: ChenghaoMou text-dedup Mersenne Prime Backward Compatibility

From Leeroopedia
Knowledge Sources
Domains Optimization, MinHash, Hashing
Last Updated 2026-02-14 21:00 GMT

Overview

The 64-bit MinHash configuration retains Mersenne prime modular arithmetic and byteswap operations solely for backward compatibility; testing shows neither provides a performance benefit.

Description

Legacy MinHash implementations (e.g., datasketch) used Mersenne primes (specifically 2^61 - 1) as the modulus in universal hashing because reduction modulo a Mersenne prime can be replaced with bit shifts and masks instead of division. The text-dedup codebase tested this assumption and found no measurable benefit in modern NumPy. Similarly, the `.byteswap()` call on hash band values was originally added for speed, but empirical testing shows it has negligible impact. Both are retained only to maintain hash compatibility with existing data.
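The bit trick that motivated the choice of 2^61 - 1 can be sketched in pure Python (a minimal illustration, not the library's code): since 2^61 ≡ 1 (mod 2^61 - 1), the high bits of a value can be folded onto the low bits instead of performing a division.

```python
# Minimal sketch of Mersenne-prime reduction for p = 2^61 - 1.
# Because 2^61 ≡ 1 (mod p), x mod p needs only shifts, masks, and adds.
MERSENNE_PRIME = (1 << 61) - 1

def mod_mersenne(x: int) -> int:
    """x mod (2^61 - 1) without division; valid for 0 <= x < 2^122."""
    x = (x >> 61) + (x & MERSENNE_PRIME)   # first fold: result is below 2^62
    x = (x >> 61) + (x & MERSENNE_PRIME)   # second fold: result is at most 2^61
    return x - MERSENNE_PRIME if x >= MERSENNE_PRIME else x
```

This is exactly the kind of micro-optimization that a hardware divider, or NumPy's vectorized `%`, renders moot on modern machines.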

Usage

Apply this knowledge when choosing hash_bits for MinHash. The 64-bit mode uses legacy Mersenne prime arithmetic for backward compatibility with datasketch-compatible hashes. The 32-bit and 16-bit modes use simpler `2^n - c` formulas instead. If backward compatibility with existing MinHash signatures is not needed, 32-bit hashes are recommended for better memory efficiency with equivalent deduplication quality.

The Insight (Rule of Thumb)

  • Action: Use `hash_bits=32` for new projects; only use `hash_bits=64` if backward compatibility with datasketch is required.
  • Value: 64-bit uses Mersenne prime 2^61-1; 32-bit uses 2^32-5; 16-bit uses 2^16-15.
  • Trade-off: 64-bit doubles memory usage per hash with no deduplication quality benefit. Lower precision causes more hash collisions, biasing toward finding more duplicates.
  • Compatibility: The byteswap operation in band value serialization is kept for data compatibility despite negligible performance impact.
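The memory trade-off in the bullets above is easy to see directly (a back-of-envelope sketch; 256 permutations is a common MinHash default, not a value taken from this codebase):

```python
import numpy as np

# Bytes per MinHash signature at each precision, for 256 permutations.
num_perm = 256
for dtype in (np.uint64, np.uint32, np.uint16):
    sig = np.empty(num_perm, dtype=dtype)
    print(f"{np.dtype(dtype).name}: {sig.nbytes} bytes per signature")
# uint64: 2048, uint32: 1024, uint16: 512
```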

Reasoning

The universal hashing formula is `new_hash = (a * x + b) mod prime mod max_hash`. Mersenne primes were historically preferred because division by them can be replaced by bit shifts and additions. Modern CPUs and NumPy's optimized integer operations make this optimization irrelevant. Testing confirms:

  • Mersenne prime modulo: No speed benefit on modern hardware.
  • Byteswap: No speed benefit; kept for backward compatibility per datasketch issue #114.
  • Lower-bit hashes: More collisions lead to more documents flagged as duplicates. The 16-bit mode "assumes a perfect hash" which is challenged at lower precision, so the optimal band/row calculation may be less accurate.
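The permutation step for the 32-bit configuration can be sketched as follows (an illustration of the formula, not the library's exact code; note that in practice `mod max_hash` is implemented as a bitwise AND, as in datasketch, since max_hash is all-ones):

```python
import numpy as np

# new_hash = (a * x + b) mod prime, then masked down to 32 bits.
# Intermediate math is done in uint64 so a * x + b (< 2^64) cannot overflow.
rng = np.random.default_rng(0)
PRIME = np.uint64((1 << 32) - 5)      # prime used by the 32-bit mode
MAX_HASH = np.uint64((1 << 32) - 1)

a = rng.integers(1, int(PRIME), dtype=np.uint64)
b = rng.integers(0, int(PRIME), dtype=np.uint64)
x = rng.integers(0, int(MAX_HASH), size=1000, dtype=np.uint64)  # token hashes

new_hash = np.bitwise_and((a * x + b) % PRIME, MAX_HASH).astype(np.uint32)
```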

Code Evidence

Hash configuration with Mersenne prime rationale from `src/text_dedup/config/algorithms/minhash.py:96-108`:

import numpy as np
from typing import Any

# 64bit config is backwards compatibility mode.
# It uses 64bit types but almost entirely 32bit data, except for one Mersenne prime, 2^61 - 1.
# Why legacy implementations used Mersenne primes for modulo:
# https://en.wikipedia.org/wiki/Universal_hashing#Hashing_strings
# (dtype, max_hash, modulo_prime)
_hash_config: dict[int, tuple[type, Any, Any]] = {
    64: (np.uint64, np.uint32((1 << 32) - 1), np.uint64((1 << 61) - 1)),
    # 32, 16bit configs do not use a Mersenne prime number.
    # The original reason for using a Mersenne prime was speed.
    # Testing reveals there is no benefit to using the 2^61 - 1 Mersenne prime for division.
    32: (np.uint32, np.uint32((1 << 32) - 1), np.uint32((1 << 32) - 5)),
    16: (np.uint16, np.uint16((1 << 16) - 1), np.uint16((1 << 16) - 15)),
}
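For reference, the shift-and-mask reduction that Mersenne primes enable can also be written vectorized (a sketch, valid for inputs below 2^62; NumPy's `%` is a single vectorized ufunc either way, which is consistent with benchmarking showing no gain):

```python
import numpy as np

MERSENNE = np.uint64((1 << 61) - 1)

def mod_plain(x: np.ndarray) -> np.ndarray:
    return x % MERSENNE

def mod_shift(x: np.ndarray) -> np.ndarray:
    # Fold the top bits down onto the low bits; valid for x < 2^62.
    r = (x >> np.uint64(61)) + (x & MERSENNE)
    return np.where(r >= MERSENNE, r - MERSENNE, r)

x = np.random.default_rng(1).integers(0, 1 << 62, size=100_000, dtype=np.uint64)
assert np.array_equal(mod_plain(x), mod_shift(x))
```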

Byteswap backward compatibility from `src/text_dedup/config/algorithms/minhash.py:229-234`:

# Originally, byteswap was done for speed. Testing shows it has a negligible impact;
# keeping it for backward compatibility, even though theoretically and empirically
# it doesn't matter whether it is there or not. github.com/ekzhu/datasketch/issues/114
return {
    "__band_idx__": list(range(len(hash_ranges))),
    "__band_val__": [bytes(hashvalues[start:end].byteswap().data) for (start, end) in hash_ranges],
    self.internal_index_column: [idx for _ in hash_ranges],
}
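What `.byteswap()` actually does to a band can be seen in isolation (a sketch; the dtype and values are illustrative): it reverses the byte order of each hash value before the buffer is serialized. Either order yields valid LSH band keys as long as it is applied consistently, which is why the call only matters for compatibility with previously serialized signatures.

```python
import numpy as np

band = np.array([0x01020304, 0x0A0B0C0D], dtype=np.uint32)

swapped = bytes(band.byteswap().data)  # byte order of each value reversed
plain = bytes(band.data)               # native (host) byte order

# On a little-endian host, swapped is b'\x01\x02\x03\x04\x0a\x0b\x0c\x0d'.
# Round-tripping recovers the original values regardless of host endianness:
recovered = np.frombuffer(swapped, dtype=np.uint32).byteswap()
assert recovered.tolist() == band.tolist()
```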

Lower precision warning from `src/text_dedup/config/algorithms/minhash.py:117-119`:

# The following assumes a "perfect hash". using 16bit hashes might challenge this assumption
# lower precision dtype will cause more collisions, so higher false_positives and fewer false negatives.
# Both effects move the result towards more documents being considered duplicates.
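A birthday-bound estimate makes the collision effect concrete (a back-of-envelope sketch, not a measurement from this codebase): the probability that at least two of n uniform b-bit hash values collide grows rapidly as b shrinks.

```python
import math

def collision_prob(n: int, bits: int) -> float:
    """Approximate P(at least one collision) among n uniform b-bit hashes."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2.0 ** bits))

for bits in (16, 32, 64):
    print(f"{bits}-bit, n=1000: {collision_prob(1000, bits):.2e}")
# 16-bit collisions are near-certain at n=1000; 64-bit are vanishingly rare.
```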
