
Heuristic:ChenghaoMou Text dedup SimHash Optimization Ceiling

From Leeroopedia
Knowledge Sources
Domains Optimization, SimHash, Performance
Last Updated 2026-02-14 21:00 GMT

Overview

SimHash computation has reached its optimization ceiling: Cython ports and pure-NumPy rewrites were both tested and provided no measurable improvement over the current bitarray-plus-NumPy implementation.

Description

The `compute()` function in SimHash converts hash lists into a fingerprint by summing bit vectors and thresholding. The developer explicitly documented in the source code that:

  1. Porting to Cython did not improve performance.
  2. Others experimented with NumPy types and operators but found no improvement.

The current implementation uses `np.asarray()` to convert bitarray lists to a NumPy integer array, performs vectorized sum and threshold, and converts back. This is already close to optimal for the operation.
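The convert-sum-threshold pipeline can be sketched with plain lists of 0/1 bits standing in for the bitarray rows. This is a simplified illustration of the operation the text describes, not the library's actual code; the function name `compute_sketch` is illustrative:

```python
import numpy as np

def compute_sketch(bit_rows):
    """Majority-vote bit aggregation: each row is one hash's bits.
    Mapping bits {0, 1} -> {-1, +1}, summing column-wise, and
    thresholding at zero yields one fingerprint bit per column."""
    sigs = np.asarray(bit_rows, dtype=int)  # shape: (num_hashes, f)
    return (np.sum(2 * sigs - 1, axis=0) > 0).astype(int).tolist()

# Three 4-bit hashes; each output bit is the per-column majority vote.
print(compute_sketch([[1, 0, 1, 1],
                      [1, 1, 0, 1],
                      [0, 0, 1, 1]]))  # -> [1, 0, 1, 1]
```

The whole operation is two vectorized passes over a small integer matrix, which is why there is little room left for a lower-level rewrite to win.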

Usage

Apply this knowledge when considering performance improvements to SimHash. Do not invest effort in rewriting the `compute()` function in Cython, C extensions, or pure NumPy. The bottleneck is elsewhere (likely I/O and tokenization). If SimHash performance is insufficient, consider reducing the fingerprint size (`f`) or using MinHash instead.
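To see why reducing `f` helps downstream, note that fingerprint comparison is a Hamming-distance check whose cost and storage scale linearly with `f`. The helper `hamming` below is hypothetical (the library's actual comparison API is not shown in the source); this is only a sketch of the trade-off:

```python
import numpy as np

def hamming(a, b):
    # Number of differing bits between two boolean fingerprints.
    # Work and storage both scale linearly with f, so halving f
    # (e.g. 128 -> 64) halves the per-comparison cost.
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
f = 64  # a reduced fingerprint size
a = rng.integers(0, 2, size=f).astype(bool)
b = a.copy()
b[:3] = ~b[:3]  # flip three bits to simulate a near-duplicate
print(hamming(a, b))  # -> 3
```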

The Insight (Rule of Thumb)

  • Action: Do not attempt to optimize the SimHash `compute()` function further.
  • Value: The current implementation uses `np.asarray` + `np.where` + `np.sum`, which is already efficient.
  • Trade-off: The function is simple and correct. Optimization attempts add complexity without measurable benefit.
  • Alternative: To improve overall SimHash performance, optimize tokenization or reduce `f` (fingerprint size).

Reasoning

The `compute()` function performs a small number of operations: it converts bit arrays to integer arrays, sums them column-wise, and thresholds at zero. This is a memory-bound operation rather than compute-bound, which explains why Cython (which primarily helps with compute-bound loops) provided no benefit. NumPy already handles the vectorized operations efficiently. The function is called once per document, and the actual hashing of individual n-grams (via xxh3) is already handled by a fast C extension.
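A quick micro-benchmark illustrates the point: the column-wise majority vote over a realistic hash matrix is a single cheap vectorized pass, so there is almost nothing for Cython to speed up. The matrix shape below (256 n-gram hashes, f = 64) is an illustrative assumption, not a figure from the source:

```python
import timeit
import numpy as np

def aggregate(sigs):
    # The core of the compute() step: map {0,1} -> {-1,+1},
    # sum per column, threshold at zero.
    return np.sum(2 * sigs - 1, axis=0) > 0

rng = np.random.default_rng(0)
sigs = rng.integers(0, 2, size=(256, 64))  # 256 hashes, f = 64 (assumed)

t = timeit.timeit(lambda: aggregate(sigs), number=1000)
print(f"{t / 1000 * 1e6:.1f} us per call")  # typically a few microseconds
```

Because each call touches only a few kilobytes of data once, the cost is dominated by memory traffic and NumPy dispatch overhead, consistent with the developer's observation that compiled rewrites did not help.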

Code Evidence

Developer note in source from `src/text_dedup/config/algorithms/simhash.py:222-227`:

def compute(hashes: list[bitarray]) -> bitarray:
    """
    Compute the Simhash of a list of hashes.

    Notes to myself: You tried porting this to Cython, but it didn't improve the performance.
    Others have experimented with numpy types and operators, but it didn't improve performance

The implementation itself from `src/text_dedup/config/algorithms/simhash.py:249-253`:

sigs = np.asarray([h.tolist() for h in hashes], dtype=int)
sig = np.where(np.sum(2 * sigs - 1, axis=0) > 0, True, False)
res = bitarray()
res.extend(sig.tolist())
return res
