
Heuristic:ChenghaoMou Text dedup SimHash Optimization Ceiling

From Leeroopedia
Knowledge Sources
Domains Optimization, SimHash, Performance
Last Updated 2026-02-14 21:00 GMT

Overview

SimHash computation has reached its optimization ceiling: Cython ports and pure-NumPy rewrites were both tested and provided no measurable improvement over the current bitarray-plus-NumPy implementation.

Description

The `compute()` function in SimHash converts hash lists into a fingerprint by summing bit vectors and thresholding. The developer explicitly documented in the source code that:

  1. Porting to Cython did not improve performance.
  2. Others experimented with NumPy types and operators but found no improvement.

The current implementation uses `np.asarray()` to convert bitarray lists to a NumPy integer array, performs vectorized sum and threshold, and converts back. This is already close to optimal for the operation.
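The convert-sum-threshold pipeline can be sketched with plain lists of 0/1 bits standing in for the bitarray rows. This is a simplified illustration of the operation the text describes, not the library's actual code; the function name `compute_sketch` is illustrative:

```python
import numpy as np

def compute_sketch(bit_rows):
    """Majority-vote bit aggregation: each row is one hash's bits.
    Mapping bits {0, 1} -> {-1, +1}, summing column-wise, and
    thresholding at zero yields one fingerprint bit per column."""
    sigs = np.asarray(bit_rows, dtype=int)  # shape: (num_hashes, f)
    return (np.sum(2 * sigs - 1, axis=0) > 0).astype(int).tolist()

# Three 4-bit hashes; each output bit is the per-column majority vote.
print(compute_sketch([[1, 0, 1, 1],
                      [1, 1, 0, 1],
                      [0, 0, 1, 1]]))  # -> [1, 0, 1, 1]
```

The whole operation is two vectorized passes over a small integer matrix, which is why there is little room left for a lower-level rewrite to win.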

Usage

Apply this knowledge when considering performance improvements to SimHash. Do not invest effort in rewriting the `compute()` function in Cython, C extensions, or pure NumPy. The bottleneck is elsewhere (likely I/O and tokenization). If SimHash performance is insufficient, consider reducing the fingerprint size (`f`) or using MinHash instead.
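To see why reducing `f` helps downstream, note that fingerprint comparison is a Hamming-distance check whose cost and storage scale linearly with `f`. The helper `hamming` below is hypothetical (the library's actual comparison API is not shown in the source); this is only a sketch of the trade-off:

```python
import numpy as np

def hamming(a, b):
    # Number of differing bits between two boolean fingerprints.
    # Work and storage both scale linearly with f, so halving f
    # (e.g. 128 -> 64) halves the per-comparison cost.
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
f = 64  # a reduced fingerprint size
a = rng.integers(0, 2, size=f).astype(bool)
b = a.copy()
b[:3] = ~b[:3]  # flip three bits to simulate a near-duplicate
print(hamming(a, b))  # -> 3
```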

The Insight (Rule of Thumb)

  • Action: Do not attempt to optimize the SimHash `compute()` function further.
  • Value: The current implementation uses `np.asarray` + `np.where` + `np.sum`, which is already efficient.
  • Trade-off: The function is simple and correct. Optimization attempts add complexity without measurable benefit.
  • Alternative: To improve overall SimHash performance, optimize tokenization or reduce `f` (fingerprint size).

Reasoning

The `compute()` function performs a small number of operations: it converts bit arrays to integer arrays, sums them column-wise, and thresholds at zero. This is a memory-bound operation rather than compute-bound, which explains why Cython (which primarily helps with compute-bound loops) provided no benefit. NumPy already handles the vectorized operations efficiently. The function is called once per document, and the actual hashing of individual n-grams (via xxh3) is already handled by a fast C extension.
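A quick micro-benchmark illustrates the point: the column-wise majority vote over a realistic hash matrix is a single cheap vectorized pass, so there is almost nothing for Cython to speed up. The matrix shape below (256 n-gram hashes, f = 64) is an illustrative assumption, not a figure from the source:

```python
import timeit
import numpy as np

def aggregate(sigs):
    # The core of the compute() step: map {0,1} -> {-1,+1},
    # sum per column, threshold at zero.
    return np.sum(2 * sigs - 1, axis=0) > 0

rng = np.random.default_rng(0)
sigs = rng.integers(0, 2, size=(256, 64))  # 256 hashes, f = 64 (assumed)

t = timeit.timeit(lambda: aggregate(sigs), number=1000)
print(f"{t / 1000 * 1e6:.1f} us per call")  # typically a few microseconds
```

Because each call touches only a few kilobytes of data once, the cost is dominated by memory traffic and NumPy dispatch overhead, consistent with the developer's observation that compiled rewrites did not help.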

Code Evidence

Developer note in source from `src/text_dedup/config/algorithms/simhash.py:222-227`:

def compute(hashes: list[bitarray]) -> bitarray:
    """
    Compute the Simhash of a list of hashes.

    Notes to myself: You tried porting this to Cython, but it didn't improve the performance.
    Others have experimented with numpy types and operators, but it didn't improve performance

The implementation itself from `src/text_dedup/config/algorithms/simhash.py:249-253`:

sigs = np.asarray([h.tolist() for h in hashes], dtype=int)
sig = np.where(np.sum(2 * sigs - 1, axis=0) > 0, True, False)
res = bitarray()
res.extend(sig.tolist())
return res
