Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:ChenghaoMou Text dedup Hamming Distance Clustering

From Leeroopedia
Knowledge Sources
Domains Clustering, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

A clustering technique that uses bit-permutation bucketing to find SimHash fingerprints within a specified Hamming distance, then merges them into clusters using Union-Find.

Description

After SimHash fingerprinting produces binary signatures with bit-permutation bucket keys, clustering proceeds by: (1) inserting each document's (key, signature) into a hash table keyed by bucket keys, (2) for each new document, comparing its signature against all existing documents in the same bucket using XOR and popcount to compute Hamming distance, (3) if the Hamming distance is <= bit_diff, merging the two documents into the same cluster via Union-Find.

The bit-permutation scheme guarantees that any two fingerprints differing in at most k bits will share at least one bucket key (where b-k blocks match exactly), ensuring no true positives are missed by the bucketing step.

Usage

Use this principle after SimHash fingerprinting when documents need to be grouped into near-duplicate clusters based on Hamming distance.

Theoretical Basis

The bucketing scheme from Manku et al. (2007):

Given b blocks and allowed k bit differences, generate all C(b, b-k) permutations. For each permutation, the first b-k blocks form a common prefix. Two fingerprints are candidates if they share any prefix.

# Abstract Hamming clustering (NOT real implementation)
buckets = defaultdict(list)
uf = UnionFind()
for doc_idx, key, signature in documents:
    for other_idx, other_sig in buckets[key]:
        if hamming_distance(signature, other_sig) <= bit_diff:
            uf.union(doc_idx, other_idx)
    buckets[key].append((doc_idx, signature))
clusters = uf.get_clusters()

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment