Principle:ChenghaoMou Text dedup Hamming Distance Clustering
| Knowledge Sources | |
|---|---|
| Domains | Clustering, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A clustering technique that uses bit-permutation bucketing to find SimHash fingerprints within a specified Hamming distance, then merges them into clusters using Union-Find.
Description
After SimHash fingerprinting produces binary signatures with bit-permutation bucket keys, clustering proceeds by: (1) inserting each document's (key, signature) into a hash table keyed by bucket keys, (2) for each new document, comparing its signature against all existing documents in the same bucket using XOR and popcount to compute Hamming distance, (3) if the Hamming distance is <= bit_diff, merging the two documents into the same cluster via Union-Find.
The bit-permutation scheme guarantees that any two fingerprints differing in at most k bits will share at least one bucket key (where b-k blocks match exactly), ensuring no true positives are missed by the bucketing step.
Usage
Use this principle after SimHash fingerprinting when documents need to be grouped into near-duplicate clusters based on Hamming distance.
Theoretical Basis
The bucketing scheme from Manku et al. (2007):
Given b blocks and allowed k bit differences, generate all C(b, b-k) permutations. For each permutation, the first b-k blocks form a common prefix. Two fingerprints are candidates if they share any prefix.
# Abstract Hamming clustering (NOT real implementation)
buckets = defaultdict(list)
uf = UnionFind()
for doc_idx, key, signature in documents:
for other_idx, other_sig in buckets[key]:
if hamming_distance(signature, other_sig) <= bit_diff:
uf.union(doc_idx, other_idx)
buckets[key].append((doc_idx, signature))
clusters = uf.get_clusters()