Principle:ChenghaoMou Text dedup Hamming Distance Clustering

Knowledge Sources	Detecting Near-Duplicates for Web Crawling text-dedup
Domains	Clustering, Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

A clustering technique that uses bit-permutation bucketing to find SimHash fingerprints within a specified Hamming distance, then merges them into clusters using Union-Find.

Description

After SimHash fingerprinting produces binary signatures with bit-permutation bucket keys, clustering proceeds by: (1) inserting each document's (key, signature) into a hash table keyed by bucket keys, (2) for each new document, comparing its signature against all existing documents in the same bucket using XOR and popcount to compute Hamming distance, (3) if the Hamming distance is <= bit_diff, merging the two documents into the same cluster via Union-Find.

The bit-permutation scheme guarantees that any two fingerprints differing in at most k bits will share at least one bucket key (where b-k blocks match exactly), ensuring no true positives are missed by the bucketing step.

Usage

Use this principle after SimHash fingerprinting when documents need to be grouped into near-duplicate clusters based on Hamming distance.

Theoretical Basis

The bucketing scheme from Manku et al. (2007):

Given b blocks and allowed k bit differences, generate all C(b, b-k) permutations. For each permutation, the first b-k blocks form a common prefix. Two fingerprints are candidates if they share any prefix.

# Abstract Hamming clustering (NOT real implementation)
buckets = defaultdict(list)
uf = UnionFind()
for doc_idx, key, signature in documents:
    for other_idx, other_sig in buckets[key]:
        if hamming_distance(signature, other_sig) <= bit_diff:
            uf.union(doc_idx, other_idx)
    buckets[key].append((doc_idx, signature))
clusters = uf.get_clusters()

Related Pages

Implemented By

Implementation:ChenghaoMou_Text_dedup_SimHash_Union_Find_Cluster

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment