Implementation:ChenghaoMou Text dedup SimHash Union Find Cluster
| Knowledge Sources | |
|---|---|
| Domains | Clustering, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool, provided by text-dedup, for clustering SimHash fingerprints using bit-permutation bucketing and Union-Find.
Description
The cluster function in simhash.py iterates over all documents sequentially and places each one into buckets keyed by its __key__ tuples. Within a bucket, it compares the incoming document's signature against the signatures already stored there, using frozenbitarray XOR and popcount. If (sig ^ other_sig).count(1) <= bit_diff, it calls UnionFind.union() to merge the two documents. After all documents are processed, get_clusters() returns the final cluster mapping, restricted to non-trivial entries (child != parent).
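The bucketing and merging steps described above can be sketched in plain Python. This is a minimal illustration, not the text-dedup source: signatures are plain integers rather than frozenbitarray objects, the UnionFind class is a hypothetical stand-in for the library's own, and the record layout (index, list of bucket keys, signature) and the name cluster_sketch are assumptions made for the example.

```python
from collections import defaultdict


class UnionFind:
    """Minimal union-find; a hypothetical stand-in for text-dedup's class."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        # Walk up to the root of x's tree.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller index as the root (an arbitrary choice here).
            self.parent[max(ra, rb)] = min(ra, rb)


def hamming(a: int, b: int) -> int:
    """Number of differing bits, i.e. popcount of the XOR."""
    return bin(a ^ b).count("1")


def cluster_sketch(records, bit_diff):
    """records: iterable of (index, bucket_keys, signature) tuples (assumed layout)."""
    uf = UnionFind()
    buckets = defaultdict(list)
    for idx, keys, sig in records:
        uf.find(idx)  # register the document even if it never merges
        for key in keys:
            # Compare only against documents already in the same bucket.
            for other_idx, other_sig in buckets[key]:
                if hamming(sig, other_sig) <= bit_diff:
                    uf.union(idx, other_idx)
            buckets[key].append((idx, sig))
    # Keep only non-trivial entries (child != parent).
    return {i: uf.find(i) for i in list(uf.parent) if uf.find(i) != i}


records = [
    (0, [(0, 0b1010)], 0b10101010),
    (1, [(0, 0b1010)], 0b10101011),  # same bucket, 1 bit away from document 0
    (2, [(0, 0b1010)], 0b01011010),  # same bucket, 4 bits away from document 0
    (3, [(0, 0b0001)], 0b00010000),  # different bucket, never compared
]
parents = cluster_sketch(records, bit_diff=3)
```

With bit_diff=3, only documents 0 and 1 are merged, so the mapping contains a single entry for document 1. Bucketing keeps the number of pairwise comparisons small: documents that share no bucket key are never compared.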
Usage
Use this after SimHash fingerprinting to form duplicate clusters.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/simhash.py
- Lines: L43-91
Signature
def cluster(config: Config, ds: Dataset) -> dict[int, int]:
    """Cluster SimHash fingerprints using bucketing + Union-Find.

    Parameters
    ----------
    config : Config
        Pipeline configuration with SimHash settings.
    ds : Dataset
        Fingerprinted dataset with __key__, __val__, __INDEX__.

    Returns
    -------
    dict[int, int]
        Mapping child_index → parent_index (non-trivial only).
    """
def assign(config: Config, ds: Dataset, parents: dict[int, int]) -> Dataset:
    """Assign cluster id and duplicate flag to the dataset.

    Parameters
    ----------
    config : Config
        Pipeline configuration.
    ds : Dataset
        Original dataset.
    parents : dict[int, int]
        Cluster mapping from cluster().

    Returns
    -------
    Dataset
        Dataset with __CLUSTER__ and __duplicate__ columns.
    """
Import
from text_dedup.simhash import cluster, assign
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Config | Yes | Pipeline configuration |
| ds | Dataset | Yes | Fingerprinted dataset with __key__, __val__, __INDEX__ |
Outputs
| Name | Type | Description |
|---|---|---|
| cluster() returns | dict[int, int] | Mapping child_index → parent_index |
| assign() returns | Dataset | Dataset with __CLUSTER__ and __duplicate__ columns |
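To make the relationship between the two outputs concrete, the sketch below shows one way a child → parent mapping could be turned into per-row cluster ids and duplicate flags. It is an assumption-laden illustration rather than the library's assign(): the function name assign_sketch is hypothetical, rows are plain dicts instead of a Dataset, and the rule "a row is a duplicate if it appears as a child in the mapping" is an assumed reading of the non-trivial mapping.

```python
def assign_sketch(indices, parents):
    """Attach a cluster id and a duplicate flag to each index (hypothetical helper)."""
    rows = []
    for idx in indices:
        rows.append(
            {
                "__INDEX__": idx,
                # Documents absent from the mapping form their own cluster.
                "__CLUSTER__": parents.get(idx, idx),
                # Assumption: only children (child != parent) count as duplicates.
                "__duplicate__": idx in parents,
            }
        )
    return rows


rows = assign_sketch(range(4), {1: 0})
kept = [row["__INDEX__"] for row in rows if not row["__duplicate__"]]
```

Under these assumptions, filtering on the duplicate flag keeps one representative per cluster: here document 1 is dropped and documents 0, 2, and 3 remain.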
Usage Examples
Clustering SimHash Signatures
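from pathlib import Path

from text_dedup.config.base import load_config_from_toml
from text_dedup.simhash import assign, cluster, fingerprint, load_and_preprocess

# Load the pipeline configuration from a TOML file.
config = load_config_from_toml(Path("configs/simhash.toml"))

# Load the dataset, then compute SimHash fingerprints for it.
ds, _ = load_and_preprocess(config)
embedded = fingerprint(config, ds)

# Cluster the fingerprints, then write the result back onto the original dataset.
assignment = cluster(config, embedded)
ds = assign(config, ds, assignment)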