Implementation:ChenghaoMou Text dedup SimHash Union Find Cluster

From Leeroopedia
Knowledge Sources
Domains Clustering, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

A concrete tool, provided by text-dedup, for clustering SimHash fingerprints using bit-permutation bucketing and Union-Find.

Description

The cluster function in simhash.py iterates over all documents sequentially, placing each one into buckets keyed by its __key__ tuples. Within each bucket, it compares the new document's signature against the signatures already stored there, using frozenbitarray XOR followed by a popcount. If (sig ^ other_sig).count(1) <= bit_diff, that is, if the two signatures differ in at most bit_diff bit positions, it calls UnionFind.union() to merge the two documents into one cluster. After all documents have been processed, get_clusters() returns the final cluster mapping, restricted to non-trivial entries (child != parent).
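The procedure described above can be sketched in plain Python. This is a minimal illustration, not the text-dedup source: it uses integers instead of frozenbitarray, and the UnionFind class, hamming(), and cluster_sketch() are hypothetical stand-ins written for this example. It also assumes the smaller index becomes the cluster representative, which the real implementation may or may not do.

```python
from collections import defaultdict


class UnionFind:
    """Tiny union-find with path compression (illustrative stand-in)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walked path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Assumption for this sketch: the smaller index is the representative.
            self.parent[max(ra, rb)] = min(ra, rb)


def hamming(a: int, b: int) -> int:
    """Number of differing bits: XOR, then popcount."""
    return bin(a ^ b).count("1")


def cluster_sketch(records, bit_diff):
    """records: iterable of (index, keys, signature) triples."""
    buckets = defaultdict(list)
    uf = UnionFind()
    for idx, keys, sig in records:
        uf.find(idx)  # register the document even if it never merges
        for key in keys:
            # Compare only against documents already in the same bucket.
            for other_idx, other_sig in buckets[key]:
                if hamming(sig, other_sig) <= bit_diff:
                    uf.union(idx, other_idx)
            buckets[key].append((idx, sig))
    # Keep only non-trivial entries (child != parent).
    return {i: uf.find(i) for i in list(uf.parent) if uf.find(i) != i}


records = [
    (0, [(0, 0b1111), (1, 0b0000)], 0b11110000),
    (1, [(0, 0b1111), (1, 0b0001)], 0b11110001),
    (2, [(0, 0b0000), (1, 0b1111)], 0b00001111),
    (3, [(0, 0b1111), (1, 0b0011)], 0b11110011),
]
print(cluster_sketch(records, bit_diff=1))  # {1: 0, 3: 0}
```

Document 3 is two bits away from document 0, yet both end up in the same cluster because each is within one bit of document 1. Union-Find makes the clusters transitive in this way, which is worth keeping in mind when choosing bit_diff.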

Usage

Use this after SimHash fingerprinting to form duplicate clusters.
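Clustering depends on the bucket keys that fingerprinting attaches to each document. As a rough illustration of why bucketing finds near-duplicates, the sketch below splits a signature into contiguous blocks and uses each (block_index, block_value) pair as a key. This is a simplified block-splitting scheme chosen for the example; block_keys() is a hypothetical helper, and the permutation scheme text-dedup actually uses may differ. The underlying guarantee is the pigeonhole principle: if two signatures differ in at most k bits and the bits are split into k + 1 blocks, at least one block is identical in both, so the pair shares a bucket.

```python
def block_keys(sig: int, f: int, num_blocks: int):
    """Split an f-bit signature into contiguous blocks.

    Returns a list of (block_index, block_value) tuples, most significant
    block first. Hypothetical helper for illustration only.
    """
    base, extra = divmod(f, num_blocks)
    keys = []
    shift = f
    for i in range(num_blocks):
        # The first `extra` blocks get one additional bit when f is not
        # evenly divisible by num_blocks.
        width = base + (1 if i < extra else 0)
        shift -= width
        keys.append((i, (sig >> shift) & ((1 << width) - 1)))
    return keys


a = 0b11110000
b = 0b11110001  # differs from a in exactly one bit
bit_diff = 1

keys_a = block_keys(a, f=8, num_blocks=bit_diff + 1)
keys_b = block_keys(b, f=8, num_blocks=bit_diff + 1)
print(keys_a)  # [(0, 15), (1, 0)]
print(keys_b)  # [(0, 15), (1, 1)]
print(set(keys_a) & set(keys_b))  # {(0, 15)}
```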

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/simhash.py
  • Lines: L43-91

Signature

def cluster(config: Config, ds: Dataset) -> dict[int, int]:
    """Cluster SimHash fingerprints using bucketing + Union-Find.

    Parameters
    ----------
    config : Config
        Pipeline configuration with SimHash settings.
    ds : Dataset
        Fingerprinted dataset with __key__, __val__, __INDEX__.

    Returns
    -------
    dict[int, int]
        Mapping child_index → parent_index (non-trivial only).
    """

def assign(config: Config, ds: Dataset, parents: dict[int, int]) -> Dataset:
    """Assign cluster id and duplicate flag to the dataset.

    Parameters
    ----------
    config : Config
        Pipeline configuration.
    ds : Dataset
        Original dataset.
    parents : dict[int, int]
        Cluster mapping from cluster().

    Returns
    -------
    Dataset
        Dataset with __CLUSTER__ and __duplicate__ columns.
    """

Import

from text_dedup.simhash import cluster, assign

I/O Contract

Inputs

Name Type Required Description
config Config Yes Pipeline configuration
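ds Dataset Yes Fingerprinted dataset with __key__, __val__, __INDEX__ (cluster()); original dataset (assign())
parents dict[int, int] Yes (assign() only) Cluster mapping from cluster()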

Outputs

Name Type Description
cluster() returns dict[int, int] Mapping child_index → parent_index
assign() returns Dataset Dataset with __CLUSTER__ and __duplicate__ columns
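To make the child → parent mapping concrete, the sketch below shows one plausible way to turn it into per-row labels. It assumes that a row missing from the mapping is its own cluster representative and that a row counts as a duplicate exactly when it appears as a child. label_rows() is a hypothetical helper, and assign() in text-dedup may derive its columns differently.

```python
def label_rows(num_rows: int, parents: dict[int, int]):
    """Derive a cluster id and a duplicate flag for each row index."""
    # Rows absent from the mapping are treated as their own representative.
    cluster_ids = [parents.get(i, i) for i in range(num_rows)]
    # Assumption: only child rows are flagged as duplicates.
    is_duplicate = [i in parents for i in range(num_rows)]
    return cluster_ids, is_duplicate


parents = {1: 0, 3: 0}  # rows 1 and 3 were merged into row 0
cluster_ids, is_duplicate = label_rows(4, parents)
print(cluster_ids)   # [0, 0, 2, 0]
print(is_duplicate)  # [False, True, False, True]
```

Under these assumptions, dropping the flagged rows keeps exactly one document per cluster.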

Usage Examples

Clustering SimHash Signatures

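from pathlib import Path

from text_dedup.config.base import load_config_from_toml
from text_dedup.simhash import assign, cluster, fingerprint, load_and_preprocess

# Load the pipeline configuration, including the SimHash settings.
config = load_config_from_toml(Path("configs/simhash.toml"))

# Load the dataset; the second return value is not needed here.
ds, _ = load_and_preprocess(config)

# Compute the SimHash fingerprints that cluster() buckets and compares.
embedded = fingerprint(config, ds)

# Build the child_index -> parent_index mapping, then label the original dataset.
assignment = cluster(config, embedded)
ds = assign(config, ds, assignment)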

Related Pages

Implements Principle

Requires Environment
