Implementation:ChenghaoMou Text dedup MinHash LSH Cluster
| Knowledge Sources | |
|---|---|
| Domains | Clustering, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool, provided by text-dedup, for LSH banding, candidate generation, and connected-component clustering of MinHash fingerprints, built on Polars and polars_grouper.
Description
The cluster function in minhash.py converts a fingerprinted Dataset into clusters by: (1) converting it to a Polars DataFrame, (2) grouping by (band_idx, band_val) to find candidate groups, (3) keeping only groups that contain more than one document, (4) generating all document pairs within each group via a self-join, (5) finding connected components over those pairs using polars_grouper.super_merger, and (6) returning a mapping from document index to cluster representative.
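The six steps above can be sketched in plain Python. This is an illustration only, not the text-dedup implementation: it replaces the Polars group-by and self-join with a dictionary of buckets, and replaces polars_grouper.super_merger with a small union-find. The function name `cluster_sketch`, the `(doc_index, band_idx, band_val)` row layout, and the choice of the smallest index as cluster representative are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def cluster_sketch(rows):
    """Hypothetical stand-in for cluster(); rows are (doc_index, band_idx, band_val)."""
    # Steps 1-2: bucket documents by their (band_idx, band_val) key.
    buckets = defaultdict(set)
    for doc, band_idx, band_val in rows:
        buckets[(band_idx, band_val)].add(doc)

    # Union-find used here in place of polars_grouper.super_merger.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption: the smallest index in a component is its representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for docs in buckets.values():
        # Step 3: a bucket with a single document yields no candidates.
        if len(docs) < 2:
            continue
        # Steps 4-5: every pair in a bucket is an edge; merge its endpoints.
        for a, b in combinations(sorted(docs), 2):
            union(a, b)

    # Step 6: map every document that had a candidate partner to its representative.
    return {doc: find(doc) for doc in parent}


rows = [
    (0, 0, "a"), (1, 0, "a"),  # docs 0 and 1 collide in band 0
    (1, 1, "b"), (2, 1, "b"),  # docs 1 and 2 collide in band 1
    (3, 0, "c"),               # doc 3 collides with nothing
    (4, 2, "z"), (5, 2, "z"),  # docs 4 and 5 collide in band 2
]
parents = cluster_sketch(rows)
print(parents)  # {0: 0, 1: 0, 2: 0, 4: 4, 5: 4}
```

Documents 0 and 2 never share a band, yet they land in the same cluster because both collide with document 1. That transitive merging is what the connected-components step contributes. In this sketch, document 3 is absent from the result because it never appears in a candidate pair.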
The assign function then maps this cluster assignment back onto the original Dataset, adding __CLUSTER__ and __duplicate__ columns.
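A minimal sketch of that mapping step, using plain dictionaries instead of a Dataset. The helper `assign_sketch` is hypothetical, and two behaviours are assumptions rather than confirmed details of text-dedup: documents missing from the mapping are treated as singleton clusters, and a document is flagged as a duplicate when its cluster representative differs from its own index.

```python
def assign_sketch(indices, parents):
    """Hypothetical stand-in for assign(); returns one dict per document."""
    rows = []
    for idx in indices:
        # Assumption: a document with no entry in the mapping is its own cluster.
        cluster_id = parents.get(idx, idx)
        rows.append(
            {
                "__INDEX__": idx,
                "__CLUSTER__": cluster_id,
                # Assumption: only non-representative members count as duplicates.
                "__duplicate__": cluster_id != idx,
            }
        )
    return rows


parents = {0: 0, 1: 0, 2: 0, 4: 4, 5: 4}
assigned = assign_sketch(range(6), parents)

# Keeping the rows that are not flagged leaves one document per cluster.
kept = [row["__INDEX__"] for row in assigned if not row["__duplicate__"]]
print(kept)  # [0, 3, 4]
```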
Usage
Use this after MinHash fingerprinting to group candidate duplicate documents into clusters.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/minhash.py
- Lines: L48-91
Signature
def cluster(config: Config, ds: Dataset) -> dict[int, int]:
    """Cluster the dataset using LSH banding and connected components.

    Parameters
    ----------
    config : Config
        Configuration with algorithm settings.
    ds : Dataset
        Fingerprinted dataset with __band_idx__, __band_val__, __INDEX__.

    Returns
    -------
    dict[int, int]
        Mapping from document index to cluster group ID.
    """

def assign(config: Config, ds: Dataset, parents: dict[int, int]) -> Dataset:
    """Assign cluster id to the dataset.

    Parameters
    ----------
    config : Config
        Configuration with algorithm settings.
    ds : Dataset
        Original dataset.
    parents : dict[int, int]
        Cluster mapping from cluster().

    Returns
    -------
    Dataset
        Dataset with __CLUSTER__ and __duplicate__ columns added.
    """
Import
from text_dedup.minhash import cluster, assign
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Config | Yes | Pipeline configuration |
| ds | Dataset | Yes | Fingerprinted dataset with __band_idx__, __band_val__, __INDEX__ for cluster(); the original dataset for assign() |
| parents | dict[int, int] | Yes (assign() only) | Cluster mapping returned by cluster() |
Outputs
| Name | Type | Description |
|---|---|---|
| cluster() returns | dict[int, int] | Mapping document_index → cluster_group_id |
| assign() returns | Dataset | Dataset with __CLUSTER__ and __duplicate__ columns |
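To make the shape of the `dict[int, int]` output concrete, the sketch below inverts such a mapping into per-cluster member lists and counts how many documents would be dropped if one document per cluster were kept. The helper `group_members` and the example mapping are hypothetical. The sketch also assumes that each representative appears in the mapping as its own key, which the source does not state.

```python
from collections import defaultdict


def group_members(parents):
    """Invert a document -> representative mapping into representative -> members."""
    members = defaultdict(list)
    for doc, representative in parents.items():
        members[representative].append(doc)
    return {rep: sorted(docs) for rep, docs in members.items()}


# Hypothetical mapping in the document_index -> cluster_group_id shape.
example_parents = {0: 0, 1: 0, 2: 0, 4: 4, 5: 4}
groups = group_members(example_parents)
print(groups)  # {0: [0, 1, 2], 4: [4, 5]}

# Keeping one document per cluster removes every other member.
removable = sum(len(docs) - 1 for docs in groups.values())
print(removable)  # 3
```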
Usage Examples
Clustering MinHash Signatures
from pathlib import Path

from text_dedup.config.base import load_config_from_toml
from text_dedup.minhash import assign, cluster, fingerprint, load_and_preprocess

# Load the pipeline configuration and the preprocessed dataset
config = load_config_from_toml(Path("configs/minhash.toml"))
ds, _, _ = load_and_preprocess(config)
# Compute the MinHash band fingerprints for every document
embedded = fingerprint(config, ds)
# Find clusters via LSH banding + connected components
assignment = cluster(config, embedded)
# Map the cluster assignment back onto the original dataset
ds = assign(config, ds, assignment)
print(ds.column_names)  # [..., '__CLUSTER__', '__duplicate__']