Principle:ChenghaoMou Text dedup LSH Banding And Clustering
| Knowledge Sources | |
|---|---|
| Domains | Hashing, Clustering, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A banding technique for Locality-Sensitive Hashing that groups documents sharing identical band signatures and then merges groups into connected components to form duplicate clusters.
Description
LSH Banding and Clustering takes MinHash signatures split into b bands of r rows each and groups documents that share at least one identical band. The probability that two documents with Jaccard similarity s become candidates is 1 - (1 - s^r)^b, creating an S-curve that sharply separates similar from dissimilar pairs.
The clustering step converts candidate pairs into connected components using the polars_grouper.super_merger algorithm, which efficiently finds connected components in a graph of (src, dst) document pairs. Each connected component becomes a cluster, with one document designated as the representative (non-duplicate).
The optimal b and r parameters are computed by optimal_param, which minimizes the weighted sum of false positive and false negative areas under the S-curve using numerical integration.
Usage
Use this principle after MinHash fingerprinting when candidate pairs need to be identified and grouped into clusters. The banding parameters (bands, rows) control the precision/recall tradeoff of duplicate detection.
Theoretical Basis
The LSH banding probability:
Where s is the true Jaccard similarity, b is the number of bands, and r is the number of rows per band.
Optimal parameter selection minimizes:
Where t is the similarity threshold and w are the weights for false positives and false negatives.
# Abstract LSH clustering (NOT real implementation)
# 1. Group by band: documents with same (band_idx, band_val) are candidates
candidates = signatures.group_by(["band_idx", "band_val"])
# 2. Generate all pairs within each group
pairs = generate_pairs(candidates)
# 3. Find connected components
clusters = connected_components(pairs)