Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:ChenghaoMou Text dedup LSH Banding And Clustering

From Leeroopedia
Knowledge Sources
Domains Hashing, Clustering, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

A banding technique for Locality-Sensitive Hashing that groups documents sharing identical band signatures and then merges groups into connected components to form duplicate clusters.

Description

LSH Banding and Clustering takes MinHash signatures split into b bands of r rows each and groups documents that share at least one identical band. The probability that two documents with Jaccard similarity s become candidates is 1 - (1 - s^r)^b, creating an S-curve that sharply separates similar from dissimilar pairs.

The clustering step converts candidate pairs into connected components using the polars_grouper.super_merger algorithm, which efficiently finds connected components in a graph of (src, dst) document pairs. Each connected component becomes a cluster, with one document designated as the representative (non-duplicate).

The optimal b and r parameters are computed by optimal_param, which minimizes the weighted sum of false positive and false negative areas under the S-curve using numerical integration.

Usage

Use this principle after MinHash fingerprinting when candidate pairs need to be identified and grouped into clusters. The banding parameters (bands, rows) control the precision/recall tradeoff of duplicate detection.

Theoretical Basis

The LSH banding probability: P(candidate|s)=1(1sr)b

Where s is the true Jaccard similarity, b is the number of bands, and r is the number of rows per band.

Optimal parameter selection minimizes: error=wFP0tP(s)ds+wFNt1(1P(s))ds

Where t is the similarity threshold and w are the weights for false positives and false negatives.

# Abstract LSH clustering (NOT real implementation)
# 1. Group by band: documents with same (band_idx, band_val) are candidates
candidates = signatures.group_by(["band_idx", "band_val"])
# 2. Generate all pairs within each group
pairs = generate_pairs(candidates)
# 3. Find connected components
clusters = connected_components(pairs)

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment