Principle:ChenghaoMou Text dedup Union Find
| Knowledge Sources | |
|---|---|
| Domains | Data_Structures, Clustering |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A data structure that efficiently maintains disjoint sets and supports near-constant-time union and find operations for incremental clustering.
Description
Union-Find (Disjoint Set Union) is a fundamental data structure used in text-dedup to incrementally build clusters from pairwise similarity relationships. When two documents are found to be near-duplicates (e.g., by SimHash Hamming distance comparison), they are merged into the same set. The structure supports two key operations: find(x) returns the representative element (root) of the set containing x, and union(x, y) merges the sets containing x and y.
With path compression and union by rank optimizations, both operations achieve near-O(1) amortized time complexity (inverse Ackermann function). In text-dedup, Union-Find is used by: (1) SimHash clustering (primary), (2) SimHash false positive verification (re-clustering verified pairs).
Usage
Use this data structure when building clusters incrementally from pairwise relationships, particularly in the SimHash pipeline where documents are compared in buckets and merged on-the-fly.
Theoretical Basis
With path compression and union by rank:
Where α is the inverse Ackermann function, effectively constant for all practical inputs.
# Abstract Union-Find operations (NOT real implementation)
uf = UnionFind()
uf.union(doc_a, doc_b) # Merge sets containing a and b
uf.union(doc_b, doc_c) # Now a, b, c are in same set
root = uf.find(doc_a) # Returns the cluster representative
clusters = uf.get_clusters() # {element: root} for all elements