Principle:ChenghaoMou Text dedup Union Find

Knowledge Sources	Disjoint-Set Data Structure text-dedup
Domains	Data_Structures, Clustering
Last Updated	2026-02-14 21:00 GMT

Overview

A data structure that efficiently maintains disjoint sets and supports near-constant-time union and find operations for incremental clustering.

Description

Union-Find (Disjoint Set Union) is a fundamental data structure used in text-dedup to incrementally build clusters from pairwise similarity relationships. When two documents are found to be near-duplicates (e.g., by SimHash Hamming distance comparison), they are merged into the same set. The structure supports two key operations: find(x) returns the representative element (root) of the set containing x, and union(x, y) merges the sets containing x and y.

With path compression and union by rank optimizations, both operations achieve near-O(1) amortized time complexity (inverse Ackermann function). In text-dedup, Union-Find is used by: (1) SimHash clustering (primary), (2) SimHash false positive verification (re-clustering verified pairs).

Usage

Use this data structure when building clusters incrementally from pairwise relationships, particularly in the SimHash pipeline where documents are compared in buckets and merged on-the-fly.

Theoretical Basis

With path compression and union by rank: $Amortized time per operation = O (α (n))$

Where α is the inverse Ackermann function, effectively constant for all practical inputs.

# Abstract Union-Find operations (NOT real implementation)
uf = UnionFind()
uf.union(doc_a, doc_b)  # Merge sets containing a and b
uf.union(doc_b, doc_c)  # Now a, b, c are in same set
root = uf.find(doc_a)   # Returns the cluster representative
clusters = uf.get_clusters()  # {element: root} for all elements

Related Pages

Implemented By

Implementation:ChenghaoMou_Text_dedup_UnionFind_Class

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment