Implementation:ChenghaoMou Text dedup UnionFind Class
| Knowledge Sources | |
|---|---|
| Domains | Data_Structures, Clustering |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool for maintaining disjoint sets with path compression and union by rank for incremental clustering provided by text-dedup.
Description
The UnionFind class implements a disjoint set union data structure using two dictionaries: parent (maps element to its parent) and rank (maps element to its tree rank). Elements are lazily initialized on first access. Path compression is applied during find() to flatten the tree structure. Union by rank is applied during union() to keep trees balanced.
The get_clusters() method returns a complete mapping from every element to its cluster root by calling find() on each element.
Usage
Import this class when building clusters from pairwise duplicate relationships in the SimHash pipeline or during false positive verification re-clustering.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/utils/union_find.py
- Lines: L6-81
Signature
class UnionFind:
"""A Union-Find (Disjoint Set Union) data structure."""
def __init__(self) -> None:
self.parent: dict[int, int] = {}
self.rank: dict[int, int] = {}
def find(self, x: int) -> int:
"""Find root of set containing x with path compression."""
def union(self, x: int, y: int) -> None:
"""Union the sets containing x and y by rank."""
def reset(self) -> None:
"""Reset the Union-Find structure."""
def get_clusters(self) -> dict[int, int]:
"""Get cluster assignment for all elements.
Returns
-------
dict[int, int]
Mapping from element to its cluster root.
"""
Import
from text_dedup.utils.union_find import UnionFind
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| x, y | int | Yes | Element indices to find or union |
Outputs
| Name | Type | Description |
|---|---|---|
| find() returns | int | Root element of the set |
| get_clusters() returns | dict[int, int] | Complete mapping element → cluster root |
Usage Examples
Building Clusters from Pairs
from text_dedup.utils.union_find import UnionFind
uf = UnionFind()
# Merge duplicate pairs
uf.union(0, 1) # Doc 0 and Doc 1 are duplicates
uf.union(1, 2) # Doc 1 and Doc 2 are duplicates (transitively, 0-1-2)
uf.union(5, 6) # Doc 5 and Doc 6 are duplicates
clusters = uf.get_clusters()
# {0: root, 1: root, 2: root, 5: root2, 6: root2}
print(uf.find(0) == uf.find(2)) # True (same cluster)
print(uf.find(0) == uf.find(5)) # False (different clusters)