Implementation:ChenghaoMou Text dedup UnionFind Class

Knowledge Sources	text-dedup
Domains	Data_Structures, Clustering
Last Updated	2026-02-14 21:00 GMT

Overview

Concrete tool for maintaining disjoint sets with path compression and union by rank for incremental clustering provided by text-dedup.

Description

The UnionFind class implements a disjoint set union data structure using two dictionaries: parent (maps element to its parent) and rank (maps element to its tree rank). Elements are lazily initialized on first access. Path compression is applied during find() to flatten the tree structure. Union by rank is applied during union() to keep trees balanced.

The get_clusters() method returns a complete mapping from every element to its cluster root by calling find() on each element.

Usage

Import this class when building clusters from pairwise duplicate relationships in the SimHash pipeline or during false positive verification re-clustering.

Code Reference

Source Location

Repository: text-dedup
File: src/text_dedup/utils/union_find.py
Lines: L6-81

Signature

class UnionFind:
    """A Union-Find (Disjoint Set Union) data structure."""

    def __init__(self) -> None:
        self.parent: dict[int, int] = {}
        self.rank: dict[int, int] = {}

    def find(self, x: int) -> int:
        """Find root of set containing x with path compression."""

    def union(self, x: int, y: int) -> None:
        """Union the sets containing x and y by rank."""

    def reset(self) -> None:
        """Reset the Union-Find structure."""

    def get_clusters(self) -> dict[int, int]:
        """Get cluster assignment for all elements.

        Returns
        -------
        dict[int, int]
            Mapping from element to its cluster root.
        """

Import

from text_dedup.utils.union_find import UnionFind

I/O Contract

Inputs

Name	Type	Required	Description
x, y	int	Yes	Element indices to find or union

Outputs

Name	Type	Description
find() returns	int	Root element of the set
get_clusters() returns	dict[int, int]	Complete mapping element → cluster root

Usage Examples

Building Clusters from Pairs

from text_dedup.utils.union_find import UnionFind

uf = UnionFind()

# Merge duplicate pairs
uf.union(0, 1)  # Doc 0 and Doc 1 are duplicates
uf.union(1, 2)  # Doc 1 and Doc 2 are duplicates (transitively, 0-1-2)
uf.union(5, 6)  # Doc 5 and Doc 6 are duplicates

clusters = uf.get_clusters()
# {0: root, 1: root, 2: root, 5: root2, 6: root2}
print(uf.find(0) == uf.find(2))  # True (same cluster)
print(uf.find(0) == uf.find(5))  # False (different clusters)

Related Pages

Implements Principle

Principle:ChenghaoMou_Text_dedup_Union_Find

Requires Environment

Environment:ChenghaoMou_Text_dedup_Python_3_12_Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment