Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Implementation:ChenghaoMou Text dedup UnionFind Class

From Leeroopedia
Knowledge Sources
Domains Data_Structures, Clustering
Last Updated 2026-02-14 21:00 GMT

Overview

Concrete tool for maintaining disjoint sets with path compression and union by rank for incremental clustering provided by text-dedup.

Description

The UnionFind class implements a disjoint set union data structure using two dictionaries: parent (maps element to its parent) and rank (maps element to its tree rank). Elements are lazily initialized on first access. Path compression is applied during find() to flatten the tree structure. Union by rank is applied during union() to keep trees balanced.

The get_clusters() method returns a complete mapping from every element to its cluster root by calling find() on each element.

Usage

Import this class when building clusters from pairwise duplicate relationships in the SimHash pipeline or during false positive verification re-clustering.

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/utils/union_find.py
  • Lines: L6-81

Signature

class UnionFind:
    """A Union-Find (Disjoint Set Union) data structure."""

    def __init__(self) -> None:
        self.parent: dict[int, int] = {}
        self.rank: dict[int, int] = {}

    def find(self, x: int) -> int:
        """Find root of set containing x with path compression."""

    def union(self, x: int, y: int) -> None:
        """Union the sets containing x and y by rank."""

    def reset(self) -> None:
        """Reset the Union-Find structure."""

    def get_clusters(self) -> dict[int, int]:
        """Get cluster assignment for all elements.

        Returns
        -------
        dict[int, int]
            Mapping from element to its cluster root.
        """

Import

from text_dedup.utils.union_find import UnionFind

I/O Contract

Inputs

Name Type Required Description
x, y int Yes Element indices to find or union

Outputs

Name Type Description
find() returns int Root element of the set
get_clusters() returns dict[int, int] Complete mapping element → cluster root

Usage Examples

Building Clusters from Pairs

from text_dedup.utils.union_find import UnionFind

uf = UnionFind()

# Merge duplicate pairs
uf.union(0, 1)  # Doc 0 and Doc 1 are duplicates
uf.union(1, 2)  # Doc 1 and Doc 2 are duplicates (transitively, 0-1-2)
uf.union(5, 6)  # Doc 5 and Doc 6 are duplicates

clusters = uf.get_clusters()
# {0: root, 1: root, 2: root, 5: root2, 6: root2}
print(uf.find(0) == uf.find(2))  # True (same cluster)
print(uf.find(0) == uf.find(5))  # False (different clusters)

Related Pages

Implements Principle

Requires Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment