Principle:NVIDIA NeMo Curator Connected Component Analysis

Principle Metadata
Attribute	Value
Knowledge Sources	Paper: Scalable Data Deduplication
Domains	Data_Curation, Deduplication, Graph_Processing
Implemented By	NVIDIA_NeMo_Curator_ConnectedComponentsStage
Last Updated	2026-02-14 17:00 GMT

Overview

Connected Component Analysis is a technique for finding clusters of duplicate documents by computing weakly connected components in a document similarity graph.

Description

Connected Component Analysis uses GPU-accelerated graph algorithms (pylibcugraph) to find connected components in the edge graph produced by the Bucket to Edge Conversion stage, grouping all transitively-similar documents into clusters.

The process works as follows:

Edge aggregation — All edge Parquet files from the Bucket to Edge Conversion stage are loaded into a single graph. This requires reading ALL edges at once, since connected components is a global graph property.
Graph construction — The edge list is converted into a graph representation suitable for GPU-accelerated processing using RAFT/cuGraph.
Component labeling — The weakly connected components algorithm assigns each document a component label (_duplicate_group_id). All documents within the same component receive the same label.
Output — Each document ID is written with its assigned component label to Parquet files.

This stage is unique in the pipeline because it requires a global view of all edges. While previous stages can process partitions independently, connected component analysis must see the entire graph to correctly identify transitive relationships.

Usage

Connected Component Analysis is the fifth stage in the fuzzy deduplication pipeline, following Bucket to Edge Conversion. It consumes all edge files simultaneously and produces a mapping from document IDs to duplicate group IDs.

from nemo_curator.stages.deduplication.fuzzy.connected_components import ConnectedComponentsStage

stage = ConnectedComponentsStage(
    output_path="/output/connected_components/",
)

Theoretical Basis

Weakly connected components identify the transitive closure of similarity: if document A is similar to document B, and document B is similar to document C, then {A, B, C} form one component — even if A and C are not directly similar.

Formally, two nodes u and v are in the same weakly connected component if and only if there exists an undirected path from u to v in the graph. The connected components partition the vertex set into disjoint subsets:

$V = C_{1} \cup C_{2} \cup \dots \cup C_{k}, C_{i} \cap C_{j} = \emptyset for i \neq j$

Key properties:

Transitivity — The transitive closure ensures that even indirect duplicates are grouped together. This is essential for thorough deduplication, as chains of near-duplicates (A~B, B~C, C~D) should all be treated as one group.
Determinism — Connected components produce a deterministic partition of documents into groups, regardless of the order in which edges are processed.
Scalability — GPU acceleration via RAFT/cuGraph enables processing graphs with billions of edges. The algorithm runs in near-linear time with respect to the number of edges.

Potential concern — component inflation: Transitive closure can create very large components if the similarity graph has high connectivity (e.g., many documents sharing common boilerplate text). In extreme cases, a single component may contain millions of documents. The downstream Fuzzy Duplicate Identification stage handles this by retaining one representative per component regardless of size.

Related Pages

Implementation:NVIDIA_NeMo_Curator_ConnectedComponentsStage
NVIDIA_NeMo_Curator_Bucket_to_Edge_Conversion — The preceding stage that produces the edge graph
NVIDIA_NeMo_Curator_Fuzzy_Duplicate_Identification — The subsequent stage that selects duplicates for removal
NVIDIA_NeMo_Curator_Text_Deduplication — The parent concept covering all deduplication techniques

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment