Principle:NVIDIA NeMo Curator Connected Component Analysis
| Attribute | Value |
|---|---|
| Knowledge Sources | Paper: Scalable Data Deduplication |
| Domains | Data_Curation, Deduplication, Graph_Processing |
| Implemented By | NVIDIA_NeMo_Curator_ConnectedComponentsStage |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Connected Component Analysis is a technique for finding clusters of duplicate documents by computing weakly connected components in a document similarity graph.
Description
Connected Component Analysis uses GPU-accelerated graph algorithms (pylibcugraph) to find connected components in the edge graph produced by the Bucket to Edge Conversion stage, grouping all transitively-similar documents into clusters.
The process works as follows:
- Edge aggregation — All edge Parquet files from the Bucket to Edge Conversion stage are loaded into a single graph. This requires reading ALL edges at once, since connected components is a global graph property.
- Graph construction — The edge list is converted into a graph representation suitable for GPU-accelerated processing using RAFT/cuGraph.
- Component labeling — The weakly connected components algorithm assigns each document a component label (
_duplicate_group_id). All documents within the same component receive the same label. - Output — Each document ID is written with its assigned component label to Parquet files.
This stage is unique in the pipeline because it requires a global view of all edges. While previous stages can process partitions independently, connected component analysis must see the entire graph to correctly identify transitive relationships.
Usage
Connected Component Analysis is the fifth stage in the fuzzy deduplication pipeline, following Bucket to Edge Conversion. It consumes all edge files simultaneously and produces a mapping from document IDs to duplicate group IDs.
from nemo_curator.stages.deduplication.fuzzy.connected_components import ConnectedComponentsStage
stage = ConnectedComponentsStage(
output_path="/output/connected_components/",
)
Theoretical Basis
Weakly connected components identify the transitive closure of similarity: if document A is similar to document B, and document B is similar to document C, then {A, B, C} form one component — even if A and C are not directly similar.
Formally, two nodes u and v are in the same weakly connected component if and only if there exists an undirected path from u to v in the graph. The connected components partition the vertex set into disjoint subsets:
Key properties:
- Transitivity — The transitive closure ensures that even indirect duplicates are grouped together. This is essential for thorough deduplication, as chains of near-duplicates (A~B, B~C, C~D) should all be treated as one group.
- Determinism — Connected components produce a deterministic partition of documents into groups, regardless of the order in which edges are processed.
- Scalability — GPU acceleration via RAFT/cuGraph enables processing graphs with billions of edges. The algorithm runs in near-linear time with respect to the number of edges.
Potential concern — component inflation: Transitive closure can create very large components if the similarity graph has high connectivity (e.g., many documents sharing common boilerplate text). In extreme cases, a single component may contain millions of documents. The downstream Fuzzy Duplicate Identification stage handles this by retaining one representative per component regardless of size.
Related Pages
- Implementation:NVIDIA_NeMo_Curator_ConnectedComponentsStage
- NVIDIA_NeMo_Curator_Bucket_to_Edge_Conversion — The preceding stage that produces the edge graph
- NVIDIA_NeMo_Curator_Fuzzy_Duplicate_Identification — The subsequent stage that selects duplicates for removal
- NVIDIA_NeMo_Curator_Text_Deduplication — The parent concept covering all deduplication techniques