Principle:NVIDIA NeMo Curator Bucket to Edge Conversion

Principle Metadata
Attribute	Value
Domains	Data_Curation, Deduplication, Graph_Processing
Implemented By	NVIDIA_NeMo_Curator_BucketsToEdgesStage
Last Updated	2026-02-14 17:00 GMT

Overview

Bucket to Edge Conversion is a technique for converting LSH bucket membership lists into pairwise document edges for graph-based duplicate detection.

Description

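Bucket to Edge Conversion transforms buckets (groups of potentially similar document IDs) into edge pairs using itertools.pairwise, creating a graph where edges represent candidate duplicates. Because itertools.pairwise yields only consecutive pairs of a sequence, a bucket of k documents contributes k - 1 edges that chain its members together, not every possible pair. The conversion operates as follows:

Bucket ingestion — Each bucket contains a list of document IDs that were hashed to the same LSH bucket in at least one band.
Pairwise enumeration — For each bucket, pairs of document IDs are emitted as edges. Exhaustive enumeration of a bucket with k documents would produce $\binom{k}{2} = k(k-1)/2$ edges; chaining consecutive IDs produces $k - 1$ edges, which is enough to keep every member of the bucket in the same connected component.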
Edge deduplication — Since the same pair of documents may appear in multiple buckets (across different bands), duplicate edges are removed to produce a clean edge list.

The resulting edge list forms the adjacency representation of a candidate duplicate graph, where nodes are documents and edges indicate that two documents are potential near-duplicates. This graph is then passed to the Connected Component Analysis stage for clustering.

Usage

Bucket to Edge Conversion is the fourth stage in the fuzzy deduplication pipeline, following LSH bucketing. It reads LSH bucket Parquet files and writes edge Parquet files containing pairs of document IDs.

from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

stage = BucketsToEdgesStage(
    output_path="/output/edges/",
    document_id_field="_curator_dedup_id",
)

Theoretical Basis

For each bucket, all pairs of documents are potential duplicates. Pairwise enumeration creates edges for connected-components analysis. The theoretical basis rests on several key observations:

Completeness — If two documents are true near-duplicates, they will appear in the same bucket with high probability (determined by the LSH S-curve). Connecting every member of a bucket, whether by exhaustive pairs or by a chain of consecutive pairs, ensures that no candidate pair ends up in different components.
Graph representation — The edge list naturally represents a graph where connected components correspond to clusters of mutually similar documents. This graph-theoretic formulation enables efficient clustering via standard algorithms.
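Edge explosion management — Under exhaustive enumeration, large buckets generate a combinatorial number of edges ($O(k^2)$ for a bucket of size k). Emitting only consecutive pairs, as itertools.pairwise does, reduces this to $k - 1$ edges per bucket without changing the connected components. In practice, bucket sizes follow a power-law distribution, and the majority of buckets are small (2–5 documents), keeping the total edge count manageable.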
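To make the difference in edge counts concrete, the short sketch below compares the two strategies for a set of made-up bucket sizes. The sizes and the helper name edge_counts are illustrative assumptions, not measurements from any dataset.

```python
from math import comb


def edge_counts(bucket_sizes):
    """Return (exhaustive, chained) edge totals for the given bucket sizes."""
    # Every pair within each bucket: k * (k - 1) / 2 edges.
    exhaustive = sum(comb(k, 2) for k in bucket_sizes)
    # Consecutive pairs only: k - 1 edges (none for empty or singleton buckets).
    chained = sum(max(k - 1, 0) for k in bucket_sizes)
    return exhaustive, chained


# Hypothetical sizes: three small buckets and one very large one.
sizes = [2, 3, 5, 1000]
exhaustive, chained = edge_counts(sizes)
print(exhaustive, chained)  # 499514 1006
```

A single bucket of 1000 documents dominates the exhaustive total, which is why the quadratic term matters even when most buckets are small.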

The conversion from buckets to edges is a necessary transformation because connected-component algorithms operate on edge lists (adjacency representations), not on set-membership representations. This stage bridges the gap between the LSH output format and the graph algorithm input format.

Related Pages

Implementation:NVIDIA_NeMo_Curator_BucketsToEdgesStage
NVIDIA_NeMo_Curator_Locality_Sensitive_Hashing — The preceding stage that produces LSH buckets
NVIDIA_NeMo_Curator_Connected_Component_Analysis — The subsequent stage that finds connected components in the edge graph
NVIDIA_NeMo_Curator_Text_Deduplication — The parent concept covering all deduplication techniques
