Implementation:NVIDIA NeMo Curator BucketsToEdgesStage

Implementation Metadata
Attribute	Value
Domains	Data_Curation, Deduplication, Graph_Processing
Implements	NVIDIA_NeMo_Curator_Bucket_to_Edge_Conversion
Last Updated	2026-02-14 17:00 GMT

Overview

BucketsToEdgesStage is the NeMo Curator processing stage that converts LSH bucket membership lists into pairwise document edges for graph-based duplicate detection.

Description

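BucketsToEdgesStage implements the ProcessingStage[FileGroupTask, FileGroupTask] interface. It reads LSH bucket Parquet files (containing _bucket_id and _curator_dedup_id columns), groups documents by bucket, pairs the documents within each bucket using itertools.pairwise, and writes edge Parquet files whose two columns hold the source and destination document IDs of each edge. Note that itertools.pairwise yields successive overlapping pairs, (d1, d2), (d2, d3), and so on, rather than every combination: a bucket of n documents produces n - 1 edges, which is still enough to place all of its members in one connected component downstream.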

The stage deduplicates edges so that each candidate pair appears only once in the output, regardless of how many buckets the pair co-occurred in.
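One way to picture this deduplication is shown below. It is a sketch under a stated assumption, namely that a pair is unordered, so (a, b) and (b, a) count as the same candidate pair. The `dedupe_edges` helper is hypothetical and is not part of the NeMo Curator API.

```python
def dedupe_edges(edges):
    """Return each unordered pair once, preserving first-seen order.

    Assumption: (a, b) and (b, a) describe the same candidate pair.
    """
    seen = set()
    unique = []
    for a, b in edges:
        # Canonicalize so the smaller ID always comes first.
        key = (a, b) if a <= b else (b, a)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# The pair (1, 2) shows up in three buckets, once with the IDs reversed.
raw_edges = [(1, 2), (2, 1), (3, 4), (1, 2)]
unique_edges = dedupe_edges(raw_edges)
print(unique_edges)
```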

Usage

from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

edges_stage = BucketsToEdgesStage(
    output_path="/output/edges/",
    document_id_field="_curator_dedup_id",
)

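# Run the stage on one task. lsh_bucket_task is assumed to be a FileGroupTask
# whose .data lists the LSH bucket Parquet files produced upstream.
output_tasks = edges_stage.process(lsh_bucket_task)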

Code Reference

Source Location

nemo_curator/stages/deduplication/fuzzy/buckets_to_edges.py, lines 30–91.

Signature

class BucketsToEdgesStage(ProcessingStage[FileGroupTask, FileGroupTask]):
    def __init__(
        self,
        output_path: str,
        document_id_field: str = "_curator_dedup_id",
        ...
    )

Import

from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

I/O Contract

Kind	Type / Name	Description
Input	`FileGroupTask`	A task whose `.data` contains paths to LSH bucket Parquet files with `_bucket_id` and `_curator_dedup_id` columns
Output	`FileGroupTask`	A task whose `.data` contains paths to edge Parquet files with `_curator_dedup_id_x` and `_curator_dedup_id_y` columns
Output Column	`_curator_dedup_id_x`	Document ID of the first document in the candidate pair
Output Column	`_curator_dedup_id_y`	Document ID of the second document in the candidate pair
Parameter	`output_path`	Directory path where edge Parquet files are written
Parameter	`document_id_field`	Name of the document ID column (default: `"_curator_dedup_id"`)

Usage Examples

Example 1: Standard bucket-to-edge conversion

from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

stage = BucketsToEdgesStage(
    output_path="/output/edges/",
    document_id_field="_curator_dedup_id",
)

Example 2: Custom document ID field

from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

stage = BucketsToEdgesStage(
    output_path="/output/edges/",
    document_id_field="doc_id",
)

Example 3: Integration in a pipeline

from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage
from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

lsh_stage = LSHStage(
    num_bands=20,
    minhashes_per_band=13,
    output_path="/output/lsh_buckets/",
)

edges_stage = BucketsToEdgesStage(
    output_path="/output/edges/",
)

# Pipeline: LSH -> Buckets to Edges
bucket_tasks = lsh_stage.process(minhash_task)
edge_tasks = edges_stage.process(bucket_tasks)

Related Pages

Principle:NVIDIA_NeMo_Curator_Bucket_to_Edge_Conversion
NVIDIA_NeMo_Curator_LSHStage — Upstream stage that produces LSH bucket assignments
NVIDIA_NeMo_Curator_ConnectedComponentsStage — Downstream stage that computes connected components from edges
NVIDIA_NeMo_Curator_FuzzyDeduplicationWorkflow — The parent workflow orchestrating all stages
Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
Environment:NVIDIA_NeMo_Curator_RAPIDS_GPU_Stack
Environment:NVIDIA_NeMo_Curator_Ray_Cluster
