Implementation:NVIDIA NeMo Curator BucketsToEdgesStage

From Leeroopedia
Implementation Metadata
Domains: Data_Curation, Deduplication, Graph_Processing
Implements: NVIDIA_NeMo_Curator_Bucket_to_Edge_Conversion
Last Updated: 2026-02-14 17:00 GMT

Overview

BucketsToEdgesStage is the NeMo Curator processing stage that converts LSH bucket membership lists into pairwise document edges for graph-based duplicate detection.

Description

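BucketsToEdgesStage implements the ProcessingStage[FileGroupTask, FileGroupTask] interface. It reads LSH bucket Parquet files (containing _bucket_id and _curator_dedup_id columns), groups documents by bucket, pairs the documents within each bucket using itertools.pairwise, and writes edge Parquet files with two columns holding the source and destination document IDs of each edge (_curator_dedup_id_x and _curator_dedup_id_y in the I/O contract below). Note that itertools.pairwise yields consecutive overlapping pairs, so a bucket of n documents produces n - 1 edges rather than all n(n - 1)/2 combinations; that chain is still enough to place every member of a bucket in the same connected component.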

The stage handles edge deduplication to ensure that each candidate pair appears only once in the output, regardless of how many buckets the pair co-occurred in.
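One way to picture the deduplication is a set of already-seen pairs. The sketch below is an assumption-laden illustration rather than the stage's implementation: dedupe_edges is a hypothetical helper, and treating (a, b) and (b, a) as the same pair is a choice made for the example.

```python
def dedupe_edges(edges):
    """Hypothetical helper: drop repeated pairs, keeping first-seen order."""
    seen = set()
    unique = []
    for a, b in edges:
        # Assumption: edges are undirected, so normalise the pair's order.
        key = (a, b) if a <= b else (b, a)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# (1, 2) shows up twice, as it would if both documents shared two buckets.
raw_edges = [(1, 2), (2, 3), (1, 2), (3, 2), (4, 5)]
unique_edges = dedupe_edges(raw_edges)
print(unique_edges)
```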

Usage

from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

edges_stage = BucketsToEdgesStage(
    output_path="/output/edges/",
    document_id_field="_curator_dedup_id",
)

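# Execute within a pipeline.
# lsh_bucket_task is assumed to be a FileGroupTask produced upstream whose
# .data lists the LSH bucket Parquet files.
output_tasks = edges_stage.process(lsh_bucket_task)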

Code Reference

Source Location

nemo_curator/stages/deduplication/fuzzy/buckets_to_edges.py, lines 30–91.

Signature

class BucketsToEdgesStage(ProcessingStage[FileGroupTask, FileGroupTask]):
    def __init__(
        self,
        output_path: str,
        document_id_field: str = "_curator_dedup_id",
        ...
    )

Import

from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

I/O Contract

Direction | Type / Name | Description
Input | FileGroupTask | A task whose .data contains paths to LSH bucket Parquet files with _bucket_id and _curator_dedup_id columns
Output | FileGroupTask | A task whose .data contains paths to edge Parquet files with _curator_dedup_id_x and _curator_dedup_id_y columns
Output Column | _curator_dedup_id_x | Document ID of the first document in the candidate pair
Output Column | _curator_dedup_id_y | Document ID of the second document in the candidate pair
Parameters | output_path | Directory path where edge Parquet files are written
Parameters | document_id_field | Name of the document ID column (default: "_curator_dedup_id")
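To illustrate how the two output columns might feed graph-based duplicate detection, the sketch below groups documents into connected components with a small union-find. It is not part of BucketsToEdgesStage: connected_groups is a hypothetical helper, and dictionaries stand in for rows read from the edge Parquet files.

```python
def connected_groups(edge_rows,
                     x_col="_curator_dedup_id_x",
                     y_col="_curator_dedup_id_y"):
    """Hypothetical helper: group document IDs linked by edges."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for row in edge_rows:
        root_x, root_y = find(row[x_col]), find(row[y_col])
        if root_x != root_y:
            # Attach the larger root under the smaller one.
            parent[max(root_x, root_y)] = min(root_x, root_y)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


edge_rows = [
    {"_curator_dedup_id_x": 1, "_curator_dedup_id_y": 2},
    {"_curator_dedup_id_x": 2, "_curator_dedup_id_y": 3},
    {"_curator_dedup_id_x": 7, "_curator_dedup_id_y": 8},
]
groups = connected_groups(edge_rows)
print(groups)
```

Documents 1 and 3 never share an edge directly, yet they land in the same group through document 2, which is why chained edges within a bucket are sufficient for this kind of grouping.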

Usage Examples

Example 1: Standard bucket-to-edge conversion

from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

stage = BucketsToEdgesStage(
    output_path="/output/edges/",
    document_id_field="_curator_dedup_id",
)

Example 2: Custom document ID field

from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

stage = BucketsToEdgesStage(
    output_path="/output/edges/",
    document_id_field="doc_id",
)

Example 3: Integration in a pipeline

from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage
from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

lsh_stage = LSHStage(
    num_bands=20,
    minhashes_per_band=13,
    output_path="/output/lsh_buckets/",
)

edges_stage = BucketsToEdgesStage(
    output_path="/output/edges/",
)

# Pipeline: LSH -> Buckets to Edges
# minhash_task is assumed to come from an upstream MinHash stage (not shown).
bucket_tasks = lsh_stage.process(minhash_task)
edge_tasks = edges_stage.process(bucket_tasks)
