Implementation:NVIDIA NeMo Curator BucketsToEdgesStage
| Attribute | Value |
|---|---|
| Domains | Data_Curation, Deduplication, Graph_Processing |
| Implements | NVIDIA_NeMo_Curator_Bucket_to_Edge_Conversion |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
BucketsToEdgesStage is the NeMo Curator processing stage that converts LSH bucket membership lists into pairwise document edges for graph-based duplicate detection.
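To illustrate what graph-based duplicate detection does with these edges, the sketch below groups documents into duplicate clusters using a small union-find. It is a simplified stand-in for a connected-components step, not NeMo Curator code; the function name `connected_groups` and the sample IDs are invented for the example.

```python
def connected_groups(edges):
    """Return sorted groups of document IDs connected by the given edges.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root,
        # halving the path along the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Merge the two components joined by each edge.
    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    # Collect members under their final root.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


groups = connected_groups([(1, 2), (2, 3), (10, 11)])
print(groups)  # [[1, 2, 3], [10, 11]]
```

Documents 1, 2, and 3 end up in one group even though no edge links 1 and 3 directly, which is why the edge list does not need to contain every pair.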
Description
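BucketsToEdgesStage implements the ProcessingStage[FileGroupTask, FileGroupTask] interface. It reads LSH bucket Parquet files (containing _bucket_id and _curator_dedup_id columns), groups documents by bucket, pairs the documents within each bucket using itertools.pairwise, and writes edge Parquet files with two columns holding the source and destination document IDs of each edge. Note that itertools.pairwise yields only adjacent pairs, so a bucket of n documents produces n - 1 edges rather than all n(n - 1)/2 combinations. This chain still connects every document in the bucket, which is sufficient for the downstream connected-components step.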
The stage handles edge deduplication to ensure that each candidate pair appears only once in the output, regardless of how many buckets the pair co-occurred in.
Usage
from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

edges_stage = BucketsToEdgesStage(
    output_path="/output/edges/",
    document_id_field="_curator_dedup_id",
)

# Run the stage on a FileGroupTask that points at LSH bucket files.
# lsh_bucket_task is assumed to come from an upstream LSH stage.
output_tasks = edges_stage.process(lsh_bucket_task)
Code Reference
Source Location
nemo_curator/stages/deduplication/fuzzy/buckets_to_edges.py, lines 30–91.
Signature
class BucketsToEdgesStage(ProcessingStage[FileGroupTask, FileGroupTask]):
    def __init__(
        self,
        output_path: str,
        document_id_field: str = "_curator_dedup_id",
        ...
    )
Import
from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage
I/O Contract
| Kind | Type / Name | Description |
|---|---|---|
| Input | FileGroupTask | A task whose .data contains paths to LSH bucket Parquet files with _bucket_id and _curator_dedup_id columns |
| Output | FileGroupTask | A task whose .data contains paths to edge Parquet files with _curator_dedup_id_x and _curator_dedup_id_y columns |
| Output Column | _curator_dedup_id_x | Document ID of the first document in the candidate pair |
| Output Column | _curator_dedup_id_y | Document ID of the second document in the candidate pair |
| Parameters | output_path | Directory path where edge Parquet files are written |
| Parameters | document_id_field | Name of the document ID column (default: "_curator_dedup_id") |
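The output column naming in the table can be expressed as a small helper. This is a sketch under an assumption: the table only documents `_curator_dedup_id_x` and `_curator_dedup_id_y` for the default field, so the idea that a custom `document_id_field` yields analogous `_x`/`_y` names is inferred rather than confirmed. The function `edge_columns` is hypothetical.

```python
def edge_columns(document_id_field="_curator_dedup_id"):
    """Derive the two edge column names from the document ID field.

    Assumption: the "_x"/"_y" suffix pattern documented for the default
    field also applies to custom field names.
    """
    return f"{document_id_field}_x", f"{document_id_field}_y"


default_cols = edge_columns()
custom_cols = edge_columns("doc_id")
print(default_cols)
print(custom_cols)
```

Checking the actual column names in a written edge file is the reliable way to confirm this before wiring up a downstream reader.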
Usage Examples
Example 1: Standard bucket-to-edge conversion
from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

stage = BucketsToEdgesStage(
    output_path="/output/edges/",
    document_id_field="_curator_dedup_id",
)
Example 2: Custom document ID field
from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

stage = BucketsToEdgesStage(
    output_path="/output/edges/",
    document_id_field="doc_id",
)
Example 3: Integration in a pipeline
from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage
from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

lsh_stage = LSHStage(
    num_bands=20,
    minhashes_per_band=13,
    output_path="/output/lsh_buckets/",
)
edges_stage = BucketsToEdgesStage(
    output_path="/output/edges/",
)

# Pipeline: LSH -> Buckets to Edges
# minhash_task is assumed to be the output of an upstream MinHash stage.
bucket_tasks = lsh_stage.process(minhash_task)
edge_tasks = edges_stage.process(bucket_tasks)
Related Pages
- Principle:NVIDIA_NeMo_Curator_Bucket_to_Edge_Conversion
- NVIDIA_NeMo_Curator_LSHStage — Upstream stage that produces LSH bucket assignments
- NVIDIA_NeMo_Curator_ConnectedComponentsStage — Downstream stage that computes connected components from edges
- NVIDIA_NeMo_Curator_FuzzyDeduplicationWorkflow — The parent workflow orchestrating all stages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- Environment:NVIDIA_NeMo_Curator_RAPIDS_GPU_Stack
- Environment:NVIDIA_NeMo_Curator_Ray_Cluster