Principle:NVIDIA NeMo Curator Bucket to Edge Conversion

From Leeroopedia
Principle Metadata
Domains: Data_Curation, Deduplication, Graph_Processing
Implemented By: NVIDIA_NeMo_Curator_BucketsToEdgesStage
Last Updated: 2026-02-14 17:00 GMT

Overview

Bucket to Edge Conversion is a technique for converting LSH bucket membership lists into pairwise document edges for graph-based duplicate detection.

Description

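Bucket to Edge Conversion transforms buckets (groups of potentially similar document IDs) into edge pairs using itertools.pairwise, creating a graph where edges represent candidate duplicates. Note that itertools.pairwise yields only consecutive pairs from a sequence ($k-1$ pairs for $k$ items), which is enough to keep all members of a bucket in one connected component; enumerating every pair, as described in step 2, yields $k(k-1)/2$ pairs instead. The conversion operates as follows: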

  1. Bucket ingestion — Each bucket contains a list of document IDs that were hashed to the same LSH bucket in at least one band.
  2. Pairwise enumeration — For each bucket, all pairs of document IDs are enumerated as edges. A bucket with $k$ documents produces $\binom{k}{2} = k(k-1)/2$ edges.
  3. Edge deduplication — Since the same pair of documents may appear in multiple buckets (across different bands), duplicate edges are removed to produce a clean edge list.
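The three steps above can be sketched in plain Python. This is a minimal illustration, not the NeMo Curator implementation: the function name `buckets_to_edges`, the in-memory list-of-lists input, and the sample document IDs are all assumptions made for the example, and `itertools.combinations` is used here because it enumerates every pair as step 2 describes.

```python
from itertools import combinations


def buckets_to_edges(buckets):
    """Turn bucket membership lists into a deduplicated, sorted edge list.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    edges = set()
    for doc_ids in buckets:
        # Drop repeated IDs inside one bucket so no self-loops are produced,
        # and sort so every pair comes out as (smaller_id, larger_id).
        unique_ids = sorted(set(doc_ids))
        # Step 2: every unordered pair in the bucket becomes an edge.
        for left, right in combinations(unique_ids, 2):
            # Step 3: the set discards a pair already seen in another bucket.
            edges.add((left, right))
    return sorted(edges)


# Made-up buckets: each inner list is one LSH bucket.
buckets = [
    ["doc_a", "doc_b", "doc_c"],  # bucket from one band
    ["doc_b", "doc_c"],           # another band repeats the (doc_b, doc_c) pair
    ["doc_d", "doc_e"],
]
edges = buckets_to_edges(buckets)
print(edges)
```

Storing each edge in a canonical (smaller, larger) order is what lets a plain set remove duplicates that arrive from different bands.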

The resulting edge list forms the adjacency representation of a candidate duplicate graph, where nodes are documents and edges indicate that two documents are potential near-duplicates. This graph is then passed to the Connected Component Analysis stage for clustering.
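To make the hand-off to clustering concrete, here is a small union-find sketch that groups documents from an edge list. It is an assumption-laden illustration: the function name `connected_components` and the sample edges are hypothetical, and the actual Connected Component Analysis stage may use a different algorithm.

```python
def connected_components(edges):
    """Group nodes of an undirected edge list into connected components.

    Hypothetical union-find helper for illustration only.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


# Made-up candidate-duplicate edges.
sample_edges = [("doc_a", "doc_b"), ("doc_b", "doc_c"), ("doc_d", "doc_e")]
clusters = connected_components(sample_edges)
print(clusters)
```

In this example doc_a and doc_c land in the same cluster without a direct edge between them, because both are linked to doc_b.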

Usage

Bucket to Edge Conversion is the fourth stage in the fuzzy deduplication pipeline, following LSH bucketing. It reads LSH bucket Parquet files and writes edge Parquet files containing pairs of document IDs.

from nemo_curator.stages.deduplication.fuzzy.buckets_to_edges import BucketsToEdgesStage

stage = BucketsToEdgesStage(
    output_path="/output/edges/",
    document_id_field="_curator_dedup_id",
)

Theoretical Basis

For each bucket, all pairs of documents are potential duplicates. Pairwise enumeration creates edges for connected-components analysis. The theoretical basis rests on several key observations:

  • Completeness — If two documents are true near-duplicates, they will appear in the same bucket with high probability (determined by the LSH S-curve). Enumerating all pairs within buckets ensures no candidate pair is missed.
  • Graph representation — The edge list naturally represents a graph where connected components correspond to clusters of mutually similar documents. This graph-theoretic formulation enables efficient clustering via standard algorithms.
  • Edge explosion management — Large buckets can generate a combinatorial number of edges ($O(k^2)$ for a bucket of size $k$). In practice, bucket sizes follow a power-law distribution, and the majority of buckets are small (2–5 documents), keeping the total edge count manageable.
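The effect of bucket size on edge count can be checked with a few lines of arithmetic. The sketch below compares full pairwise enumeration, $k(k-1)/2$ edges per bucket, with a chain of consecutive pairs, $k-1$ edges per bucket, which is what `itertools.pairwise` produces. The helper name `edge_counts` and the bucket-size distribution are invented for illustration and are not measured data.

```python
from math import comb


def edge_counts(bucket_sizes):
    """Return (all_pairs_total, chain_total) for a list of bucket sizes.

    Hypothetical helper for illustration only.
    """
    # All pairs: C(k, 2) = k * (k - 1) / 2 edges per bucket.
    all_pairs_total = sum(comb(k, 2) for k in bucket_sizes)
    # Chain of consecutive pairs: k - 1 edges per bucket (0 for empty buckets).
    chain_total = sum(max(k - 1, 0) for k in bucket_sizes)
    return all_pairs_total, chain_total


# Invented distribution: many small buckets plus one very large bucket.
sizes = [2] * 1000 + [5] * 100 + [1000]
all_pairs_total, chain_total = edge_counts(sizes)
print(f"all pairs: {all_pairs_total}, chain: {chain_total}")
```

With these made-up sizes, the single 1000-document bucket accounts for nearly all edges under full enumeration, while the chain total grows only linearly with bucket size. Both edge sets connect the same documents, so connected-components clustering gives the same groups either way.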

The conversion from buckets to edges is a necessary transformation because connected-component algorithms operate on edge lists (adjacency representations), not on set-membership representations. This stage bridges the gap between the LSH output format and the graph algorithm input format.
