
Implementation:NVIDIA NeMo Curator LSHStage

From Leeroopedia
Implementation Metadata
Attribute Value
Domains Data_Curation, Deduplication, Hashing
Implements NVIDIA_NeMo_Curator_Locality_Sensitive_Hashing
Last Updated 2026-02-14 17:00 GMT

Overview

LSHStage is the NeMo Curator processing stage that groups similar documents into candidate buckets by banding their MinHash signatures, a Locality Sensitive Hashing scheme that finds near-duplicate candidates in sub-linear time.

Description

LSHStage is a dataclass-based ProcessingStage[FileGroupTask, FileGroupTask] that reads MinHash signature Parquet files and produces bucket assignment Parquet files. For each band, the stage hashes the corresponding slice of each document's MinHash signature into a bucket ID, then writes out the mapping from bucket IDs to document IDs.

The stage supports iterative band processing via bands_per_iteration, which limits how many bands are processed in a single GPU pass. This is critical for managing GPU memory when processing large datasets. GPU memory management is further controlled by rmm_pool_size and spill_memory_limit parameters.
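The banding step described above can be sketched in plain Python. This is an illustrative model of the technique, not the NeMo Curator internals (which operate on GPU dataframes): each band's slice of the signature is hashed to a bucket ID, and two documents become candidates if any band's slice matches exactly.

```python
# Illustrative sketch of MinHash banding (not the NeMo Curator GPU code).
def band_buckets(signature, num_bands, minhashes_per_band):
    """Yield (band_index, bucket_id) pairs for one document's signature."""
    assert len(signature) == num_bands * minhashes_per_band
    for band in range(num_bands):
        start = band * minhashes_per_band
        band_slice = tuple(signature[start:start + minhashes_per_band])
        # Salt the hash with the band index so identical slices in
        # different bands land in different buckets.
        yield band, hash((band, band_slice))

# Two documents collide in a band iff that band's slice matches exactly.
sig_a = list(range(260))          # 20 bands x 13 hashes per band
sig_b = list(range(260))
sig_b[0] = 999                    # differ only inside band 0
buckets_a = dict(band_buckets(sig_a, 20, 13))
buckets_b = dict(band_buckets(sig_b, 20, 13))
matching = [b for b in range(20) if buckets_a[b] == buckets_b[b]]
```

Here the two signatures collide in 19 of 20 bands, so the pair would still be flagged as a duplicate candidate; a single band match is enough.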

Usage

from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage

lsh_stage = LSHStage(
    num_bands=20,
    minhashes_per_band=13,
    output_path="/output/lsh_buckets/",
    bands_per_iteration=5,
)

# Execute within a pipeline
output_tasks = lsh_stage.process(minhash_task)

Code Reference

Source Location

nemo_curator/stages/deduplication/fuzzy/lsh/stage.py, lines 29–184.

Signature

@dataclass
class LSHStage(ProcessingStage[FileGroupTask, FileGroupTask]):
    num_bands: int
    minhashes_per_band: int
    id_field: str = "_curator_dedup_id"
    minhash_field: str = "_minhash_signature"
    output_path: str = "./"
    bands_per_iteration: int = 5
    rmm_pool_size: int | Literal["auto"] | None = "auto"
    spill_memory_limit: int | Literal["auto"] | None = "auto"
    ...

Import

from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage

I/O Contract

Direction      Type / Field         Description
Input          FileGroupTask        A task whose .data contains paths to MinHash Parquet files with _curator_dedup_id and _minhash_signature columns
Output         FileGroupTask        A task whose .data contains paths to LSH bucket Parquet files with _bucket_id and _curator_dedup_id columns
Output column  _bucket_id           Hash value identifying the bucket for a given band
Output column  _curator_dedup_id    The document ID carried through from the MinHash stage
Parameter      num_bands            Number of bands to split the MinHash signature into
Parameter      minhashes_per_band   Number of hash values per band (rows per band)
Parameter      bands_per_iteration  Number of bands processed in each GPU pass (default: 5)
Parameter      rmm_pool_size        RAPIDS Memory Manager pool size (default: "auto")
Parameter      spill_memory_limit   Memory limit before spilling to disk (default: "auto")
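Downstream stages consume the bucket output by grouping document IDs per bucket: any two documents sharing a bucket are duplicate candidates. The sketch below uses hypothetical in-memory rows standing in for the _bucket_id/_curator_dedup_id columns of the output Parquet files; it is not part of the NeMo Curator API.

```python
# Hypothetical in-memory stand-in for the (_bucket_id, _curator_dedup_id)
# columns of the LSH bucket output.
from collections import defaultdict
from itertools import combinations

rows = [
    (101, "doc-a"), (101, "doc-b"), (202, "doc-c"),
    (303, "doc-a"), (303, "doc-c"),
]

by_bucket = defaultdict(list)
for bucket_id, doc_id in rows:
    by_bucket[bucket_id].append(doc_id)

# Documents sharing any bucket are candidate duplicate pairs.
candidate_pairs = sorted(
    {tuple(sorted(p))
     for docs in by_bucket.values()
     for p in combinations(docs, 2)}
)
print(candidate_pairs)  # [('doc-a', 'doc-b'), ('doc-a', 'doc-c')]
```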

Usage Examples

Example 1: Standard LSH configuration for 260-hash signatures

from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage

stage = LSHStage(
    num_bands=20,
    minhashes_per_band=13,
    output_path="/output/lsh_buckets/",
)

Example 2: High-threshold LSH for stricter deduplication

from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage

# Fewer bands with more hashes per band = higher threshold (~0.92)
stage = LSHStage(
    num_bands=10,
    minhashes_per_band=26,
    output_path="/output/lsh_strict/",
    bands_per_iteration=2,
)
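The band/row trade-off can be estimated with the standard MinHash-LSH rule of thumb (this is the textbook approximation, not necessarily how NeMo Curator reports its threshold): with b bands of r rows each, the similarity at which a pair becomes likely to collide is roughly t ≈ (1/b)^(1/r).

```python
# Rule-of-thumb LSH similarity threshold for b bands of r rows each.
def lsh_threshold(num_bands: int, minhashes_per_band: int) -> float:
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)

print(round(lsh_threshold(20, 13), 3))  # standard config -> 0.794
print(round(lsh_threshold(10, 26), 3))  # stricter config -> 0.915
```

So the 20x13 configuration targets pairs above roughly 0.8 Jaccard similarity, while 10x26 only buckets pairs above roughly 0.92.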

Example 3: Memory-constrained environment

from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage

stage = LSHStage(
    num_bands=20,
    minhashes_per_band=13,
    output_path="/output/lsh_buckets/",
    bands_per_iteration=2,          # Process fewer bands per pass
    rmm_pool_size=8 * 1024**3,      # Explicit 8 GiB memory pool (bytes)
    spill_memory_limit=16 * 1024**3,  # Spill to disk beyond 16 GiB (bytes)
)
