Implementation:NVIDIA NeMo Curator LSHStage
| Attribute | Value |
|---|---|
| Domains | Data_Curation, Deduplication, Hashing |
| Implements | NVIDIA_NeMo_Curator_Locality_Sensitive_Hashing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
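LSHStage is the NeMo Curator processing stage that applies Locality Sensitive Hashing (LSH) to MinHash signatures. It splits each signature into bands and hashes each band into a bucket, so similar documents land in a shared bucket and near-duplicate candidates can be found with a sub-linear nearest-neighbor search instead of comparing every pair of documents.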
Description
LSHStage is a dataclass-based ProcessingStage[FileGroupTask, FileGroupTask] that reads MinHash signature Parquet files and produces bucket assignment Parquet files. For each band, the stage hashes the corresponding slice of each document's MinHash signature into a bucket ID, then writes out the mapping from bucket IDs to document IDs.
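The banding step described above can be illustrated with a minimal CPU-only sketch. It is not the stage's GPU implementation: the helper name band_bucket_rows, the use of MD5, and the requirement that the signature length equal num_bands * minhashes_per_band are assumptions made for illustration.

```python
import hashlib
import struct


def band_bucket_rows(doc_id, signature, num_bands, minhashes_per_band):
    """Return one (bucket_id, doc_id) row per band for a single document.

    Hypothetical helper: the real stage runs on GPU and may hash differently.
    """
    if len(signature) != num_bands * minhashes_per_band:
        raise ValueError("signature length must equal num_bands * minhashes_per_band")
    rows = []
    for band in range(num_bands):
        start = band * minhashes_per_band
        band_slice = signature[start:start + minhashes_per_band]
        # Prefix the band index so identical slices in different bands
        # do not fall into the same bucket.
        payload = struct.pack(f"<I{len(band_slice)}Q", band, *band_slice)
        bucket_id = hashlib.md5(payload).hexdigest()
        rows.append((bucket_id, doc_id))
    return rows


sig_a = [1, 2, 3, 4, 5, 6]
sig_b = [1, 2, 3, 9, 9, 9]  # agrees with sig_a only in the first band
rows_a = band_bucket_rows("doc-a", sig_a, num_bands=2, minhashes_per_band=3)
rows_b = band_bucket_rows("doc-b", sig_b, num_bands=2, minhashes_per_band=3)

# Buckets that contain both documents
shared = {b for b, _ in rows_a} & {b for b, _ in rows_b}
print(len(shared))  # 1
```

Two documents become duplicate candidates as soon as they share a bucket in at least one band, which is why agreement on a single band is enough here.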
The stage supports iterative band processing via bands_per_iteration, which limits how many bands are processed in a single GPU pass. This is critical for managing GPU memory when processing large datasets. GPU memory use is further controlled by the rmm_pool_size and spill_memory_limit parameters.
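How the stage schedules its passes internally is not shown on this page. Assuming bands are simply taken in consecutive chunks, the effect of bands_per_iteration can be sketched as follows (band_ranges is a hypothetical helper, not part of the library):

```python
def band_ranges(num_bands, bands_per_iteration):
    """Yield half-open (start, end) band ranges, one per pass."""
    if bands_per_iteration < 1:
        raise ValueError("bands_per_iteration must be at least 1")
    for start in range(0, num_bands, bands_per_iteration):
        # The last pass may cover fewer bands than the others.
        yield start, min(start + bands_per_iteration, num_bands)


passes = list(band_ranges(num_bands=20, bands_per_iteration=5))
print(passes)  # [(0, 5), (5, 10), (10, 15), (15, 20)]
```

A smaller bands_per_iteration means more passes, each holding fewer bands in memory at once.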
Usage
from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage
lsh_stage = LSHStage(
    num_bands=20,
    minhashes_per_band=13,
    output_path="/output/lsh_buckets/",
    bands_per_iteration=5,
)
# Execute within a pipeline
output_tasks = lsh_stage.process(minhash_task)
Code Reference
Source Location
nemo_curator/stages/deduplication/fuzzy/lsh/stage.py, lines 29–184.
Signature
@dataclass
class LSHStage(ProcessingStage[FileGroupTask, FileGroupTask]):
    num_bands: int
    minhashes_per_band: int
    id_field: str = "_curator_dedup_id"
    minhash_field: str = "_minhash_signature"
    output_path: str = "./"
    bands_per_iteration: int = 5
    rmm_pool_size: int | Literal["auto"] | None = "auto"
    spill_memory_limit: int | Literal["auto"] | None = "auto"
    ...
Import
from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | FileGroupTask | A task whose .data contains paths to MinHash Parquet files with _curator_dedup_id and _minhash_signature columns |
| Output | FileGroupTask | A task whose .data contains paths to LSH bucket Parquet files with _bucket_id and _curator_dedup_id columns |
| Output Column | _bucket_id | Hash value identifying the bucket for a given band |
| Output Column | _curator_dedup_id | The document ID carried through from the MinHash stage |
| Parameters | num_bands | Number of bands to split the MinHash signature into |
| Parameters | minhashes_per_band | Number of hash values per band (rows per band) |
| Parameters | bands_per_iteration | Number of bands processed in each GPU pass (default: 5) |
| Parameters | rmm_pool_size | RAPIDS Memory Manager pool size (default: "auto") |
| Parameters | spill_memory_limit | Memory limit before spilling to disk (default: "auto") |
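To show how the two output columns are meant to be read together, here is a small sketch that groups (_bucket_id, _curator_dedup_id) rows into candidate duplicate groups. It works on in-memory tuples rather than Parquet files. The helper candidate_groups is hypothetical, and whether LSHStage itself drops single-document buckets is not stated on this page.

```python
from collections import defaultdict


def candidate_groups(rows):
    """Group document IDs by bucket, keeping only buckets with 2+ documents."""
    buckets = defaultdict(list)
    for bucket_id, doc_id in rows:
        buckets[bucket_id].append(doc_id)
    # A bucket with a single document yields no duplicate candidates.
    return {b: sorted(ids) for b, ids in buckets.items() if len(ids) > 1}


rows = [
    ("b1", "doc-a"),
    ("b1", "doc-b"),
    ("b2", "doc-a"),
    ("b3", "doc-c"),
]
groups = candidate_groups(rows)
print(groups)  # {'b1': ['doc-a', 'doc-b']}
```

A downstream step can then expand each group into pairwise edges between its documents.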
Usage Examples
Example 1: Standard LSH configuration for 260-hash signatures
from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage
stage = LSHStage(
    num_bands=20,
    minhashes_per_band=13,
    output_path="/output/lsh_buckets/",
)
Example 2: High-threshold LSH for stricter deduplication
from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage
# Fewer bands with more hashes per band = higher threshold
# (roughly 0.9 by the (1/num_bands)^(1/minhashes_per_band) estimate)
stage = LSHStage(
    num_bands=10,
    minhashes_per_band=26,
    output_path="/output/lsh_strict/",
    bands_per_iteration=2,
)
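The link between band configuration and similarity threshold can be checked with the standard LSH banding estimates. This is a sketch using textbook formulas, not library code; approx_threshold and candidate_probability are hypothetical helper names.

```python
def approx_threshold(num_bands, minhashes_per_band):
    """Common estimate of the Jaccard similarity where the S-curve rises: (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


def candidate_probability(similarity, num_bands, minhashes_per_band):
    """Probability that two documents share at least one bucket: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - similarity ** minhashes_per_band) ** num_bands


for b, r in [(20, 13), (10, 26)]:
    print(b, r, round(approx_threshold(b, r), 2))
# 20 13 0.79
# 10 26 0.92
```

Under this estimate, both configurations use 260 hashes, but moving from 20 bands of 13 to 10 bands of 26 raises the threshold from about 0.79 to about 0.92.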
Example 3: Memory-constrained environment
from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage
stage = LSHStage(
    num_bands=20,
    minhashes_per_band=13,
    output_path="/output/lsh_buckets/",
    bands_per_iteration=2,               # Process fewer bands per pass
    rmm_pool_size=8 * 1024**3,           # Explicit 8 GiB memory pool, in bytes (the field is typed int)
    spill_memory_limit=16 * 1024**3,     # Spill limit of 16 GiB, in bytes
)
Related Pages
- Principle:NVIDIA_NeMo_Curator_Locality_Sensitive_Hashing
- NVIDIA_NeMo_Curator_MinHashStage — Upstream stage that produces MinHash signatures
- NVIDIA_NeMo_Curator_BucketsToEdgesStage — Downstream stage that converts buckets to pairwise edges
- NVIDIA_NeMo_Curator_FuzzyDeduplicationWorkflow — The parent workflow orchestrating all stages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- Environment:NVIDIA_NeMo_Curator_RAPIDS_GPU_Stack
- Environment:NVIDIA_NeMo_Curator_Ray_Cluster
- Heuristic:NVIDIA_NeMo_Curator_Deduplication_Blocksize_Tuning