Principle:NVIDIA NeMo Curator Locality Sensitive Hashing
| Attribute | Value |
|---|---|
| Knowledge Sources | Paper: LSH |
| Domains | Data_Curation, Deduplication, Hashing |
| Implemented By | NVIDIA_NeMo_Curator_LSHStage |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Locality Sensitive Hashing (LSH) is a technique for grouping similar documents into buckets by banding MinHash signatures to enable sub-linear nearest-neighbor search.
Description
LSH divides MinHash signatures into bands, hashes each band, and places documents sharing a band hash into the same bucket. This reduces the O(n^2) comparison problem to O(n) expected comparisons.
The process operates as follows:
- Band division — The MinHash signature of length
num_hashesis split intobbands, each containingrconsecutive hash values (whereb * r = num_hashes). - Band hashing — For each band, the
rhash values are combined into a single band hash using a secondary hash function. - Bucket assignment — Documents that produce the same band hash for any given band are placed into the same bucket. Each band produces its own independent set of buckets.
Two documents are considered candidate duplicates if they share at least one bucket across any band. The probability of this occurring is determined by the S-curve threshold function, which provides a tunable tradeoff between precision and recall.
Usage
LSH is the third stage in the fuzzy deduplication pipeline, following MinHash Signature Computation. It reads MinHash signature Parquet files and produces bucket assignment Parquet files that map bucket IDs to document IDs.
from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage
stage = LSHStage(
num_bands=20,
minhashes_per_band=13,
output_path="/output/lsh_buckets/",
)
Theoretical Basis
With b bands of r hashes each, the probability of two documents with Jaccard similarity s being hashed to the same bucket in at least one band is:
This creates an S-curve threshold function with the following properties:
- Threshold — The similarity value at which the detection probability is approximately 50% is .
- Steepness — Higher values of
bandrproduce a steeper S-curve, meaning sharper discrimination between similar and dissimilar pairs. - False positive rate — Pairs with similarity below the threshold have a low but non-zero probability of being bucketed together.
- False negative rate — Pairs with similarity above the threshold have a low but non-zero probability of being missed.
Example configurations:
| Bands (b) | Rows (r) | Total Hashes | Approx. Threshold |
|---|---|---|---|
| 20 | 13 | 260 | ~0.72 |
| 10 | 26 | 260 | ~0.85 |
| 26 | 10 | 260 | ~0.66 |
The bands_per_iteration parameter in the implementation controls how many bands are processed in a single GPU pass, allowing memory-constrained environments to process large signature sets by iterating over subsets of bands.
Related Pages
- Implementation:NVIDIA_NeMo_Curator_LSHStage
- NVIDIA_NeMo_Curator_MinHash_Signature_Computation — The preceding stage that produces MinHash signatures
- NVIDIA_NeMo_Curator_Bucket_to_Edge_Conversion — The subsequent stage that converts buckets to edges
- NVIDIA_NeMo_Curator_Text_Deduplication — The parent concept covering all deduplication techniques