Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:NVIDIA NeMo Curator Locality Sensitive Hashing

From Leeroopedia
Principle Metadata
Attribute Value
Knowledge Sources Paper: LSH
Domains Data_Curation, Deduplication, Hashing
Implemented By NVIDIA_NeMo_Curator_LSHStage
Last Updated 2026-02-14 17:00 GMT

Overview

Locality Sensitive Hashing (LSH) is a technique for grouping similar documents into buckets by banding MinHash signatures to enable sub-linear nearest-neighbor search.

Description

LSH divides MinHash signatures into bands, hashes each band, and places documents sharing a band hash into the same bucket. This reduces the O(n^2) comparison problem to O(n) expected comparisons.

The process operates as follows:

  1. Band division — The MinHash signature of length num_hashes is split into b bands, each containing r consecutive hash values (where b * r = num_hashes).
  2. Band hashing — For each band, the r hash values are combined into a single band hash using a secondary hash function.
  3. Bucket assignment — Documents that produce the same band hash for any given band are placed into the same bucket. Each band produces its own independent set of buckets.

Two documents are considered candidate duplicates if they share at least one bucket across any band. The probability of this occurring is determined by the S-curve threshold function, which provides a tunable tradeoff between precision and recall.

Usage

LSH is the third stage in the fuzzy deduplication pipeline, following MinHash Signature Computation. It reads MinHash signature Parquet files and produces bucket assignment Parquet files that map bucket IDs to document IDs.

from nemo_curator.stages.deduplication.fuzzy.lsh.stage import LSHStage

stage = LSHStage(
    num_bands=20,
    minhashes_per_band=13,
    output_path="/output/lsh_buckets/",
)

Theoretical Basis

With b bands of r hashes each, the probability of two documents with Jaccard similarity s being hashed to the same bucket in at least one band is:

P(s)=1(1sr)b

This creates an S-curve threshold function with the following properties:

  • Threshold — The similarity value at which the detection probability is approximately 50% is t(1/b)1/r.
  • Steepness — Higher values of b and r produce a steeper S-curve, meaning sharper discrimination between similar and dissimilar pairs.
  • False positive rate — Pairs with similarity below the threshold have a low but non-zero probability of being bucketed together.
  • False negative rate — Pairs with similarity above the threshold have a low but non-zero probability of being missed.

Example configurations:

Bands (b) Rows (r) Total Hashes Approx. Threshold
20 13 260 ~0.72
10 26 260 ~0.85
26 10 260 ~0.66

The bands_per_iteration parameter in the implementation controls how many bands are processed in a single GPU pass, allowing memory-constrained environments to process large signature sets by iterating over subsets of bands.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment