Principle:NVIDIA NeMo Curator MinHash Signature Computation

From Leeroopedia
Principle Metadata
Attribute | Value
Knowledge Sources | Paper: MinHash and LSH, Paper: Deduplicating Training Data
Domains | Data_Curation, Deduplication, Hashing
Implemented By | NVIDIA_NeMo_Curator_MinHashStage
Last Updated | 2026-02-14 17:00 GMT

Overview

MinHash Signature Computation generates compact, locality-sensitive hash signatures from text by shingling each document into character n-grams, so that near-duplicate documents can be found through approximate similarity search instead of exhaustive pairwise comparison.

Description

MinHash transforms documents into fixed-size hash signatures that preserve Jaccard similarity, enabling efficient near-duplicate detection at scale. The process works in two phases:

  1. Shingling — Each document is decomposed into a set of overlapping character n-grams (shingles). For example, the text "hello" with n=3 yields the shingle set {"hel", "ell", "llo"}.
  2. MinHash computation — For each of num_hashes independent hash functions, the minimum hash value across all shingles is computed. The resulting vector of minimum hash values forms the document's MinHash signature.
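The two phases above can be sketched in plain Python. This is an illustrative CPU version only, not the cuDF kernel that NeMo Curator uses: the function names are hypothetical, the salted BLAKE2b hash is an assumed stand-in for a family of independent hash functions, and treating a text shorter than n as a single shingle is likewise an assumption.

```python
import hashlib


def char_shingles(text: str, n: int) -> set[str]:
    """Phase 1: return the set of overlapping character n-grams of `text`."""
    if not text:
        return set()
    if len(text) < n:
        # Assumption for this sketch: a short text is its own single shingle.
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def salted_hash(shingle: str, salt: int) -> int:
    """One member of the hash family: a 64-bit BLAKE2b digest keyed by `salt`."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=salt.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, n: int, num_hashes: int, seed: int = 42) -> list[int]:
    """Phase 2: keep the minimum hash over all shingles, once per hash function."""
    shingles = char_shingles(text, n)
    if not shingles:
        raise ValueError("cannot compute a MinHash signature for empty text")
    return [
        min(salted_hash(s, seed + i) for s in shingles)
        for i in range(num_hashes)
    ]
```

Calling char_shingles("hello", 3) reproduces the shingle set from the example above, and minhash_signature returns a list of num_hashes integers regardless of document length, which is what makes signatures cheap to store and compare.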

The key property of MinHash is that the probability of two documents sharing the same MinHash value for a given hash function equals the Jaccard similarity between their shingle sets. By computing multiple independent MinHash values, the signature provides an increasingly accurate estimate of the true Jaccard similarity.

In NeMo Curator, MinHash computation is GPU-accelerated using cuDF.Series.str.minhash() (or its 64-bit variant minhash64()), which processes millions of documents per second on modern NVIDIA GPUs.

Usage

MinHash Signature Computation is the second stage in the fuzzy deduplication pipeline, immediately following File Partitioning. It reads document files (JSONL or Parquet), computes MinHash signatures, and writes the results as Parquet files containing the document ID (_curator_dedup_id) and the signature array (_minhash_signature).

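from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage

stage = MinHashStage(
    output_path="/output/minhashes/",  # where the Parquet signature files are written
    text_field="text",                 # field holding the document text
    char_ngrams=24,                    # shingle length in characters
    num_hashes=260,                    # number of hash functions (signature length)
    seed=42,                           # seed for the hash functions
    use_64bit_hash=False,              # set True to use 64-bit hash values
    read_format="jsonl",               # input file format (JSONL here)
)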

Theoretical Basis

MinHash provides an unbiased estimator of the Jaccard similarity between two sets:

J(A,B) = |A ∩ B| / |A ∪ B|

where A and B are sets of character n-gram shingles extracted from two documents. The fundamental theorem states:

Pr[min(h(A))=min(h(B))]=J(A,B)

for a hash function h drawn from a min-wise independent family. By computing k independent MinHash values, the fraction of matching values between two signatures provides an estimate of Jaccard similarity whose standard error is sqrt(J(1 − J)/k), that is, O(1/√k): quadrupling the number of hashes halves the error.
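As a small worked sketch of the estimator: each signature position matches with probability J(A,B), so the fraction of matching positions is a binomial proportion with standard error sqrt(J(1 − J)/k). The helper names below are hypothetical and the code is independent of NeMo Curator.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| (defined as 1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    if not sig_a:
        raise ValueError("signatures must not be empty")
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def standard_error(j: float, k: int) -> float:
    """Standard error of the estimate for true similarity `j` and `k` hash functions."""
    return math.sqrt(j * (1.0 - j) / k)
```

The error is largest at J = 0.5, where standard_error(0.5, 260) is roughly 0.031, so a 260-value signature resolves similarity to within a few percentage points.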

Key parameters and their effects:

  • char_ngrams (default 24) — The length of character n-grams used for shingling. Longer shingles are more sensitive to small differences, making the deduplication more precise but less tolerant of minor variations.
  • num_hashes (default 260) — The number of independent hash functions. More hashes improve the accuracy of the similarity estimate but increase signature storage and computation cost.
  • use_64bit_hash — Using 64-bit hashes reduces the probability of hash collisions at the cost of doubling the signature size.
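The storage trade-off in the last two bullets is simple arithmetic, sketched below. It assumes that the non-64-bit mode stores 32-bit values (implied by the "doubling" above but not stated explicitly) and counts only raw signature bytes, ignoring Parquet encoding and compression; signature_bytes is a hypothetical helper.

```python
def signature_bytes(num_hashes: int, use_64bit_hash: bool) -> int:
    """Raw size of one MinHash signature in bytes."""
    # Assumption: 4 bytes per value in 32-bit mode, 8 bytes in 64-bit mode.
    bytes_per_hash = 8 if use_64bit_hash else 4
    return num_hashes * bytes_per_hash


# With the default of 260 hashes:
size_32 = signature_bytes(260, use_64bit_hash=False)  # 1040 bytes
size_64 = signature_bytes(260, use_64bit_hash=True)   # 2080 bytes
```

Under these assumptions, a corpus of one billion documents needs about 1.04 TB of raw signature data in 32-bit mode and about 2.08 TB in 64-bit mode.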

The choice of 260 hashes is designed to work with common LSH band configurations (e.g., 20 bands of 13 hashes each), providing a good balance between detection sensitivity and computational efficiency.
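To see what a 20 × 13 banding implies, the standard LSH analysis can be sketched as follows. Under that analysis, two documents with Jaccard similarity s agree on all r values of at least one of b bands with probability 1 − (1 − s^r)^b. This is the textbook formula, not a description of how NeMo Curator's LSH stage is implemented, and the function names are hypothetical.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two documents with similarity `s` collide in at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def approx_threshold(bands: int, rows: int) -> float:
    """Rule-of-thumb similarity at which the S-curve rises steeply: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


bands, rows = 20, 13            # 20 bands of 13 hashes each = 260 hashes
threshold = approx_threshold(bands, rows)

# Tabulate the S-curve at a few similarity levels.
curve = {s: candidate_probability(s, bands, rows) for s in (0.5, 0.7, 0.8, 0.9)}
```

For this configuration the approximate threshold is about 0.79, so pairs well below that similarity are rarely flagged as candidates while pairs well above it almost always are.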
