Principle:NVIDIA NeMo Curator MinHash Signature Computation
| Attribute | Value |
|---|---|
| Knowledge Sources | Paper: MinHash and LSH, Paper: Deduplicating Training Data |
| Domains | Data_Curation, Deduplication, Hashing |
| Implemented By | NVIDIA_NeMo_Curator_MinHashStage |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
MinHash Signature Computation generates locality-sensitive hash signatures from text, using character n-gram shingling, to support approximate nearest-neighbor search for near-duplicate documents.
Description
MinHash transforms documents into fixed-size hash signatures that preserve Jaccard similarity, enabling efficient near-duplicate detection at scale. The process works in two phases:
- Shingling — Each document is decomposed into a set of overlapping character n-grams (shingles). For example, the text "hello" with n=3 yields the shingle set {"hel", "ell", "llo"}.
- MinHash computation — For each of num_hashes independent hash functions, the minimum hash value across all shingles is computed. The resulting vector of minimum hash values forms the document's MinHash signature.
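The two phases above can be sketched in plain Python. This is a minimal CPU illustration, not NeMo Curator's GPU implementation: the function names (char_shingles, seeded_hash, minhash_signature) are hypothetical, and a salted BLAKE2b digest stands in for whatever hash family the library actually uses.

```python
import hashlib
import struct


def char_shingles(text: str, n: int) -> set[str]:
    """Return the set of overlapping character n-grams in `text`."""
    if len(text) < n:
        # Assumption: a text shorter than n is treated as a single shingle.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(shingle: str, seed: int) -> int:
    """Hypothetical 32-bit hash; each seed selects a different hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=4,
        salt=struct.pack("<Q", seed),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, n: int, num_hashes: int) -> list[int]:
    """For each hash function, keep the minimum hash value over all shingles."""
    shingles = char_shingles(text, n)
    if not shingles:
        raise ValueError("cannot compute a MinHash signature for empty text")
    return [
        min(seeded_hash(s, seed) for s in shingles)
        for seed in range(num_hashes)
    ]


print(sorted(char_shingles("hello", 3)))  # ['ell', 'hel', 'llo']
print(len(minhash_signature("hello world", n=3, num_hashes=16)))  # 16
```

The signature length depends only on num_hashes, never on the document length, which is what makes signatures cheap to store and compare.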
The key property of MinHash is that the probability of two documents sharing the same MinHash value for a given hash function equals the Jaccard similarity between their shingle sets. By computing multiple independent MinHash values, the signature provides an increasingly accurate estimate of the true Jaccard similarity.
In NeMo Curator, MinHash computation is GPU-accelerated using cuDF.Series.str.minhash() (or its 64-bit variant minhash64()), which processes millions of documents per second on modern NVIDIA GPUs.
Usage
MinHash Signature Computation is the second stage in the fuzzy deduplication pipeline, immediately following File Partitioning. It reads document files (JSONL or Parquet), computes MinHash signatures, and writes the results as Parquet files containing the document ID (_curator_dedup_id) and the signature array (_minhash_signature).
from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage

stage = MinHashStage(
    output_path="/output/minhashes/",
    text_field="text",
    char_ngrams=24,
    num_hashes=260,
    seed=42,
    use_64bit_hash=False,
    read_format="jsonl",
)
Theoretical Basis
MinHash provides an unbiased estimator of the Jaccard similarity between two sets:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A and B are sets of character n-gram shingles extracted from two documents. The fundamental theorem states:
$$\Pr\left[\min_{a \in A} h(a) = \min_{b \in B} h(b)\right] = J(A, B)$$
for a hash function h drawn from a min-wise independent family. By computing k independent MinHash values, the fraction of matching values between two signatures provides an estimate of Jaccard similarity with standard error $\sqrt{J(1 - J)/k}$, which is at most $1/(2\sqrt{k})$.
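The estimator can be checked numerically with a small sketch. It compares the exact Jaccard similarity of two shingle sets with the fraction of matching signature positions, and prints the binomial standard error sqrt(J(1 - J)/k). The helper names and the two example sentences are illustrative assumptions, and salted BLAKE2b digests only approximate a min-wise independent family.

```python
import hashlib
import math


def shingles(text, n):
    """Set of overlapping character n-grams."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def signature(items, k):
    """k minimum hash values, one per salted hash function (an approximation)."""
    def h(item, seed):
        d = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=seed.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(d, "little")
    return [min(h(x, seed) for x in items) for seed in range(k)]


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def estimate(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = shingles("the quick brown fox jumps over the lazy dog", 5)
b = shingles("the quick brown fox jumped over the lazy dog", 5)
k = 260

exact = jaccard(a, b)
approx = estimate(signature(a, k), signature(b, k))
std_err = math.sqrt(exact * (1 - exact) / k)
print(f"exact={exact:.3f} estimate={approx:.3f} std_err={std_err:.3f}")
```

Because each signature position matches with probability J, the match count is binomial, so quadrupling k halves the standard error.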
Key parameters and their effects:
- char_ngrams (default 24) — The length of character n-grams used for shingling. Longer shingles are more sensitive to small differences, making the deduplication more precise but less tolerant of minor variations.
- num_hashes (default 260) — The number of independent hash functions. More hashes improve the accuracy of the similarity estimate but increase signature storage and computation cost.
- use_64bit_hash — Using 64-bit hashes reduces the probability of hash collisions at the cost of doubling the signature size.
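The parameter effects listed above can be illustrated with a short sketch. It measures how the exact Jaccard similarity of two near-duplicate sentences changes with the shingle length, and how the raw signature size changes with the hash width. The example sentences and helper names are hypothetical, and the byte count assumes 4 bytes per 32-bit hash and 8 bytes per 64-bit hash with no storage overhead.

```python
def char_shingles(text, n):
    """Set of overlapping character n-grams."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b)


def signature_bytes(num_hashes, use_64bit_hash):
    """Raw size of one signature, ignoring any file-format overhead."""
    return num_hashes * (8 if use_64bit_hash else 4)


doc_a = "MinHash signatures preserve Jaccard similarity between documents."
doc_b = "MinHash signatures preserve Jaccard similarity between document sets."

# The same small edit touches a larger share of the shingles as n grows.
for n in (3, 8, 24):
    sim = jaccard(char_shingles(doc_a, n), char_shingles(doc_b, n))
    print(f"char_ngrams={n:>2}  jaccard={sim:.3f}")

print(signature_bytes(260, use_64bit_hash=False))  # 1040
print(signature_bytes(260, use_64bit_hash=True))   # 2080
```

On short texts like these the effect of a long shingle is exaggerated; on full documents a single edit changes a much smaller fraction of the shingle set.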
The choice of 260 hashes is designed to work with common LSH band configurations (e.g., 20 bands of 13 hashes each), providing a good balance between detection sensitivity and computational efficiency.
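To see what the 20 x 13 split implies, the sketch below uses the standard LSH banding analysis, which is an assumption here rather than something this page states: a pair becomes a candidate when all hashes agree in at least one band, so a pair with Jaccard similarity s is a candidate with probability 1 - (1 - s^r)^b for b bands of r hashes. The function names are hypothetical.

```python
def candidate_probability(similarity, num_bands=20, hashes_per_band=13):
    """Probability that two signatures agree on every hash in at least one band."""
    band_match = similarity ** hashes_per_band
    return 1.0 - (1.0 - band_match) ** num_bands


def approximate_threshold(num_bands=20, hashes_per_band=13):
    """Similarity near which the candidate probability rises most steeply."""
    return (1.0 / num_bands) ** (1.0 / hashes_per_band)


# 20 bands of 13 hashes consume exactly 260 signature values.
assert 20 * 13 == 260

print(f"threshold ~ {approximate_threshold():.3f}")
for s in (0.5, 0.7, 0.8, 0.9):
    print(f"similarity={s:.1f}  P(candidate)={candidate_probability(s):.4f}")
```

Under this model, pairs well below a similarity of about 0.8 are rarely bucketed together, while pairs well above it almost always are.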
Related Pages
- Implementation:NVIDIA_NeMo_Curator_MinHashStage
- NVIDIA_NeMo_Curator_Locality_Sensitive_Hashing — The next stage that uses MinHash signatures for LSH bucketing
- NVIDIA_NeMo_Curator_File_Partitioning — The preceding stage that partitions input files
- NVIDIA_NeMo_Curator_Text_Deduplication — The parent concept covering all deduplication techniques