Implementation:NVIDIA NeMo Curator MinHashStage
| Attribute | Value |
|---|---|
| Domains | Data_Curation, Deduplication, Hashing |
| Implements | NVIDIA_NeMo_Curator_MinHash_Signature_Computation |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
MinHashStage is the NeMo Curator processing stage that computes locality-sensitive MinHash signatures from document text using GPU-accelerated character n-gram shingling.
Description
MinHashStage implements the ProcessingStage[FileGroupTask, FileGroupTask] interface and mixes in DeduplicationIO for standardized deduplication file handling. It reads document files (JSONL or Parquet), extracts the text field, computes MinHash signatures with cudf.Series.str.minhash(), and writes Parquet files containing each document's unique ID and its MinHash signature array.
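To make the signature computation concrete, here is a minimal CPU-only sketch of MinHash over character n-grams. It is not the cuDF kernel the stage calls: the helper names `char_ngrams` and `minhash_signature` are hypothetical, and salted BLAKE2b from the standard library stands in for whatever hash family the GPU implementation uses, so the values will not match the stage's output.

```python
import hashlib


def char_ngrams(text: str, n: int) -> set[str]:
    """Return the set of overlapping character n-grams (shingles) in text."""
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def minhash_signature(
    text: str, n: int = 24, num_hashes: int = 260, seed: int = 42
) -> list[int]:
    """Illustrative MinHash: one minimum per seeded hash function.

    Assumption: salted BLAKE2b (32-bit digest) is a stand-in hash family,
    chosen only because it ships with Python.
    """
    shingles = char_ngrams(text, n)
    if not shingles:
        raise ValueError("text is shorter than the n-gram length")

    signature = []
    for k in range(num_hashes):
        # Each salt defines a different hash function over the shingles.
        salt = (seed + k).to_bytes(8, "little")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        s.encode("utf-8"), digest_size=4, salt=salt
                    ).digest(),
                    "little",
                )
                for s in shingles
            )
        )
    return signature


sig = minhash_signature(
    "The quick brown fox jumps over the lazy dog.", n=5, num_hashes=16
)
print(len(sig))  # one value per hash function
```

The signature length depends only on num_hashes, not on document length, which is what makes the output a fixed-size column.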
The stage assigns each document a unique _curator_dedup_id (a 64-bit integer combining file index and row index) and computes a fixed-size hash signature stored in the _minhash_signature column. These signatures are consumed by the downstream LSH stage for bucket assignment.
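The signatures are useful downstream because the fraction of positions at which two signatures agree estimates the Jaccard similarity of the underlying shingle sets. The sketch below shows that comparison on toy signatures; `estimate_jaccard` is a hypothetical helper for illustration, not part of NeMo Curator, and the LSH stage itself buckets signatures rather than comparing them pairwise.

```python
def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    if not sig_a:
        raise ValueError("signatures must be non-empty")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Toy signatures that agree at 3 of 4 positions.
print(estimate_jaccard([11, 22, 33, 44], [11, 22, 99, 44]))  # 0.75
```

More hash values per signature reduce the variance of this estimate at the cost of larger output files.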
Usage
from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage
minhash_stage = MinHashStage(
    output_path="/output/minhashes/",
    text_field="text",
    char_ngrams=24,
    num_hashes=260,
    seed=42,
    use_64bit_hash=False,
    read_format="jsonl",
)
# Execute within a pipeline
output_tasks = minhash_stage.process(file_group_task)
Code Reference
Source Location
nemo_curator/stages/deduplication/fuzzy/minhash.py, lines 179–341.
Signature
class MinHashStage(ProcessingStage[FileGroupTask, FileGroupTask], DeduplicationIO):
    def __init__(
        self,
        output_path: str,
        text_field: str = "text",
        char_ngrams: int = 24,
        num_hashes: int = 260,
        seed: int = 42,
        use_64bit_hash: bool = False,
        read_format: Literal["jsonl", "parquet"] = "jsonl",
        ...
    )
Import
from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | FileGroupTask | A task whose .data contains a list of document file paths (JSONL or Parquet) |
| Output | FileGroupTask | A task whose .data contains paths to Parquet files with _curator_dedup_id and _minhash_signature columns |
| Output Column | _curator_dedup_id | 64-bit integer uniquely identifying each document (combines file index + row index) |
| Output Column | _minhash_signature | Array of num_hashes hash values representing the document's MinHash signature |
| Parameters | text_field | Name of the column containing document text (default: "text") |
| Parameters | char_ngrams | Length of character n-grams for shingling (default: 24) |
| Parameters | num_hashes | Number of MinHash values per signature (default: 260) |
| Parameters | seed | Random seed for hash function generation (default: 42) |
| Parameters | use_64bit_hash | Whether to use 64-bit hash values instead of 32-bit (default: False) |
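The num_hashes and use_64bit_hash parameters together determine how much raw data each signature carries. As back-of-the-envelope arithmetic, assuming each hash value is stored as a fixed-width 4-byte or 8-byte integer and ignoring Parquet encoding, compression, and metadata, the per-document payload can be sketched as follows (`signature_nbytes` is a hypothetical helper):

```python
def signature_nbytes(num_hashes: int = 260, use_64bit_hash: bool = False) -> int:
    """Raw bytes per signature: num_hashes values of 4 or 8 bytes each."""
    bytes_per_hash = 8 if use_64bit_hash else 4
    return num_hashes * bytes_per_hash


print(signature_nbytes())                          # 1040
print(signature_nbytes(use_64bit_hash=True))       # 2080
print(signature_nbytes(512, use_64bit_hash=True))  # 4096
```

Actual on-disk size will differ, but the ratio shows that switching to 64-bit hashes doubles the raw signature payload for a given num_hashes.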
Usage Examples
Example 1: Basic MinHash computation on JSONL files
from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage
stage = MinHashStage(
    output_path="/output/minhashes/",
    text_field="text",
    char_ngrams=24,
    num_hashes=260,
    seed=42,
    read_format="jsonl",
)
Example 2: High-precision MinHash with 64-bit hashes
from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage
stage = MinHashStage(
    output_path="/output/minhashes_64bit/",
    text_field="content",
    char_ngrams=32,
    num_hashes=512,
    use_64bit_hash=True,
    read_format="parquet",
)
Example 3: MinHash for short-text deduplication
from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage
stage = MinHashStage(
    output_path="/output/minhashes_short/",
    text_field="text",
    char_ngrams=5,  # shorter shingles for short texts
    num_hashes=128,
    seed=123,
    read_format="jsonl",
)
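The reason shorter shingles suit short texts comes down to counting: a string of length L contains L - n + 1 overlapping character n-grams, and none at all when L < n. The sketch below only illustrates that arithmetic; `count_char_ngrams` is a hypothetical helper, and how the GPU implementation treats texts shorter than the n-gram width is not covered here.

```python
def count_char_ngrams(text: str, n: int) -> int:
    """Number of overlapping character n-grams in text (0 if text is too short)."""
    return max(len(text) - n + 1, 0)


short_text = "GPU dedup is fast"  # 17 characters
print(count_char_ngrams(short_text, 24))  # 0
print(count_char_ngrams(short_text, 5))   # 13
```

With the default width of 24, this 17-character string yields no shingles to hash, whereas a width of 5 yields 13.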
Related Pages
- Principle:NVIDIA_NeMo_Curator_MinHash_Signature_Computation
- NVIDIA_NeMo_Curator_FilePartitioningStage: Upstream stage that produces FileGroupTask inputs
- NVIDIA_NeMo_Curator_LSHStage: Downstream stage that consumes MinHash signatures for bucketing
- NVIDIA_NeMo_Curator_FuzzyDeduplicationWorkflow: The parent workflow orchestrating all stages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- Environment:NVIDIA_NeMo_Curator_RAPIDS_GPU_Stack
- Environment:NVIDIA_NeMo_Curator_Ray_Cluster
- Heuristic:NVIDIA_NeMo_Curator_Deduplication_Blocksize_Tuning