Implementation:NVIDIA NeMo Curator MinHashStage

From Leeroopedia
Implementation Metadata
Domains: Data_Curation, Deduplication, Hashing
Implements: NVIDIA_NeMo_Curator_MinHash_Signature_Computation
Last Updated: 2026-02-14 17:00 GMT

Overview

MinHashStage is the NeMo Curator processing stage that computes locality-sensitive MinHash signatures from document text using GPU-accelerated character n-gram shingling.

Description

MinHashStage implements the ProcessingStage[FileGroupTask, FileGroupTask] interface and mixes in DeduplicationIO for standardized deduplication file handling. It reads document files (JSONL or Parquet), extracts the text field, computes MinHash signatures with cuDF's Series.str.minhash(), and writes Parquet files containing each document's unique ID and its MinHash signature array.
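To make the signature computation concrete, here is a minimal CPU-only sketch of MinHash over character n-grams. It is not the cuDF kernel used by the stage: the hash family (SHA-1 base hash plus seeded linear permutations), the helper names, and the handling of texts shorter than the n-gram width are all assumptions chosen for illustration.

```python
import hashlib
import random

# Illustrative constants; the real GPU implementation may use a different hash family.
MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH_32 = (1 << 32) - 1


def char_shingles(text: str, n: int) -> set[str]:
    """Return the set of overlapping character n-grams of `text`."""
    return {text[i : i + n] for i in range(max(len(text) - n + 1, 0))}


def base_hash(shingle: str) -> int:
    """Deterministic 32-bit hash of one shingle (first 4 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(shingle.encode("utf-8")).digest()[:4], "big")


def minhash_signature(text: str, n: int = 24, num_hashes: int = 260, seed: int = 42) -> list[int]:
    """Compute `num_hashes` min-wise hash values over the text's character n-grams."""
    rng = random.Random(seed)  # the seed fixes the hash family, so signatures are comparable
    coeffs = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_hashes)
    ]
    hashes = [base_hash(s) for s in char_shingles(text, n)]
    if not hashes:
        # Assumption: a text with no full shingle gets a sentinel signature.
        return [MAX_HASH_32] * num_hashes
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH_32 for h in hashes)
        for a, b in coeffs
    ]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, which estimates the Jaccard similarity of the shingle sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents hashed with the same n, num_hashes, and seed can be compared position by position; a larger num_hashes lowers the variance of the similarity estimate at the cost of a larger signature.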

The stage assigns each document a unique _curator_dedup_id (a 64-bit integer combining file index and row index) and computes a fixed-size hash signature stored in the _minhash_signature column. These signatures are consumed by the downstream LSH stage for bucket assignment.
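One way to combine a file index and a row index into a single 64-bit integer is bit packing. The sketch below is hypothetical: the page does not specify the actual bit layout of _curator_dedup_id, so the 31/32-bit split and the function names pack_dedup_id and unpack_dedup_id are assumptions made only to illustrate the idea.

```python
# Hypothetical layout: high bits hold the file index, low 32 bits hold the row index.
# The real layout used by NeMo Curator is not documented on this page.
ROW_BITS = 32
FILE_BITS = 31  # keeps the packed value inside the signed 64-bit range
ROW_MASK = (1 << ROW_BITS) - 1


def pack_dedup_id(file_index: int, row_index: int) -> int:
    """Pack (file_index, row_index) into one non-negative integer below 2**63."""
    if not 0 <= file_index < (1 << FILE_BITS):
        raise ValueError("file_index out of range for the assumed layout")
    if not 0 <= row_index <= ROW_MASK:
        raise ValueError("row_index out of range for the assumed layout")
    return (file_index << ROW_BITS) | row_index


def unpack_dedup_id(dedup_id: int) -> tuple[int, int]:
    """Recover (file_index, row_index) from a packed ID."""
    return dedup_id >> ROW_BITS, dedup_id & ROW_MASK
```

Because each (file, row) pair maps to a distinct integer, IDs stay unique across files without any coordination between workers, which is the property the downstream stages rely on.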

Usage

from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage

minhash_stage = MinHashStage(
    output_path="/output/minhashes/",
    text_field="text",
    char_ngrams=24,
    num_hashes=260,
    seed=42,
    use_64bit_hash=False,
    read_format="jsonl",
)

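# In a pipeline the executor calls process(); here file_group_task stands for
# a FileGroupTask that lists the input files (it is not defined in this snippet).
output_tasks = minhash_stage.process(file_group_task)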

Code Reference

Source Location

nemo_curator/stages/deduplication/fuzzy/minhash.py, lines 179–341.

Signature

class MinHashStage(ProcessingStage[FileGroupTask, FileGroupTask], DeduplicationIO):
    def __init__(
        self,
        output_path: str,
        text_field: str = "text",
        char_ngrams: int = 24,
        num_hashes: int = 260,
        seed: int = 42,
        use_64bit_hash: bool = False,
        read_format: Literal["jsonl", "parquet"] = "jsonl",
        ...
    )

Import

from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage

I/O Contract

Direction | Type or Name | Description
Input | FileGroupTask | A task whose .data contains a list of document file paths (JSONL or Parquet)
Output | FileGroupTask | A task whose .data contains paths to Parquet files with _curator_dedup_id and _minhash_signature columns
Output Column | _curator_dedup_id | 64-bit integer uniquely identifying each document (combines file index + row index)
Output Column | _minhash_signature | Array of num_hashes hash values representing the document's MinHash signature
Parameter | text_field | Name of the column containing document text (default: "text")
Parameter | char_ngrams | Length of character n-grams for shingling (default: 24)
Parameter | num_hashes | Number of MinHash values per signature (default: 260)
Parameter | seed | Random seed for hash function generation (default: 42)
Parameter | use_64bit_hash | Whether to use 64-bit hash values instead of 32-bit (default: False)
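The contract above implies a fixed signature size per document: num_hashes values of 4 bytes each (or 8 bytes with use_64bit_hash=True). The sketch below computes that footprint and checks a single output record against the contract. The helper names signature_bytes and check_record are hypothetical, and the check assumes non-negative IDs and unsigned hash values stored as plain Python integers.

```python
def signature_bytes(num_hashes: int = 260, use_64bit_hash: bool = False) -> int:
    """Raw size in bytes of one signature, ignoring Parquet encoding and compression."""
    return num_hashes * (8 if use_64bit_hash else 4)


def check_record(record: dict, num_hashes: int = 260, use_64bit_hash: bool = False) -> None:
    """Raise ValueError if `record` does not match the documented output columns."""
    dedup_id = record["_curator_dedup_id"]
    signature = record["_minhash_signature"]
    # Assumption: IDs are non-negative and fit in 64 bits.
    if not isinstance(dedup_id, int) or not 0 <= dedup_id < (1 << 64):
        raise ValueError("_curator_dedup_id must be a 64-bit integer")
    if len(signature) != num_hashes:
        raise ValueError(f"expected {num_hashes} hash values, got {len(signature)}")
    # Assumption: hash values are unsigned and bounded by the chosen bit width.
    max_value = (1 << (64 if use_64bit_hash else 32)) - 1
    if any(not 0 <= value <= max_value for value in signature):
        raise ValueError("hash value outside the range of the chosen bit width")
```

With the defaults, each signature occupies 260 × 4 = 1040 bytes before compression, so storage grows linearly with both the document count and num_hashes.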

Usage Examples

Example 1: Basic MinHash computation on JSONL files

from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage

stage = MinHashStage(
    output_path="/output/minhashes/",
    text_field="text",
    char_ngrams=24,
    num_hashes=260,
    seed=42,
    read_format="jsonl",
)

Example 2: High-precision MinHash with 64-bit hashes

from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage

stage = MinHashStage(
    output_path="/output/minhashes_64bit/",
    text_field="content",
    char_ngrams=32,
    num_hashes=512,
    use_64bit_hash=True,
    read_format="parquet",
)

Example 3: MinHash for short-text deduplication

from nemo_curator.stages.deduplication.fuzzy.minhash import MinHashStage

stage = MinHashStage(
    output_path="/output/minhashes_short/",
    text_field="text",
    char_ngrams=5,       # shorter shingles for short texts
    num_hashes=128,
    seed=123,
    read_format="jsonl",
)
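The reason for choosing shorter shingles on short texts comes down to simple arithmetic: a text of length L has at most L - n + 1 overlapping character n-grams. The helper below is a hypothetical illustration of that count, not part of NeMo Curator, and how the GPU kernel treats strings shorter than the n-gram width is not covered here.

```python
def num_shingle_windows(text: str, char_ngrams: int) -> int:
    """Number of overlapping character windows of width `char_ngrams` in `text`."""
    return max(len(text) - char_ngrams + 1, 0)


if __name__ == "__main__":
    short_text = "GPU dedup is fast"
    for width in (24, 5):
        count = num_shingle_windows(short_text, width)
        print(f"char_ngrams={width}: {count} windows")
```

When the window count is zero or very small, the signature carries little information about the text, so near-duplicates among short documents become hard to distinguish; a smaller char_ngrams value gives each short text more shingles to hash.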
