Implementation: NVIDIA NeMo Curator FuzzyDeduplicationWorkflow
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, NLP, Deduplication |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete workflow for GPU-accelerated fuzzy deduplication using MinHash and LSH provided by NeMo Curator.
Description
The FuzzyDeduplicationWorkflow orchestrates a multi-stage pipeline for near-duplicate detection: FilePartitioning → MinHash signature computation → LSH bucketing → Bucket-to-edge conversion → Connected components → Duplicate identification. It uses RAPIDS cuDF for GPU acceleration and Ray for distributed execution. NeMo Curator also provides the related ExactDeduplicationWorkflow (exact deduplication) and TextDuplicatesRemovalWorkflow (removal of identified duplicates).
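The MinHash and LSH stages can be illustrated with a small pure-Python sketch. This is a toy CPU version for intuition only, not the GPU-accelerated implementation; the shingle width, hash count, and band layout here are scaled down from the workflow's defaults.

```python
import hashlib
from collections import defaultdict

def shingles(text, width=5):
    # Character n-gram shingles (the workflow defaults to width 24)
    return {text[i:i + width] for i in range(max(1, len(text) - width + 1))}

def minhash_signature(text, num_hashes=8, width=5):
    # One value per "permutation": the minimum hash over all shingles
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text, width))
        for seed in range(num_hashes)
    ]

def lsh_buckets(docs, num_bands=4, rows=2):
    # Documents whose signatures agree on any full band share a bucket
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(text, num_hashes=num_bands * rows)
        for b in range(num_bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(doc_id)
    # Keep only buckets that produce candidate duplicate pairs
    return [ids for ids in buckets.values() if len(ids) > 1]

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely unrelated text about parquet files",
}
print(lsh_buckets(docs))
```

In the real workflow, these candidate buckets are converted to graph edges and resolved with connected components, so that entire clusters of near-duplicates are grouped rather than just pairs.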
Usage
Use this workflow when you need to remove near-duplicate documents from a large text corpus. Configure char_ngrams, num_bands, and minhashes_per_band to control the precision/recall tradeoff: more bands raise recall (more candidate pairs survive), while more hashes per band raise precision (each band match becomes stricter).
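The tradeoff follows the standard LSH banding formula: two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b, where b is num_bands and r is minhashes_per_band. A quick sketch with the workflow's defaults (b=20, r=13):

```python
def bucket_match_probability(s, num_bands=20, minhashes_per_band=13):
    # Probability that two documents with Jaccard similarity s
    # agree on at least one full band of MinHash values
    return 1 - (1 - s ** minhashes_per_band) ** num_bands

# The S-curve shows a sharp threshold around s ~ 0.8
for s in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(f"s={s:.2f} -> P(candidate)={bucket_match_probability(s):.4f}")
```

With the defaults, pairs below ~0.5 similarity are almost never candidates while pairs above ~0.9 almost always are, which is why tuning num_bands and minhashes_per_band shifts the effective similarity threshold.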
Code Reference
Source Location
- Repository: NeMo Curator
- File: nemo_curator/stages/deduplication/fuzzy/workflow.py
- Lines: L41-373
Signature
class FuzzyDeduplicationWorkflow(WorkflowBase):
def __init__(
self,
cache_path: str,
output_path: str,
input_path: str,
char_ngrams: int = 24,
num_bands: int = 20,
minhashes_per_band: int = 13,
text_field: str = "text",
seed: int = 42,
use_64bit_hash: bool = False,
assign_id: bool = True,
):
"""
Args:
cache_path: Intermediate storage for dedup stages.
output_path: Where to write duplicate IDs.
input_path: Path to input documents.
char_ngrams: Character n-gram shingle width (default 24).
num_bands: Number of LSH bands (default 20).
minhashes_per_band: Hashes per band (default 13, total 260).
text_field: Column name for text content.
seed: Random seed for MinHash permutations.
use_64bit_hash: Use 64-bit vs 32-bit hashing.
assign_id: Assign curator dedup IDs if not present.
"""
Import
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_path | str | Yes | Path to input document files (parquet/jsonl) |
Outputs
| Name | Type | Description |
|---|---|---|
| result | WorkflowRunResult | Contains total_time, num_duplicates, id_generator_path |
| duplicate_ids | Parquet files | Written to output_path with document IDs to remove |
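Applying the output is an anti-join: drop every document whose ID appears in the duplicate-ID Parquet files. A minimal sketch with toy in-memory frames standing in for the workflow's outputs; the ID column name here ("id") is an assumption — inspect the schema written by your run before relying on it.

```python
import pandas as pd

# Toy stand-ins: in practice these come from pd.read_parquet on the
# input corpus and on the files under output_path.
docs = pd.DataFrame({"id": [0, 1, 2, 3], "text": ["a", "b", "a", "c"]})
dup_ids = pd.DataFrame({"id": [2]})  # IDs flagged for removal

# Anti-join: keep only documents not flagged as duplicates
kept = docs[~docs["id"].isin(set(dup_ids["id"]))]
print(kept["id"].tolist())  # [0, 1, 3]
```

For large corpora, the same anti-join can be done distributed (e.g. with cuDF or the removal workflow NeMo Curator provides) rather than in a single pandas process.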
Usage Examples
Basic Fuzzy Deduplication
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
# Create and run fuzzy dedup workflow
workflow = FuzzyDeduplicationWorkflow(
cache_path="./cache/fuzzy_dedup",
output_path="./output/duplicate_ids",
input_path="./data/documents",
char_ngrams=24,
num_bands=20,
minhashes_per_band=13,
)
result = workflow.run()
print(f"Found {result.num_duplicates} duplicates in {result.total_time}s")