
Implementation:NVIDIA NeMo Curator FuzzyDeduplicationWorkflow

From Leeroopedia
Knowledge Sources
Domains Data_Curation, NLP, Deduplication
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete NeMo Curator workflow for GPU-accelerated fuzzy deduplication using MinHash signatures and locality-sensitive hashing (LSH).

Description

The FuzzyDeduplicationWorkflow orchestrates a multi-stage pipeline for near-duplicate detection: FilePartitioning → MinHash signature computation → LSH bucketing → Bucket-to-edge conversion → Connected components → Duplicate identification. It uses RAPIDS cuDF for GPU acceleration and Ray for distributed execution. The companion ExactDeduplicationWorkflow and TextDuplicatesRemovalWorkflow cover exact deduplication and the removal of identified duplicates, respectively.
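To make the MinHash and LSH stages concrete, here is a minimal CPU-only sketch of the core idea: shingle each document into character n-grams, compute one min-hash per seeded hash function, split the signature into bands, and treat any exact band match as a candidate duplicate pair. This is an illustration of the technique, not NeMo Curator's cuDF/Ray implementation; the hash functions and bucket keys here are stand-ins.

```python
import hashlib

def char_ngrams(text, n=24):
    # Character shingles, as controlled by char_ngrams in the workflow
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_bands=20, minhashes_per_band=13, n=24):
    # num_bands * minhashes_per_band hash functions (20 * 13 = 260 by default);
    # each seeded blake2b stands in for one random hash permutation
    shingles = char_ngrams(text, n)
    num_hashes = num_bands * minhashes_per_band
    return [
        min(int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(4, "little")).digest(),
                "little")
            for s in shingles)
        for seed in range(num_hashes)
    ]

def lsh_bucket_keys(sig, num_bands=20, minhashes_per_band=13):
    # Two documents become a candidate pair if any band matches exactly
    r = minhashes_per_band
    return {(b, tuple(sig[b * r:(b + 1) * r])) for b in range(num_bands)}

doc = "GPU-accelerated fuzzy deduplication finds near-duplicate documents " * 3
dup = doc  # an exact copy yields an identical signature, so all bands collide
shared = lsh_bucket_keys(minhash_signature(doc)) & lsh_bucket_keys(minhash_signature(dup))
print(len(shared))  # 20
```

In the real workflow these candidate pairs become graph edges, and connected components group mutually similar documents before duplicates are identified.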

Usage

Use this workflow when you need to remove near-duplicate documents from a large text corpus. Tune char_ngrams, num_bands, and minhashes_per_band to control the precision/recall tradeoff: more bands (or fewer hashes per band) lowers the effective similarity threshold and raises recall, at the cost of more false-positive candidate pairs.
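The tradeoff can be quantified with the standard MinHash-LSH S-curve analysis (general LSH theory, not specific to NeMo Curator): for Jaccard similarity s, r min-hashes per band, and b bands, the probability that two documents land in at least one common bucket is 1 - (1 - s^r)^b, and the curve is steepest near (1/b)^(1/r).

```python
def candidate_probability(jaccard, num_bands=20, minhashes_per_band=13):
    """P(two docs share at least one LSH bucket) under the S-curve model:
    1 - (1 - s^r)^b for Jaccard similarity s, r rows per band, b bands."""
    return 1.0 - (1.0 - jaccard ** minhashes_per_band) ** num_bands

# Approximate similarity threshold where the curve is steepest: (1/b)^(1/r)
threshold = (1 / 20) ** (1 / 13)
print(f"threshold ≈ {threshold:.2f}")  # threshold ≈ 0.79
for s in (0.6, 0.7, 0.8, 0.9):
    print(f"s={s}: P(candidate)={candidate_probability(s):.3f}")
# s=0.6: 0.026   s=0.7: 0.177   s=0.8: 0.677   s=0.9: 0.997
```

So the default 20 × 13 configuration targets pairs with Jaccard similarity of roughly 0.8 and above; widening the bands or adding bands shifts this threshold.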

Code Reference

Source Location

  • Repository: NeMo Curator
  • File: nemo_curator/stages/deduplication/fuzzy/workflow.py
  • Lines: L41-373

Signature

class FuzzyDeduplicationWorkflow(WorkflowBase):
    def __init__(
        self,
        cache_path: str,
        output_path: str,
        input_path: str,
        char_ngrams: int = 24,
        num_bands: int = 20,
        minhashes_per_band: int = 13,
        text_field: str = "text",
        seed: int = 42,
        use_64bit_hash: bool = False,
        assign_id: bool = True,
    ):
        """
        Args:
            cache_path: Intermediate storage for dedup stages.
            output_path: Where to write duplicate IDs.
            input_path: Path to input documents.
            char_ngrams: Character n-gram shingle width (default 24).
            num_bands: Number of LSH bands (default 20).
            minhashes_per_band: Hashes per band (default 13, total 260).
            text_field: Column name for text content.
            seed: Random seed for MinHash permutations.
            use_64bit_hash: Use 64-bit vs 32-bit hashing.
            assign_id: Assign curator dedup IDs if not present.
        """

Import

from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

I/O Contract

Inputs

Name Type Required Description
input_path str Yes Path to input document files (parquet/jsonl)

Outputs

Name Type Description
result WorkflowRunResult Contains total_time, num_duplicates, id_generator_path
duplicate_ids Parquet files Written to output_path with document IDs to remove
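The duplicate-ID Parquet files can then be joined against the corpus to drop flagged rows. The sketch below uses in-memory pandas DataFrames as stand-ins for the Parquet files under output_path; the column name "id" is illustrative only, since the actual ID column is assigned by the workflow (see assign_id) and may differ.

```python
import pandas as pd

# Stand-ins for pd.read_parquet(output_path) and pd.read_parquet(input_path);
# the "id" column name here is hypothetical, not the workflow's actual schema
duplicate_ids = pd.DataFrame({"id": [101, 103]})
corpus = pd.DataFrame({
    "id": [100, 101, 102, 103],
    "text": ["alpha", "alpha copy", "beta", "beta copy"],
})

# Keep only rows whose ID was not flagged as a duplicate
deduped = corpus[~corpus["id"].isin(duplicate_ids["id"])]
print(deduped["id"].tolist())  # [100, 102]
```

For exact-match removal at scale, the TextDuplicatesRemovalWorkflow mentioned above performs this filtering step as part of the pipeline.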

Usage Examples

Basic Fuzzy Deduplication

from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Create and run fuzzy dedup workflow
workflow = FuzzyDeduplicationWorkflow(
    cache_path="./cache/fuzzy_dedup",
    output_path="./output/duplicate_ids",
    input_path="./data/documents",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
)

result = workflow.run()
print(f"Found {result.num_duplicates} duplicates in {result.total_time}s")

Related Pages

Implements Principle
