Workflow: NVIDIA NeMo Curator Fuzzy Deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Deduplication |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
End-to-end process for detecting and removing near-duplicate text documents using MinHash-based Locality-Sensitive Hashing with GPU-accelerated connected component analysis.
Description
This workflow implements the FuzzyDeduplicationWorkflow which performs near-duplicate detection on text datasets using the MinHash + LSH algorithm family. The process computes MinHash signatures from character n-gram shingles for each document, groups documents into candidate duplicate buckets using Locality-Sensitive Hashing, converts bucket memberships into an edge graph, finds connected components (duplicate clusters) using GPU-accelerated graph algorithms, and produces a list of document IDs to remove. The workflow is implemented as a WorkflowBase subclass that orchestrates three internal sub-pipelines: a MinHash pipeline, an LSH pipeline, and a connected components pipeline. All stages use GPU-accelerated processing via RAPIDS cuDF and are executed on a RayActorPoolExecutor.
Usage
Execute this workflow when you have a text dataset (JSONL or Parquet) and need to identify and remove near-duplicate documents that share similar but not identical content. This is the standard approach for deduplicating web-crawled text data where exact duplicates have already been removed but near-duplicates (e.g., slightly reformatted or partially modified documents) remain.
Execution Steps
Step 1: File Partitioning
Discover and group input dataset files into balanced partitions for parallel processing. The FilePartitioningStage scans the input path for files matching the configured file type (JSONL or Parquet), computes file sizes, and creates FileGroupTask objects where each task contains a balanced set of files. The blocksize parameter controls the target size of each partition (default: 1 GiB).
Key considerations:
- Supports both local and remote (S3-compatible) file paths via fsspec
- File extensions can be explicitly configured or inferred from the input file type
- Blocksize controls the granularity of work distribution across workers
- This step is skipped if initial tasks are provided from a previous pipeline stage
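The exact packing strategy of FilePartitioningStage is not shown here; a minimal greedy sketch of size-balanced grouping under an assumed `(path, size)` input format illustrates the idea:

```python
# Illustrative sketch only: greedily pack files into groups whose total size
# approaches a target blocksize, largest files first. A file larger than the
# blocksize gets its own group.

def partition_files(files, blocksize):
    """Group (path, size_bytes) pairs into partitions of roughly `blocksize` bytes."""
    groups, current, current_size = [], [], 0
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > blocksize:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups

files = [("a.jsonl", 700), ("b.jsonl", 400), ("c.jsonl", 300), ("d.jsonl", 600)]
partitions = partition_files(files, blocksize=1000)
```

With a 1000-byte blocksize, the four files above pack into three partitions; smaller blocksizes produce more, finer-grained tasks for the worker pool.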
Step 2: MinHash Signature Computation
Compute MinHash signatures for each document in the dataset. The MinHashStage reads documents from the input files, extracts character n-gram shingles of configurable length (default: 24 characters), and computes a set of MinHash values using multiple hash permutations. The number of hashes is determined by num_bands * minhashes_per_band (default: 20 * 13 = 260 hashes). A globally unique document ID is assigned to each document via the ID generator actor. Signatures are written to Parquet files in the cache directory.
Key considerations:
- Character n-grams of at least 20 characters are recommended to limit false positives
- The number of hash permutations controls the tradeoff between accuracy and computation cost
- 64-bit hashing is optional and reduces collision probability for very large datasets
- The ID generator actor maintains a global mapping of document IDs across workers
- Output MinHash signatures are written to cache_path/MinHashStage/
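A CPU-only sketch of the signature computation (the actual MinHashStage runs on GPU via RAPIDS cuDF; the hash function and salting scheme here are illustrative, not the stage's own):

```python
# For each of num_hashes seeded hash functions, keep the minimum hash value
# over all character n-gram shingles of the document. Equal positions between
# two signatures estimate the Jaccard similarity of their shingle sets.
import hashlib

def minhash_signature(text, char_ngram=24, num_hashes=260):
    shingles = {text[i:i + char_ngram]
                for i in range(max(1, len(text) - char_ngram + 1))}
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # one salted hash per "permutation"
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for s in shingles))
    return sig

doc_a = "the quick brown fox jumps over the lazy dog. " * 4
doc_b = "the quick brown fox jumped over the lazy dog. " * 4
sig_a = minhash_signature(doc_a, num_hashes=64)
sig_b = minhash_signature(doc_b, num_hashes=64)
# Fraction of matching positions estimates shingle-set Jaccard similarity.
similarity = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

More hash functions tighten this estimate, which is the accuracy/computation tradeoff noted above.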
Step 3: Locality-Sensitive Hashing
Group documents into candidate duplicate buckets using LSH on their MinHash signatures. The LSHStage reads the computed MinHash signatures from the cache directory, divides them into bands (groups of consecutive hashes), and hashes each band to produce bucket assignments. Documents sharing the same bucket in any band are candidate duplicates. The stage processes bands in configurable iterations (bands_per_iteration) to manage memory usage during the shuffle operation that redistributes data by bucket ID.
Key considerations:
- The number of bands and hashes per band control the sensitivity of duplicate detection
- More bands increase recall (finding more duplicates) at the cost of more candidate pairs
- The bands_per_iteration parameter controls memory usage during shuffling
- RMM pool size and spill memory limits are auto-configured for GPU memory management
- Output bucket assignments are written to cache_path/LSHStage/
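The banding step can be sketched as follows (a minimal in-memory version; the real LSHStage additionally shuffles bucket IDs across GPUs, and its band-hashing function is an internal detail):

```python
# Split each signature into num_bands bands of minhashes_per_band consecutive
# values and hash every band to a bucket key. Documents sharing any
# (band, bucket_key) pair become candidate duplicates.
from collections import defaultdict

def lsh_buckets(signatures, num_bands=20, minhashes_per_band=13):
    """signatures: {doc_id: [minhash values]} -> {(band, bucket_key): [doc_ids]}"""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(num_bands):
            start = band * minhashes_per_band
            key = hash(tuple(sig[start:start + minhashes_per_band]))
            buckets[(band, key)].append(doc_id)
    return buckets

# Tiny example: d1 and d2 agree on band 0 ([1, 2]) and so land in one bucket.
sigs = {"d1": [1, 2, 3, 4], "d2": [1, 2, 9, 9], "d3": [5, 6, 7, 8]}
candidates = lsh_buckets(sigs, num_bands=2, minhashes_per_band=2)
```

By standard LSH analysis, with b bands of r hashes a pair with Jaccard similarity s shares at least one bucket with probability 1 - (1 - s^r)^b; for the defaults (b = 20, r = 13) this S-curve has its threshold near (1/b)^(1/r) ≈ 0.79.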
Step 4: Buckets to Edges Conversion
Convert LSH bucket membership into a graph edge list for connected component analysis. The BucketsToEdgesStage reads the bucket assignments and generates pairs of document IDs that co-occur in the same bucket. Each pair represents an edge in the duplicate graph. This transforms the bucket-centric representation into a graph representation suitable for connected component computation.
Key considerations:
- Large buckets generate many edges (quadratic in bucket size)
- Output edges are written to cache_path/BucketsToEdgesStage/
- Memory requirements scale with the number of candidate duplicate pairs
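Two ways to realize the conversion are sketched below. Emitting all pairs is the literal reading of the step; connecting every member to the bucket's first document is a common linear-size alternative that preserves connectivity for the next stage (whether the actual BucketsToEdgesStage uses it is not stated here, so treat it as an assumption):

```python
# Turn one bucket's membership list into edges of the duplicate graph.
from itertools import combinations

def bucket_to_edges(doc_ids, star=True):
    docs = sorted(doc_ids)
    if star:
        # Linear "star": n-1 edges, same connected components as all pairs.
        return [(docs[0], d) for d in docs[1:]]
    # All co-occurrence pairs: n*(n-1)/2 edges, quadratic in bucket size.
    return list(combinations(docs, 2))
```

The quadratic variant is where the memory caveat above bites: a single 10,000-document bucket alone yields ~50 million edges.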
Step 5: Connected Components
Compute weakly connected components on the duplicate graph to identify clusters of near-duplicate documents. The ConnectedComponentsStage uses GPU-accelerated graph algorithms (via RAFT NCCL communications for multi-GPU) to find all connected components in the edge graph. Each connected component represents a group of documents that are transitively near-duplicates of each other.
Key considerations:
- Uses GPU-accelerated graph algorithms for efficient component computation
- Multi-GPU support via RAFT NCCL communications for large graphs
- Each connected component is assigned a unique cluster ID
- Output component assignments are written to cache_path/ConnectedComponentsStage/
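A single-machine union-find gives the same result the GPU stage computes at scale (this sketch is for intuition only; the actual stage uses GPU graph algorithms over RAFT/NCCL):

```python
# Weakly connected components over an edge list via union-find with path
# halving: every document in a cluster ends up mapped to the same root ID.

def connected_components(edges):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return {node: find(node) for node in parent}

# Edges (1-2, 2-3) chain docs 1..3 into one cluster; (4, 5) form another.
components = connected_components([(1, 2), (2, 3), (4, 5)])
```

Note the transitivity: documents 1 and 3 never shared a bucket directly, yet they end up in the same component via document 2, which is exactly the "transitively near-duplicates" behavior described above.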
Step 6: Duplicate Identification
Generate the final list of document IDs to remove based on connected component membership. The IdentifyDuplicatesStage reads the connected component assignments and selects which documents within each component to keep (one representative) and which to remove (all others). The removal IDs are written to the output directory as Parquet files.
Key considerations:
- One document per connected component is retained as the representative
- Removal IDs are written to output_path for use by downstream removal workflows
- The ID generator mapping is also saved to output_path for ID resolution during removal
- RMM pool size and spill memory limits are auto-configured
- The TextDuplicatesRemovalWorkflow can be used to physically remove identified duplicates
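The selection logic of the final step can be sketched as below; keeping the smallest document ID as the representative is an assumption for illustration, since the workflow only guarantees that exactly one document per component survives:

```python
# Group documents by component ID, retain one representative per component
# (here: the minimum doc ID), and collect everything else for removal.
from collections import defaultdict

def ids_to_remove(components):
    """components: {doc_id: component_id} -> sorted list of doc IDs to drop."""
    clusters = defaultdict(list)
    for doc_id, comp_id in components.items():
        clusters[comp_id].append(doc_id)
    removals = []
    for members in clusters.values():
        keep = min(members)
        removals.extend(d for d in members if d != keep)
    return sorted(removals)
```

In the real workflow this removal list is written as Parquet to output_path, where TextDuplicatesRemovalWorkflow picks it up to drop the documents from the dataset.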