Workflow: NVIDIA NeMo Curator Fuzzy Deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Deduplication |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
End-to-end process for detecting and removing near-duplicate text documents using MinHash-based Locality-Sensitive Hashing with GPU-accelerated connected component analysis.
Description
This workflow implements the FuzzyDeduplicationWorkflow which performs near-duplicate detection on text datasets using the MinHash + LSH algorithm family. The process computes MinHash signatures from character n-gram shingles for each document, groups documents into candidate duplicate buckets using Locality-Sensitive Hashing, converts bucket memberships into an edge graph, finds connected components (duplicate clusters) using GPU-accelerated graph algorithms, and produces a list of document IDs to remove. The workflow is implemented as a WorkflowBase subclass that orchestrates three internal sub-pipelines: a MinHash pipeline, an LSH pipeline, and a connected components pipeline. All stages use GPU-accelerated processing via RAPIDS cuDF and are executed on a RayActorPoolExecutor.
Usage
Execute this workflow when you have a text dataset (JSONL or Parquet) and need to identify and remove near-duplicate documents that share similar but not identical content. This is the standard approach for deduplicating web-crawled text data where exact duplicates have already been removed but near-duplicates (e.g., slightly reformatted or partially modified documents) remain.
Execution Steps
Step 1: File Partitioning
Discover and group input dataset files into balanced partitions for parallel processing. The FilePartitioningStage scans the input path for files matching the configured file type (JSONL or Parquet), computes file sizes, and creates FileGroupTask objects where each task contains a balanced set of files. The blocksize parameter controls the target size of each partition (default: 1 GiB).
Key considerations:
- Supports both local and remote (S3-compatible) file paths via fsspec
- File extensions can be explicitly configured or inferred from the input file type
- Blocksize controls the granularity of work distribution across workers
- This step is skipped if initial tasks are provided from a previous pipeline stage
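The exact packing strategy of FilePartitioningStage is not shown here; a minimal greedy sketch of size-balanced grouping under an assumed `(path, size)` input format illustrates the idea:

```python
# Illustrative sketch only: greedily pack files into groups whose total size
# approaches a target blocksize, largest files first. A file larger than the
# blocksize gets its own group.

def partition_files(files, blocksize):
    """Group (path, size_bytes) pairs into partitions of roughly `blocksize` bytes."""
    groups, current, current_size = [], [], 0
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > blocksize:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups

files = [("a.jsonl", 700), ("b.jsonl", 400), ("c.jsonl", 300), ("d.jsonl", 600)]
partitions = partition_files(files, blocksize=1000)
```

With a 1000-byte blocksize, the four files above pack into three partitions; smaller blocksizes produce more, finer-grained tasks for the worker pool.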
Step 2: MinHash Signature Computation
Compute MinHash signatures for each document in the dataset. The MinHashStage reads documents from the input files, extracts character n-gram shingles of configurable length (default: 24 characters), and computes a set of MinHash values using multiple hash permutations. The number of hashes is determined by num_bands * minhashes_per_band (default: 20 * 13 = 260 hashes). A globally unique document ID is assigned to each document via the ID generator actor. Signatures are written to Parquet files in the cache directory.
Key considerations:
- Character n-grams of at least 20 characters are recommended to limit false positives
- The number of hash permutations controls the tradeoff between accuracy and computation cost
- 64-bit hashing is optional and reduces collision probability for very large datasets
- The ID generator actor maintains a global mapping of document IDs across workers
- Output MinHash signatures are written to cache_path/MinHashStage/
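A CPU-only sketch of the signature computation (the actual MinHashStage runs on GPU via RAPIDS cuDF; the hash function and salting scheme here are illustrative, not the stage's own):

```python
# For each of num_hashes seeded hash functions, keep the minimum hash value
# over all character n-gram shingles of the document. Equal positions between
# two signatures estimate the Jaccard similarity of their shingle sets.
import hashlib

def minhash_signature(text, char_ngram=24, num_hashes=260):
    shingles = {text[i:i + char_ngram]
                for i in range(max(1, len(text) - char_ngram + 1))}
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # one salted hash per "permutation"
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for s in shingles))
    return sig

doc_a = "the quick brown fox jumps over the lazy dog. " * 4
doc_b = "the quick brown fox jumped over the lazy dog. " * 4
sig_a = minhash_signature(doc_a, num_hashes=64)
sig_b = minhash_signature(doc_b, num_hashes=64)
# Fraction of matching positions estimates shingle-set Jaccard similarity.
similarity = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

More hash functions tighten this estimate, which is the accuracy/computation tradeoff noted above.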
Step 3: Locality-Sensitive Hashing
Group documents into candidate duplicate buckets using LSH on their MinHash signatures. The LSHStage reads the computed MinHash signatures from the cache directory, divides them into bands (groups of consecutive hashes), and hashes each band to produce bucket assignments. Documents sharing the same bucket in any band are candidate duplicates. The stage processes bands in configurable iterations (bands_per_iteration) to manage memory usage during the shuffle operation that redistributes data by bucket ID.
Key considerations:
- The number of bands and hashes per band control the sensitivity of duplicate detection
- More bands increase recall (finding more duplicates) at the cost of more candidate pairs
- The bands_per_iteration parameter controls memory usage during shuffling
- RMM pool size and spill memory limits are auto-configured for GPU memory management
- Output bucket assignments are written to cache_path/LSHStage/
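The banding step can be sketched as follows (a minimal in-memory version; the real LSHStage additionally shuffles bucket IDs across GPUs, and its band-hashing function is an internal detail):

```python
# Split each signature into num_bands bands of minhashes_per_band consecutive
# values and hash every band to a bucket key. Documents sharing any
# (band, bucket_key) pair become candidate duplicates.
from collections import defaultdict

def lsh_buckets(signatures, num_bands=20, minhashes_per_band=13):
    """signatures: {doc_id: [minhash values]} -> {(band, bucket_key): [doc_ids]}"""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(num_bands):
            start = band * minhashes_per_band
            key = hash(tuple(sig[start:start + minhashes_per_band]))
            buckets[(band, key)].append(doc_id)
    return buckets

# Tiny example: d1 and d2 agree on band 0 ([1, 2]) and so land in one bucket.
sigs = {"d1": [1, 2, 3, 4], "d2": [1, 2, 9, 9], "d3": [5, 6, 7, 8]}
candidates = lsh_buckets(sigs, num_bands=2, minhashes_per_band=2)
```

By standard LSH analysis, with b bands of r hashes a pair with Jaccard similarity s shares at least one bucket with probability 1 - (1 - s^r)^b; for the defaults (b = 20, r = 13) this S-curve has its threshold near (1/b)^(1/r) ≈ 0.79.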
Step 4: Buckets to Edges Conversion
Convert LSH bucket membership into a graph edge list for connected component analysis. The BucketsToEdgesStage reads the bucket assignments and generates pairs of document IDs that co-occur in the same bucket. Each pair represents an edge in the duplicate graph. This transforms the bucket-centric representation into a graph representation suitable for connected component computation.
Key considerations:
- Large buckets generate many edges (quadratic in bucket size)
- Output edges are written to cache_path/BucketsToEdgesStage/
- Memory requirements scale with the number of candidate duplicate pairs
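Two ways to realize the conversion are sketched below. Emitting all pairs is the literal reading of the step; connecting every member to the bucket's first document is a common linear-size alternative that preserves connectivity for the next stage (whether the actual BucketsToEdgesStage uses it is not stated here, so treat it as an assumption):

```python
# Turn one bucket's membership list into edges of the duplicate graph.
from itertools import combinations

def bucket_to_edges(doc_ids, star=True):
    docs = sorted(doc_ids)
    if star:
        # Linear "star": n-1 edges, same connected components as all pairs.
        return [(docs[0], d) for d in docs[1:]]
    # All co-occurrence pairs: n*(n-1)/2 edges, quadratic in bucket size.
    return list(combinations(docs, 2))
```

The quadratic variant is where the memory caveat above bites: a single 10,000-document bucket alone yields ~50 million edges.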
Step 5: Connected Components
Compute weakly connected components on the duplicate graph to identify clusters of near-duplicate documents. The ConnectedComponentsStage uses GPU-accelerated graph algorithms (via RAFT NCCL communications for multi-GPU) to find all connected components in the edge graph. Each connected component represents a group of documents that are transitively near-duplicates of each other.
Key considerations:
- Uses GPU-accelerated graph algorithms for efficient component computation
- Multi-GPU support via RAFT NCCL communications for large graphs
- Each connected component is assigned a unique cluster ID
- Output component assignments are written to cache_path/ConnectedComponentsStage/
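A single-machine union-find gives the same result the GPU stage computes at scale (this sketch is for intuition only; the actual stage uses GPU graph algorithms over RAFT/NCCL):

```python
# Weakly connected components over an edge list via union-find with path
# halving: every document in a cluster ends up mapped to the same root ID.

def connected_components(edges):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return {node: find(node) for node in parent}

# Edges (1-2, 2-3) chain docs 1..3 into one cluster; (4, 5) form another.
components = connected_components([(1, 2), (2, 3), (4, 5)])
```

Note the transitivity: documents 1 and 3 never shared a bucket directly, yet they end up in the same component via document 2, which is exactly the "transitively near-duplicates" behavior described above.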
Step 6: Duplicate Identification
Generate the final list of document IDs to remove based on connected component membership. The IdentifyDuplicatesStage reads the connected component assignments and selects which documents within each component to keep (one representative) and which to remove (all others). The removal IDs are written to the output directory as Parquet files.
Key considerations:
- One document per connected component is retained as the representative
- Removal IDs are written to output_path for use by downstream removal workflows
- The ID generator mapping is also saved to output_path for ID resolution during removal
- RMM pool size and spill memory limits are auto-configured
- The TextDuplicatesRemovalWorkflow can be used to physically remove identified duplicates
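The selection logic of the final step can be sketched as below; keeping the smallest document ID as the representative is an assumption for illustration, since the workflow only guarantees that exactly one document per component survives:

```python
# Group documents by component ID, retain one representative per component
# (here: the minimum doc ID), and collect everything else for removal.
from collections import defaultdict

def ids_to_remove(components):
    """components: {doc_id: component_id} -> sorted list of doc IDs to drop."""
    clusters = defaultdict(list)
    for doc_id, comp_id in components.items():
        clusters[comp_id].append(doc_id)
    removals = []
    for members in clusters.values():
        keep = min(members)
        removals.extend(d for d in members if d != keep)
    return sorted(removals)
```

In the real workflow this removal list is written as Parquet to output_path, where TextDuplicatesRemovalWorkflow picks it up to drop the documents from the dataset.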