Implementation: NVIDIA NeMo Curator FuzzyDeduplicationWorkflow
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, NLP, Deduplication |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete workflow for GPU-accelerated fuzzy deduplication using MinHash and LSH provided by NeMo Curator.
Description
The FuzzyDeduplicationWorkflow orchestrates a multi-stage pipeline for near-duplicate detection: FilePartitioning → MinHash signature computation → LSH bucketing → Bucket-to-edge conversion → Connected components → Duplicate identification. It uses RAPIDS cuDF for GPU acceleration and Ray for distributed execution. NeMo Curator also provides the related ExactDeduplicationWorkflow (exact deduplication) and TextDuplicatesRemovalWorkflow (removal of identified duplicates).
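The MinHash and LSH stages can be illustrated with a small pure-Python sketch. This is a toy CPU version for intuition only, not the GPU-accelerated implementation; the shingle width, hash count, and band layout here are scaled down from the workflow's defaults.

```python
import hashlib
from collections import defaultdict

def shingles(text, width=5):
    # Character n-gram shingles (the workflow defaults to width 24)
    return {text[i:i + width] for i in range(max(1, len(text) - width + 1))}

def minhash_signature(text, num_hashes=8, width=5):
    # One value per "permutation": the minimum hash over all shingles
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text, width))
        for seed in range(num_hashes)
    ]

def lsh_buckets(docs, num_bands=4, rows=2):
    # Documents whose signatures agree on any full band share a bucket
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(text, num_hashes=num_bands * rows)
        for b in range(num_bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(doc_id)
    # Keep only buckets that produce candidate duplicate pairs
    return [ids for ids in buckets.values() if len(ids) > 1]

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely unrelated text about parquet files",
}
print(lsh_buckets(docs))
```

In the real workflow, these candidate buckets are converted to graph edges and resolved with connected components, so that entire clusters of near-duplicates are grouped rather than just pairs.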
Usage
Use this workflow when you need to remove near-duplicate documents from a large text corpus. Configure char_ngrams, num_bands, and minhashes_per_band to control the precision/recall tradeoff: more bands raise recall (more candidate pairs survive), while more hashes per band raise precision (each band match becomes stricter).
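The tradeoff follows the standard LSH banding formula: two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b, where b is num_bands and r is minhashes_per_band. A quick sketch with the workflow's defaults (b=20, r=13):

```python
def bucket_match_probability(s, num_bands=20, minhashes_per_band=13):
    # Probability that two documents with Jaccard similarity s
    # agree on at least one full band of MinHash values
    return 1 - (1 - s ** minhashes_per_band) ** num_bands

# The S-curve shows a sharp threshold around s ~ 0.8
for s in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(f"s={s:.2f} -> P(candidate)={bucket_match_probability(s):.4f}")
```

With the defaults, pairs below ~0.5 similarity are almost never candidates while pairs above ~0.9 almost always are, which is why tuning num_bands and minhashes_per_band shifts the effective similarity threshold.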
Code Reference
Source Location
- Repository: NeMo Curator
- File: nemo_curator/stages/deduplication/fuzzy/workflow.py
- Lines: L41-373
Signature
class FuzzyDeduplicationWorkflow(WorkflowBase):
def __init__(
self,
cache_path: str,
output_path: str,
input_path: str,
char_ngrams: int = 24,
num_bands: int = 20,
minhashes_per_band: int = 13,
text_field: str = "text",
seed: int = 42,
use_64bit_hash: bool = False,
assign_id: bool = True,
):
"""
Args:
cache_path: Intermediate storage for dedup stages.
output_path: Where to write duplicate IDs.
input_path: Path to input documents.
char_ngrams: Character n-gram shingle width (default 24).
num_bands: Number of LSH bands (default 20).
minhashes_per_band: Hashes per band (default 13, total 260).
text_field: Column name for text content.
seed: Random seed for MinHash permutations.
use_64bit_hash: Use 64-bit vs 32-bit hashing.
assign_id: Assign curator dedup IDs if not present.
"""
Import
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_path | str | Yes | Path to input document files (parquet/jsonl) |
Outputs
| Name | Type | Description |
|---|---|---|
| result | WorkflowRunResult | Contains total_time, num_duplicates, id_generator_path |
| duplicate_ids | Parquet files | Written to output_path with document IDs to remove |
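Applying the output is an anti-join: drop every document whose ID appears in the duplicate-ID Parquet files. A minimal sketch with toy in-memory frames standing in for the workflow's outputs; the ID column name here ("id") is an assumption — inspect the schema written by your run before relying on it.

```python
import pandas as pd

# Toy stand-ins: in practice these come from pd.read_parquet on the
# input corpus and on the files under output_path.
docs = pd.DataFrame({"id": [0, 1, 2, 3], "text": ["a", "b", "a", "c"]})
dup_ids = pd.DataFrame({"id": [2]})  # IDs flagged for removal

# Anti-join: keep only documents not flagged as duplicates
kept = docs[~docs["id"].isin(set(dup_ids["id"]))]
print(kept["id"].tolist())  # [0, 1, 3]
```

For large corpora, the same anti-join can be done distributed (e.g. with cuDF or the removal workflow NeMo Curator provides) rather than in a single pandas process.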
Usage Examples
Basic Fuzzy Deduplication
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
# Create and run fuzzy dedup workflow
workflow = FuzzyDeduplicationWorkflow(
cache_path="./cache/fuzzy_dedup",
output_path="./output/duplicate_ids",
input_path="./data/documents",
char_ngrams=24,
num_bands=20,
minhashes_per_band=13,
)
result = workflow.run()
print(f"Found {result.num_duplicates} duplicates in {result.total_time}s")