Implementation:NVIDIA NeMo Curator Fuzzy IdentifyDuplicatesStage
| Attribute | Value |
|---|---|
| Domains | Data_Curation, Deduplication |
| Implements | NVIDIA_NeMo_Curator_Fuzzy_Duplicate_Identification |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
IdentifyDuplicatesStage is the NeMo Curator processing stage that selects which documents to remove from each duplicate cluster, retaining one representative per group using a first-encountered strategy.
Description
IdentifyDuplicatesStage extends ShuffleStage, which provides distributed shuffle (repartition) capabilities. The stage reads connected component Parquet files containing _curator_dedup_id and _duplicate_group_id columns, shuffles data so that all members of the same group are co-located, applies cuDF.duplicated(keep="first") within each group, and outputs the document IDs of all documents marked for removal.
The stage also accepts a total_nparts parameter to control the number of output partitions, which affects the parallelism of downstream filtering operations.
Usage
from nemo_curator.stages.deduplication.fuzzy.identify_duplicates import IdentifyDuplicatesStage
identify_stage = IdentifyDuplicatesStage(
output_path="/output/duplicates_to_remove/",
duplicate_group_field="_duplicate_group_id",
document_id_field="_curator_dedup_id",
)
# Execute within a pipeline
removal_tasks = identify_stage.process(cc_task)
Code Reference
Source Location
nemo_curator/stages/deduplication/fuzzy/identify_duplicates.py, lines 30–147.
Signature
class IdentifyDuplicatesStage(ShuffleStage):
def __init__(
self,
duplicate_group_field: str = "_duplicate_group_id",
document_id_field: str = "_curator_dedup_id",
output_path: str = "./",
total_nparts: int | None = None,
...
)
Import
from nemo_curator.stages.deduplication.fuzzy.identify_duplicates import IdentifyDuplicatesStage
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | FileGroupTask |
Connected component Parquet files with _curator_dedup_id and _duplicate_group_id columns
|
| Output | list[FileGroupTask] |
Parquet files containing only the _curator_dedup_id values of documents to remove (all duplicates except the first-encountered representative per group)
|
| Output Column | _curator_dedup_id |
Document IDs of documents flagged for removal |
| Parameters | duplicate_group_field |
Name of the column containing group labels (default: "_duplicate_group_id")
|
| Parameters | document_id_field |
Name of the column containing document IDs (default: "_curator_dedup_id")
|
| Parameters | output_path |
Directory path where removal-list Parquet files are written |
| Parameters | total_nparts |
Optional number of output partitions (controls downstream parallelism) |
Usage Examples
Example 1: Standard duplicate identification
from nemo_curator.stages.deduplication.fuzzy.identify_duplicates import IdentifyDuplicatesStage
stage = IdentifyDuplicatesStage(
output_path="/output/duplicates_to_remove/",
duplicate_group_field="_duplicate_group_id",
document_id_field="_curator_dedup_id",
)
Example 2: With controlled output partitioning
from nemo_curator.stages.deduplication.fuzzy.identify_duplicates import IdentifyDuplicatesStage
stage = IdentifyDuplicatesStage(
output_path="/output/duplicates_to_remove/",
total_nparts=128, # Control number of output files
)
Example 3: Full pipeline integration from connected components to removal list
from nemo_curator.stages.deduplication.fuzzy.connected_components import ConnectedComponentsStage
from nemo_curator.stages.deduplication.fuzzy.identify_duplicates import IdentifyDuplicatesStage
# Connected component analysis (previous stage)
cc_stage = ConnectedComponentsStage(output_path="/output/cc/")
cc_tasks = cc_stage.process_batch(edge_tasks)
# Identify duplicates to remove
identify_stage = IdentifyDuplicatesStage(
output_path="/output/duplicates_to_remove/",
)
# Process each connected component result
removal_tasks = []
for cc_task in cc_tasks:
removal_tasks.extend(identify_stage.process(cc_task))
# removal_tasks now contain IDs of documents to filter out
Related Pages
- Principle:NVIDIA_NeMo_Curator_Fuzzy_Duplicate_Identification
- NVIDIA_NeMo_Curator_ConnectedComponentsStage — Upstream stage that identifies duplicate groups
- NVIDIA_NeMo_Curator_FuzzyDeduplicationWorkflow — The parent workflow orchestrating all stages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- Environment:NVIDIA_NeMo_Curator_RAPIDS_GPU_Stack
- Environment:NVIDIA_NeMo_Curator_Ray_Cluster