Implementation:NVIDIA NeMo Curator Semantic IdentifyDuplicatesStage
| Implementation Metadata | |
|---|---|
| Knowledge Sources | |
| Domains | |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
IdentifyDuplicatesStage is a processing stage that applies an epsilon threshold to pairwise similarity scores to produce a final list of duplicate document IDs for removal.
Description
IdentifyDuplicatesStage is implemented as a ProcessingStage[FileGroupTask, FileGroupTask] that reads pairwise similarity results produced by PairwiseStage and filters them using an epsilon threshold. Documents where cosine_sim_score >= (1.0 - eps) are identified as duplicates. The stage outputs a parquet file containing the IDs of documents to remove.
This is a CPU-only stage: it performs simple threshold filtering on pre-computed similarity scores, making it I/O-bound rather than compute-bound. It uses process_batch to handle the input as a list of FileGroupTask objects, processing all cluster results together.
The optional _num_row_groups_hint parameter can be used to tune parquet read performance by hinting at the expected number of row groups.
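The filtering rule can be sketched in a few lines of pandas. This is a simplified illustration (the identify_duplicates helper and the sample frame are hypothetical, and the real stage reads and writes parquet through NeMo Curator's own I/O layer), but the column names match the PairwiseStage output schema described below.

```python
import pandas as pd

def identify_duplicates(pairwise_df: pd.DataFrame, eps: float) -> pd.DataFrame:
    # A document is a duplicate when its similarity to its nearest
    # neighbor meets the threshold (1.0 - eps).
    threshold = 1.0 - eps
    duplicates = pairwise_df[pairwise_df["cosine_sim_score"] >= threshold]
    # Only the id column is written to the output parquet.
    return duplicates[["id"]].reset_index(drop=True)

# Sample pairwise results in the PairwiseStage output schema
pairs = pd.DataFrame({
    "id": ["doc_a", "doc_b", "doc_c"],
    "max_id": ["doc_b", "doc_a", "doc_a"],
    "cosine_sim_score": [0.97, 0.99, 0.80],
})
print(identify_duplicates(pairs, eps=0.05)["id"].tolist())  # ['doc_a', 'doc_b']
```

With eps=0.05 the threshold is 0.95, so doc_a (0.97) and doc_b (0.99) are flagged while doc_c (0.80) is kept.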
Usage
IdentifyDuplicatesStage is the final stage in the Semantic Deduplication pipeline. It takes the pairwise similarity results and produces a deduplicated ID list. The output at output_path/duplicates/ contains parquet files with a single id column listing all document IDs that should be removed from the dataset.
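Consuming the output amounts to an anti-join: drop every row of the dataset whose ID appears in the duplicates list. A sketch with pandas (the remove_duplicates helper and the paths are illustrative, not part of the NeMo Curator API):

```python
import pandas as pd

def remove_duplicates(dataset: pd.DataFrame, duplicates: pd.DataFrame,
                      id_field: str = "id") -> pd.DataFrame:
    # Keep only rows whose ID is absent from the removal list (anti-join).
    return dataset[~dataset[id_field].isin(duplicates["id"])]

# In practice the removal list would be read from the stage output, e.g.:
#   duplicates = pd.read_parquet("/data/dedup_results/duplicates/")
docs = pd.DataFrame({"id": [1, 2, 3, 4], "text": ["w", "x", "y", "z"]})
dupes = pd.DataFrame({"id": [2, 4]})
print(remove_duplicates(docs, dupes)["id"].tolist())  # [1, 3]
```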
Code Reference
Source Location
nemo_curator/stages/deduplication/semantic/identify_duplicates.py, lines 27-131.
Signature
```python
@dataclass
class IdentifyDuplicatesStage(ProcessingStage[FileGroupTask, FileGroupTask]):
    output_path: str
    eps: float
    _num_row_groups_hint: int | None = None
    verbose: bool = False
```
Import
```python
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage
```
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | list[FileGroupTask] | List of file group tasks pointing to the pairwise similarity parquet files produced by PairwiseStage, processed via process_batch. Each parquet file contains the columns id, max_id, and cosine_sim_score. |
| Output | FileGroupTask | File group task pointing to parquet files at output_path/duplicates/ containing a single id column with the IDs of documents identified as duplicates. |
| Threshold Logic | Filter | Documents where cosine_sim_score >= (1.0 - eps) are marked as duplicates and their IDs are written to the output. |
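Because the cutoff is 1.0 - eps, eps and the effective similarity threshold always sum to 1.0. A quick worked mapping:

```python
# Smaller eps -> stricter cutoff -> fewer documents flagged as duplicates.
for eps in (0.01, 0.05, 0.10):
    print(f"eps={eps:.2f} -> remove documents with cosine_sim_score >= {1.0 - eps:.2f}")
```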
Usage Examples
```python
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

# Configure the duplicate identification stage with a moderate threshold
identify_stage = IdentifyDuplicatesStage(
    output_path="/data/dedup_results/",
    eps=0.05,
    _num_row_groups_hint=None,  # optional parquet read-tuning hint (default)
    verbose=True,
)

# In a pipeline, the executor passes the list of FileGroupTask objects
# produced by PairwiseStage to this stage's process_batch method.
```
```python
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

# Strict deduplication: only very high similarity pairs
strict_stage = IdentifyDuplicatesStage(
    output_path="/data/strict_dedup/",
    eps=0.01,  # threshold = 0.99, very strict
)

# Permissive deduplication: catch more near-duplicates
permissive_stage = IdentifyDuplicatesStage(
    output_path="/data/permissive_dedup/",
    eps=0.10,  # threshold = 0.90, more aggressive removal
    verbose=True,
)
```
```python
# Full Semantic Deduplication pipeline example
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

# Stage 1: Cluster embeddings
kmeans = KMeansStage(
    n_clusters=1000,
    id_field="doc_id",
    embedding_field="embedding",
    input_path="/data/embeddings/",
    output_path="/data/clustered/",
)

# Stage 2: Compute pairwise similarity within clusters
pairwise = PairwiseStage(
    id_field="doc_id",
    embedding_field="embedding",
    input_path="/data/clustered/",
    output_path="/data/pairwise/",
    which_to_keep="hard",
)

# Stage 3: Identify duplicates above the threshold
identify = IdentifyDuplicatesStage(
    output_path="/data/final_dedup/",
    eps=0.05,
)

# The stages are composed into a semantic deduplication pipeline and executed
# in order (kmeans -> pairwise -> identify); each stage reads the previous
# stage's output path.
# Result: /data/final_dedup/duplicates/ contains parquet files with IDs to remove
```
Related Pages
- Principle:NVIDIA_NeMo_Curator_Semantic_Duplicate_Identification
- Implementation:NVIDIA_NeMo_Curator_KMeansStage
- Implementation:NVIDIA_NeMo_Curator_PairwiseStage
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- Environment:NVIDIA_NeMo_Curator_RAPIDS_GPU_Stack
- Environment:NVIDIA_NeMo_Curator_Ray_Cluster
- Heuristic:NVIDIA_NeMo_Curator_Semantic_Dedup_Cluster_Sizing