Implementation:NVIDIA NeMo Curator Semantic IdentifyDuplicatesStage
| Implementation Metadata | |
|---|---|
| Knowledge Sources | |
| Domains | |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
IdentifyDuplicatesStage is a processing stage that applies an epsilon threshold to pairwise similarity scores to produce a final list of duplicate document IDs for removal.
Description
IdentifyDuplicatesStage is implemented as a ProcessingStage[FileGroupTask, FileGroupTask] that reads pairwise similarity results produced by PairwiseStage and filters them using an epsilon threshold. Documents where cosine_sim_score >= (1.0 - eps) are identified as duplicates. The stage outputs a parquet file containing the IDs of documents to remove.
This is a CPU-only stage: it performs simple threshold filtering on pre-computed similarity scores, making it I/O-bound rather than compute-bound. It uses process_batch to handle the input as a list of FileGroupTask objects, processing all cluster results together.
The optional _num_row_groups_hint parameter can be used to tune parquet read performance by hinting at the expected number of row groups.
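The filtering rule can be sketched in a few lines of pandas. This is a simplified illustration (the identify_duplicates helper and the sample frame are hypothetical, and the real stage reads and writes parquet through NeMo Curator's own I/O layer), but the column names match the PairwiseStage output schema described below.

```python
import pandas as pd

def identify_duplicates(pairwise_df: pd.DataFrame, eps: float) -> pd.DataFrame:
    # A document is a duplicate when its similarity to its nearest
    # neighbor meets the threshold (1.0 - eps).
    threshold = 1.0 - eps
    duplicates = pairwise_df[pairwise_df["cosine_sim_score"] >= threshold]
    # Only the id column is written to the output parquet.
    return duplicates[["id"]].reset_index(drop=True)

# Sample pairwise results in the PairwiseStage output schema
pairs = pd.DataFrame({
    "id": ["doc_a", "doc_b", "doc_c"],
    "max_id": ["doc_b", "doc_a", "doc_a"],
    "cosine_sim_score": [0.97, 0.99, 0.80],
})
print(identify_duplicates(pairs, eps=0.05)["id"].tolist())  # ['doc_a', 'doc_b']
```

With eps=0.05 the threshold is 0.95, so doc_a (0.97) and doc_b (0.99) are flagged while doc_c (0.80) is kept.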
Usage
IdentifyDuplicatesStage is the final stage in the Semantic Deduplication pipeline. It takes the pairwise similarity results and produces a deduplicated ID list. The output at output_path/duplicates/ contains parquet files with a single id column listing all document IDs that should be removed from the dataset.
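Consuming the output amounts to an anti-join: drop every row of the dataset whose ID appears in the duplicates list. A sketch with pandas (the remove_duplicates helper and the paths are illustrative, not part of the NeMo Curator API):

```python
import pandas as pd

def remove_duplicates(dataset: pd.DataFrame, duplicates: pd.DataFrame,
                      id_field: str = "id") -> pd.DataFrame:
    # Keep only rows whose ID is absent from the removal list (anti-join).
    return dataset[~dataset[id_field].isin(duplicates["id"])]

# In practice the removal list would be read from the stage output, e.g.:
#   duplicates = pd.read_parquet("/data/dedup_results/duplicates/")
docs = pd.DataFrame({"id": [1, 2, 3, 4], "text": ["w", "x", "y", "z"]})
dupes = pd.DataFrame({"id": [2, 4]})
print(remove_duplicates(docs, dupes)["id"].tolist())  # [1, 3]
```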
Code Reference
Source Location
nemo_curator/stages/deduplication/semantic/identify_duplicates.py, lines 27-131.
Signature
```python
@dataclass
class IdentifyDuplicatesStage(ProcessingStage[FileGroupTask, FileGroupTask]):
    output_path: str
    eps: float
    _num_row_groups_hint: int | None = None
    verbose: bool = False
```
Import
```python
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage
```
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | list[FileGroupTask] | List of file group tasks pointing to the pairwise similarity parquet files produced by PairwiseStage, processed via process_batch. Each parquet file contains the columns id, max_id, and cosine_sim_score. |
| Output | FileGroupTask | File group task pointing to parquet files at output_path/duplicates/ containing a single id column with the IDs of documents identified as duplicates. |
| Threshold Logic | Filter | Documents where cosine_sim_score >= (1.0 - eps) are marked as duplicates and their IDs are written to the output. |
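Because the cutoff is 1.0 - eps, eps and the effective similarity threshold always sum to 1.0. A quick worked mapping:

```python
# Smaller eps -> stricter cutoff -> fewer documents flagged as duplicates.
for eps in (0.01, 0.05, 0.10):
    print(f"eps={eps:.2f} -> remove documents with cosine_sim_score >= {1.0 - eps:.2f}")
```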
Usage Examples
```python
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

# Configure the duplicate identification stage with a moderate threshold
identify_stage = IdentifyDuplicatesStage(
    output_path="/data/dedup_results/",
    eps=0.05,
    _num_row_groups_hint=None,  # optional parquet read-tuning hint (default)
    verbose=True,
)

# In a pipeline, the executor passes the list of FileGroupTask objects
# produced by PairwiseStage to this stage's process_batch method.
```
```python
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

# Strict deduplication: only very high similarity pairs
strict_stage = IdentifyDuplicatesStage(
    output_path="/data/strict_dedup/",
    eps=0.01,  # threshold = 0.99, very strict
)

# Permissive deduplication: catch more near-duplicates
permissive_stage = IdentifyDuplicatesStage(
    output_path="/data/permissive_dedup/",
    eps=0.10,  # threshold = 0.90, more aggressive removal
    verbose=True,
)
```
```python
# Full Semantic Deduplication pipeline example
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

# Stage 1: Cluster embeddings
kmeans = KMeansStage(
    n_clusters=1000,
    id_field="doc_id",
    embedding_field="embedding",
    input_path="/data/embeddings/",
    output_path="/data/clustered/",
)

# Stage 2: Compute pairwise similarity within clusters
pairwise = PairwiseStage(
    id_field="doc_id",
    embedding_field="embedding",
    input_path="/data/clustered/",
    output_path="/data/pairwise/",
    which_to_keep="hard",
)

# Stage 3: Identify duplicates above the threshold
identify = IdentifyDuplicatesStage(
    output_path="/data/final_dedup/",
    eps=0.05,
)

# The stages are composed into a semantic deduplication pipeline and executed
# in order (kmeans -> pairwise -> identify); each stage reads the previous
# stage's output path.
# Result: /data/final_dedup/duplicates/ contains parquet files with IDs to remove
```
Related Pages
- Principle:NVIDIA_NeMo_Curator_Semantic_Duplicate_Identification
- Implementation:NVIDIA_NeMo_Curator_KMeansStage
- Implementation:NVIDIA_NeMo_Curator_PairwiseStage
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- Environment:NVIDIA_NeMo_Curator_RAPIDS_GPU_Stack
- Environment:NVIDIA_NeMo_Curator_Ray_Cluster
- Heuristic:NVIDIA_NeMo_Curator_Semantic_Dedup_Cluster_Sizing