Implementation:NVIDIA NeMo Curator PairwiseStage

Implementation Metadata
Knowledge Sources	SemDeDup NeMo Curator
Domains	Data_Curation Deduplication Linear_Algebra
Last Updated	2026-02-14 17:00 GMT

Overview

PairwiseStage is a composite pipeline stage that computes pairwise cosine similarity between all embedding vectors within each KMeans cluster to identify near-duplicate document pairs.

Description

PairwiseStage is implemented as a CompositeStage[_EmptyTask, FileGroupTask] that reads the centroid-partitioned parquet files produced by KMeansStage and computes within-cluster pairwise similarity. For each cluster, it loads all N embeddings, computes the N x N cosine similarity matrix in batches, and records for each document the ID of its most similar neighbor together with the similarity score.
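The per-cluster computation described above can be sketched in NumPy. This is a minimal CPU illustration, not the library's GPU implementation: the function name `max_neighbor_similarity` and its `batch_size` argument are hypothetical, and the sketch assumes every other document in the cluster counts as a candidate neighbor.

```python
import numpy as np


def max_neighbor_similarity(embeddings, batch_size=1024):
    """For each row, return (index of the most similar other row, cosine similarity).

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)

    n = x.shape[0]
    best_idx = np.empty(n, dtype=np.int64)
    best_sim = np.empty(n, dtype=np.float64)

    # Process rows in batches so only a (batch, n) block is held in memory at once.
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        sims = x[start:stop] @ x.T
        rows = np.arange(stop - start)
        # Mask self-similarity so a document is never its own nearest neighbor.
        sims[rows, np.arange(start, stop)] = -np.inf
        best_idx[start:stop] = sims.argmax(axis=1)
        best_sim[start:stop] = sims[rows, best_idx[start:stop]]
    return best_idx, best_sim
```

Batching over rows keeps peak memory proportional to `batch_size * N` instead of `N * N`, which is the role the text attributes to pairwise_batch_size.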

The which_to_keep parameter selects the ranking strategy that decides which documents are preferred when near-duplicates are found: "hard" keeps outliers (farthest from the centroid), "easy" keeps centroid-proximate documents, and "random" selects randomly. The pairwise_batch_size parameter controls GPU memory usage during the matrix multiplication by limiting how many rows are compared at once. An optional RankingStrategy object can be provided for custom ranking logic.
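As a rough illustration of the three which_to_keep options, the sketch below orders the documents of one cluster by cosine distance to the centroid. The function `rank_cluster` and its `seed` argument are hypothetical and only mirror the behavior described in the text; the actual RankingStrategy interface may differ.

```python
import numpy as np


def rank_cluster(embeddings, centroid, which_to_keep="hard", seed=0):
    """Return row indices ordered from most to least preferred to keep.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    c = np.asarray(centroid, dtype=np.float64)
    # Cosine distance of every document to the cluster centroid.
    cos = (x @ c) / (np.linalg.norm(x, axis=1) * np.linalg.norm(c))
    dist = 1.0 - cos

    if which_to_keep == "hard":
        # Outliers first: largest distance to the centroid.
        return np.argsort(-dist, kind="stable")
    if which_to_keep == "easy":
        # Centroid-proximate documents first: smallest distance.
        return np.argsort(dist, kind="stable")
    if which_to_keep == "random":
        # No distance preference: a seeded random permutation.
        return np.random.default_rng(seed).permutation(len(x))
    raise ValueError(f"unknown which_to_keep: {which_to_keep!r}")
```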

Usage

PairwiseStage is used as the second stage of the Semantic Deduplication pipeline, after KMeansStage has produced centroid-partitioned data. Its output is consumed by IdentifyDuplicatesStage which applies an epsilon threshold to filter duplicates.
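To make the downstream step concrete, the following sketch applies an epsilon threshold to rows shaped like the pairwise output. It assumes the SemDeDup-style rule that a document is a duplicate when its best-neighbor similarity is at least 1 - eps; the function name `identify_duplicates`, the `eps` argument, and the exact comparison are assumptions, not the IdentifyDuplicatesStage implementation.

```python
def identify_duplicates(rows, eps):
    """Return ids whose best-neighbor cosine similarity is at least 1 - eps.

    `rows` is an iterable of dicts with "id" and "cosine_sim_score" keys.
    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    threshold = 1.0 - eps
    return [row["id"] for row in rows if row["cosine_sim_score"] >= threshold]


# Example rows mimicking the pairwise output columns (values are made up).
example_rows = [
    {"id": "a", "max_id": "b", "cosine_sim_score": 0.99},
    {"id": "b", "max_id": "a", "cosine_sim_score": 0.50},
    {"id": "c", "max_id": "a", "cosine_sim_score": 0.95},
]
duplicates = identify_duplicates(example_rows, eps=0.1)
```

A smaller eps raises the threshold, so fewer documents are flagged as duplicates.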

Code Reference

Source Location

nemo_curator/stages/deduplication/semantic/pairwise.py, lines 256-325.

Signature

@dataclass
class PairwiseStage(CompositeStage[_EmptyTask, FileGroupTask]):
    id_field: str
    embedding_field: str
    input_path: str
    output_path: str
    ranking_strategy: RankingStrategy | None = None
    embedding_dim: int | None = None
    pairwise_batch_size: int = 1024
    verbose: bool = False
    which_to_keep: Literal["hard", "easy", "random"] = "hard"
    sim_metric: Literal["cosine", "l2"] = "cosine"

Import

from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage

I/O Contract

Direction	Type	Description
Input	`_EmptyTask`	No explicit input task; reads centroid-partitioned parquet files from `input_path` (output of `KMeansStage`, organized as `input_path/centroid=N/`)
Output	`FileGroupTask`	Pairwise results parquet files per cluster, each containing columns: `id`, `max_id` (ID of most similar neighbor), and `cosine_sim_score` (similarity to that neighbor)
Side Effects	Disk I/O	Writes pairwise similarity results to `output_path` as parquet files, one per cluster
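The input layout in the table (input_path/centroid=N/) can be inspected with a short standard-library sketch. The helper name `list_centroid_partitions` is hypothetical, and the assumption that partition files end in `.parquet` is for illustration only.

```python
from pathlib import Path


def list_centroid_partitions(input_path):
    """Map centroid id -> sorted parquet files found under input_path/centroid=N/.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    partitions = {}
    for entry in Path(input_path).iterdir():
        # Only directories named like "centroid=<int>" are treated as partitions.
        if entry.is_dir() and entry.name.startswith("centroid="):
            centroid_id = int(entry.name.split("=", 1)[1])
            partitions[centroid_id] = sorted(entry.glob("*.parquet"))
    # Return partitions ordered by numeric centroid id.
    return dict(sorted(partitions.items()))
```

Listing partitions this way is a quick check that KMeansStage produced the expected directory structure before PairwiseStage is run.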

Usage Examples

from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage

# Configure the Pairwise similarity stage
pairwise_stage = PairwiseStage(
    id_field="doc_id",
    embedding_field="embedding",
    input_path="/data/clustered/",
    output_path="/data/pairwise_results/",
    ranking_strategy=None,
    embedding_dim=768,
    pairwise_batch_size=1024,
    verbose=True,
    which_to_keep="hard",
    sim_metric="cosine",
)

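# Execute the stage. Assumption: if PairwiseStage exposes no run() method in
# your installed version, add the stage to a pipeline and run that instead.
pairwise_stage.run()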

# Using "easy" ranking to keep centroid-proximate documents
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage

pairwise_stage = PairwiseStage(
    id_field="id",
    embedding_field="emb",
    input_path="/data/clustered/",
    output_path="/data/pairwise/",
    which_to_keep="easy",
    sim_metric="cosine",
)

# Using L2 distance metric instead of cosine similarity
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage

pairwise_stage = PairwiseStage(
    id_field="doc_id",
    embedding_field="embedding",
    input_path="/data/clustered/",
    output_path="/data/pairwise_l2/",
    pairwise_batch_size=512,
    which_to_keep="hard",
    sim_metric="l2",
)
