Implementation:NVIDIA NeMo Curator PairwiseStage

Implementation Metadata
Last Updated 2026-02-14 17:00 GMT

Overview

PairwiseStage is a composite pipeline stage that computes pairwise cosine similarity between all embedding vectors within each KMeans cluster to identify near-duplicate document pairs.

Description

PairwiseStage is implemented as a CompositeStage[_EmptyTask, FileGroupTask] that reads the centroid-partitioned parquet files produced by KMeansStage and computes within-cluster pairwise similarity. For each cluster of N documents, it loads all embeddings, computes the N×N cosine similarity matrix in batches, and records for each document its most similar neighbor together with the corresponding similarity score.
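
The per-cluster computation described above can be sketched in plain NumPy. This is a minimal illustration, not the library's GPU implementation: the function name most_similar_neighbors is hypothetical, and masking each document's similarity to itself is an assumption about how "most similar neighbor" is defined.

```python
import numpy as np


def most_similar_neighbors(embeddings: np.ndarray, batch_size: int = 1024):
    """Return (neighbor_index, similarity) for every row of `embeddings`.

    Hypothetical helper for illustration; runs on CPU with NumPy.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    n = unit.shape[0]
    best_idx = np.full(n, -1, dtype=np.int64)
    best_sim = np.full(n, -np.inf, dtype=np.float64)

    # Process rows in batches so only a (batch x n) block is held at once.
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        block = unit[start:stop] @ unit.T  # shape: (stop - start, n)
        # Assumption: a document is not its own neighbor, so mask the
        # entries that lie on the diagonal of the full N x N matrix.
        rows = np.arange(stop - start)
        block[rows, rows + start] = -np.inf
        best_idx[start:stop] = block.argmax(axis=1)
        best_sim[start:stop] = block.max(axis=1)
    return best_idx, best_sim
```

Batching the rows keeps peak memory proportional to batch_size × N rather than N × N, which is the role the page attributes to pairwise_batch_size.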

The stage supports configurable ranking strategies via the which_to_keep parameter: "hard" keeps outliers (farthest from centroid), "easy" keeps centroid-proximate documents, and "random" selects randomly. The pairwise_batch_size parameter controls GPU memory usage during the matrix multiplication. An optional RankingStrategy object can be provided for custom ranking logic.
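
As a rough illustration of the three which_to_keep options, the sketch below orders the documents of one cluster by their distance to the centroid. The helper rank_within_cluster, the use of cosine distance, and the seed argument are assumptions made for the example; the page does not state which distance the library uses or how a custom RankingStrategy is structured.

```python
import numpy as np


def rank_within_cluster(embeddings, centroid, which_to_keep="hard", seed=0):
    """Return row indices ordered from most to least preferred to keep.

    Hypothetical helper; cosine distance to the centroid is an assumption.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = centroid / np.linalg.norm(centroid)
    dist = 1.0 - unit @ c  # cosine distance of each document to the centroid

    if which_to_keep == "hard":
        # Outliers first: largest distance to the centroid.
        return np.argsort(-dist, kind="stable")
    if which_to_keep == "easy":
        # Centroid-proximate documents first: smallest distance.
        return np.argsort(dist, kind="stable")
    if which_to_keep == "random":
        # Arbitrary order, reproducible through the seed.
        return np.random.default_rng(seed).permutation(len(embeddings))
    raise ValueError(f"unknown which_to_keep: {which_to_keep!r}")
```

Documents earlier in the returned order are the ones the chosen strategy prefers to retain when a near-duplicate pair is found.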

Usage

PairwiseStage is used as the second stage of the Semantic Deduplication pipeline, after KMeansStage has produced centroid-partitioned data. Its output is consumed by IdentifyDuplicatesStage, which applies an epsilon threshold to filter duplicates.
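
To make the downstream thresholding concrete, here is a minimal sketch of an epsilon filter over pairwise results. The function flag_duplicates is hypothetical, and the rule "flag a document when its score exceeds 1 - eps" is an assumption; the page only states that IdentifyDuplicatesStage applies an epsilon threshold.

```python
def flag_duplicates(rows, eps):
    """Return the ids whose best-neighbor similarity exceeds 1 - eps.

    `rows` is an iterable of dicts with the keys id, max_id and
    cosine_sim_score, mirroring the output columns of PairwiseStage.
    The comparison rule is an assumption made for this sketch.
    """
    threshold = 1.0 - eps
    return [row["id"] for row in rows if row["cosine_sim_score"] > threshold]


# Illustrative records only; not real pipeline output.
example_rows = [
    {"id": "a", "max_id": "b", "cosine_sim_score": 0.95},
    {"id": "b", "max_id": "c", "cosine_sim_score": 0.50},
]
flagged = flag_duplicates(example_rows, eps=0.1)
```

Under this assumed rule, a smaller eps is stricter and flags fewer documents.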

Code Reference

Source Location

nemo_curator/stages/deduplication/semantic/pairwise.py, lines 256-325.

Signature

@dataclass
class PairwiseStage(CompositeStage[_EmptyTask, FileGroupTask]):
    id_field: str
    embedding_field: str
    input_path: str
    output_path: str
    ranking_strategy: RankingStrategy | None = None
    embedding_dim: int | None = None
    pairwise_batch_size: int = 1024
    verbose: bool = False
    which_to_keep: Literal["hard", "easy", "random"] = "hard"
    sim_metric: Literal["cosine", "l2"] = "cosine"

Import

from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage

I/O Contract

Direction | Type | Description
Input | _EmptyTask | No explicit input task; reads centroid-partitioned parquet files from input_path (output of KMeansStage, organized as input_path/centroid=N/)
Output | FileGroupTask | Pairwise results parquet files per cluster, each containing columns: id, max_id (ID of most similar neighbor), and cosine_sim_score (similarity to that neighbor)
Side Effects | Disk I/O | Writes pairwise similarity results to output_path as parquet files, one per cluster
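
The input layout in the contract above (input_path/centroid=N/) can be inspected with the standard library before running the stage. This is a sketch under assumptions: list_centroid_partitions is a hypothetical helper, N is assumed to be an integer, and the files are assumed to carry a .parquet suffix.

```python
from pathlib import Path


def list_centroid_partitions(input_path):
    """Map each centroid id to the parquet files found in its directory.

    Hypothetical helper; assumes directories named centroid=<int>.
    """
    partitions = {}
    for directory in Path(input_path).glob("centroid=*"):
        if directory.is_dir():
            centroid_id = int(directory.name.split("=", 1)[1])
            partitions[centroid_id] = sorted(directory.glob("*.parquet"))
    # Return the partitions ordered by centroid id for stable iteration.
    return dict(sorted(partitions.items()))
```

A quick check such as this helps confirm that KMeansStage wrote the expected partitions before the pairwise computation starts.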

Usage Examples

from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage

# Configure the Pairwise similarity stage
pairwise_stage = PairwiseStage(
    id_field="doc_id",
    embedding_field="embedding",
    input_path="/data/clustered/",
    output_path="/data/pairwise_results/",
    ranking_strategy=None,
    embedding_dim=768,
    pairwise_batch_size=1024,
    verbose=True,
    which_to_keep="hard",
    sim_metric="cosine",
)

# Execute the stage
pairwise_stage.run()

# Using "easy" ranking to keep centroid-proximate documents
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage

pairwise_stage = PairwiseStage(
    id_field="id",
    embedding_field="emb",
    input_path="/data/clustered/",
    output_path="/data/pairwise/",
    which_to_keep="easy",
    sim_metric="cosine",
)

# Using L2 distance metric instead of cosine similarity
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage

pairwise_stage = PairwiseStage(
    id_field="doc_id",
    embedding_field="embedding",
    input_path="/data/clustered/",
    output_path="/data/pairwise_l2/",
    pairwise_batch_size=512,
    which_to_keep="hard",
    sim_metric="l2",
)
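
For intuition about the two sim_metric options, the sketch below compares cosine similarity and L2 distance on unit-normalised vectors, where the two are related by ||a - b||² = 2 - 2·cos(a, b). The helper cosine_and_l2 is hypothetical, and whether the library normalises embeddings before computing L2 distance is not stated on this page.

```python
import numpy as np


def cosine_and_l2(a, b):
    """Return (cosine similarity, L2 distance) of two vectors after
    unit-normalising them. Hypothetical helper for illustration."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    cos = float(a @ b)
    l2 = float(np.linalg.norm(a - b))
    return cos, l2


cos, l2 = cosine_and_l2([1.0, 0.0], [0.0, 1.0])
# On unit vectors the squared L2 distance equals 2 - 2 * cosine similarity.
identity_gap = abs(l2**2 - (2.0 - 2.0 * cos))
```

On normalised embeddings the two metrics therefore rank neighbors identically; they can differ when embeddings are not normalised.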
