Implementation:NVIDIA NeMo Curator PairwiseStage
| Implementation Metadata | |
|---|---|
| Knowledge Sources | |
| Domains | |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
PairwiseStage is a composite pipeline stage that computes pairwise cosine similarity between all embedding vectors within each KMeans cluster to identify near-duplicate document pairs.
Description
PairwiseStage is implemented as a CompositeStage[_EmptyTask, FileGroupTask] that reads the centroid-partitioned parquet files produced by KMeansStage and computes within-cluster pairwise similarity. For each cluster of N documents, it loads all embeddings, computes the N×N cosine similarity matrix in batches, and records for each document the ID of its most similar neighbor together with the similarity score.
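The per-cluster computation described above can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the stage's actual code: the function name `nearest_neighbors_in_cluster` is hypothetical, the real stage runs on the GPU, and the sketch compares each document against every other document in the cluster (masking only the self-match), which may differ from how the stage restricts comparisons.

```python
import numpy as np


def nearest_neighbors_in_cluster(ids, embeddings, batch_size=1024):
    """Return (id, max_id, cosine_sim_score) tuples for one cluster.

    Hypothetical helper: illustrates batched pairwise cosine similarity,
    not the NeMo Curator implementation.
    """
    emb = np.asarray(embeddings, dtype=np.float32)
    n = emb.shape[0]
    if n < 2:
        # Assumption of this sketch: a single-document cluster has no neighbor.
        return []

    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.clip(norms, 1e-12, None)

    results = []
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # One (batch, n) slice of the full n x n similarity matrix.
        sims = emb[start:stop] @ emb.T
        rows = np.arange(stop - start)
        # Mask each document's similarity to itself.
        sims[rows, rows + start] = -np.inf
        best = sims.argmax(axis=1)
        for r, j in zip(rows, best):
            results.append((ids[start + r], ids[j], float(sims[r, j])))
    return results
```

Processing the matrix one row-batch at a time keeps peak memory at roughly batch_size × N similarity values instead of N × N, which is the role the pairwise_batch_size parameter plays in the stage.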
The stage supports configurable ranking strategies via the which_to_keep parameter: "hard" keeps outliers (farthest from centroid), "easy" keeps centroid-proximate documents, and "random" selects randomly. The pairwise_batch_size parameter controls GPU memory usage during the matrix multiplication. An optional RankingStrategy object can be provided for custom ranking logic.
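To make the three which_to_keep options concrete, the following sketch orders the documents of one cluster by their distance to the centroid. The function name `rank_cluster`, the `seed` argument, and the use of cosine distance to the centroid are assumptions made for illustration; the stage's own RankingStrategy may compute and combine its sort keys differently.

```python
import numpy as np


def rank_cluster(ids, embeddings, centroid, which_to_keep="hard", seed=0):
    """Order document IDs by keep priority (first = most preferred to keep).

    Hypothetical helper, not part of NeMo Curator.
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    c = np.asarray(centroid, dtype=np.float64)

    # Cosine distance of every document to the cluster centroid.
    cos = (emb @ c) / (np.linalg.norm(emb, axis=1) * np.linalg.norm(c))
    dist = 1.0 - cos

    if which_to_keep == "hard":
        # Outliers first: farthest from the centroid.
        order = np.argsort(-dist, kind="stable")
    elif which_to_keep == "easy":
        # Centroid-proximate documents first.
        order = np.argsort(dist, kind="stable")
    elif which_to_keep == "random":
        order = np.random.default_rng(seed).permutation(len(ids))
    else:
        raise ValueError(f"unknown which_to_keep: {which_to_keep!r}")
    return [ids[i] for i in order]
```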
Usage
PairwiseStage is used as the second stage of the Semantic Deduplication pipeline, after KMeansStage has produced centroid-partitioned data. Its output is consumed by IdentifyDuplicatesStage, which applies an epsilon threshold to the similarity scores to decide which documents are duplicates.
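As a rough illustration of how a downstream epsilon threshold could act on the pairwise output, the sketch below flags a document as a duplicate when its best similarity score is at least 1 - eps. Both the helper name `ids_to_remove` and the exact rule `cosine_sim_score >= 1 - eps` are assumptions here; this page does not spell out how IdentifyDuplicatesStage applies the threshold.

```python
def ids_to_remove(rows, eps):
    """Return the IDs whose nearest-neighbor similarity meets the threshold.

    `rows` is an iterable of dicts with the output columns of PairwiseStage
    ("id", "max_id", "cosine_sim_score"). Hypothetical helper.
    """
    threshold = 1.0 - eps  # assumed rule: smaller eps means stricter matching
    return [row["id"] for row in rows if row["cosine_sim_score"] >= threshold]


# Example rows shaped like the pairwise output (values are made up).
pairwise_rows = [
    {"id": "d1", "max_id": "d0", "cosine_sim_score": 0.99},
    {"id": "d2", "max_id": "d0", "cosine_sim_score": 0.50},
    {"id": "d3", "max_id": "d1", "cosine_sim_score": 0.97},
]
duplicates = ids_to_remove(pairwise_rows, eps=0.05)
```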
Code Reference
Source Location
nemo_curator/stages/deduplication/semantic/pairwise.py, lines 256-325.
Signature
@dataclass
class PairwiseStage(CompositeStage[_EmptyTask, FileGroupTask]):
    id_field: str
    embedding_field: str
    input_path: str
    output_path: str
    ranking_strategy: RankingStrategy | None = None
    embedding_dim: int | None = None
    pairwise_batch_size: int = 1024
    verbose: bool = False
    which_to_keep: Literal["hard", "easy", "random"] = "hard"
    sim_metric: Literal["cosine", "l2"] = "cosine"
Import
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | _EmptyTask | No explicit input task; reads centroid-partitioned parquet files from input_path (output of KMeansStage, organized as input_path/centroid=N/) |
| Output | FileGroupTask | Pairwise results parquet files per cluster, each containing columns: id, max_id (ID of most similar neighbor), and cosine_sim_score (similarity to that neighbor) |
| Side Effects | Disk I/O | Writes pairwise similarity results to output_path as parquet files, one per cluster |
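The input layout in the table above (input_path/centroid=N/) can be inspected with a few lines of standard-library Python before running the stage. This is a minimal sketch; `list_centroid_partitions` is a hypothetical helper, and the assumption that each partition directory holds files ending in `.parquet` is made for illustration.

```python
from pathlib import Path


def list_centroid_partitions(input_path):
    """Map each cluster number N to the parquet files under centroid=N/.

    Hypothetical helper for checking KMeansStage output on local disk.
    """
    partitions = {}
    for child in Path(input_path).iterdir():
        if not (child.is_dir() and child.name.startswith("centroid=")):
            continue
        value = child.name.split("=", 1)[1]
        if value.isdigit():
            # Assumes parquet files sit directly inside the partition directory.
            partitions[int(value)] = sorted(child.glob("*.parquet"))
    # Return the clusters in numeric order (so 2 sorts before 10).
    return dict(sorted(partitions.items()))
```

Calling `list_centroid_partitions("/data/clustered/")` would return one entry per cluster directory, which gives a quick check that the number of partitions matches the number of clusters configured for KMeansStage.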
Usage Examples
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
# Configure the Pairwise similarity stage
pairwise_stage = PairwiseStage(
    id_field="doc_id",
    embedding_field="embedding",
    input_path="/data/clustered/",
    output_path="/data/pairwise_results/",
    ranking_strategy=None,
    embedding_dim=768,
    pairwise_batch_size=1024,
    verbose=True,
    which_to_keep="hard",
    sim_metric="cosine",
)
# Execute the stage. Composite stages are normally added to a pipeline and
# run through an executor; a direct run() call may not be available in every
# NeMo Curator version.
pairwise_stage.run()
# Using "easy" ranking to keep centroid-proximate documents
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
pairwise_stage = PairwiseStage(
    id_field="id",
    embedding_field="emb",
    input_path="/data/clustered/",
    output_path="/data/pairwise/",
    which_to_keep="easy",
    sim_metric="cosine",
)
# Using L2 distance metric instead of cosine similarity
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
pairwise_stage = PairwiseStage(
    id_field="doc_id",
    embedding_field="embedding",
    input_path="/data/clustered/",
    output_path="/data/pairwise_l2/",
    pairwise_batch_size=512,
    which_to_keep="hard",
    sim_metric="l2",
)
Related Pages
- Principle:NVIDIA_NeMo_Curator_Pairwise_Similarity_Computation
- Implementation:NVIDIA_NeMo_Curator_KMeansStage
- Implementation:NVIDIA_NeMo_Curator_Semantic_IdentifyDuplicatesStage
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- Environment:NVIDIA_NeMo_Curator_RAPIDS_GPU_Stack
- Environment:NVIDIA_NeMo_Curator_Ray_Cluster
- Heuristic:NVIDIA_NeMo_Curator_Semantic_Dedup_Cluster_Sizing