Principle: NVIDIA NeMo Curator Pairwise Similarity Computation
| Principle Metadata | |
|---|---|
| Knowledge Sources | |
| Domains | |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Pairwise Similarity Computation is a technique that computes the cosine similarity between every pair of embedding vectors within each cluster to identify near-duplicate documents.
Description
Within each KMeans cluster, this technique computes an NxN cosine similarity matrix in batches. For each document in the cluster, it identifies the most similar other document and records the similarity score. This produces a per-document record containing the document ID, the ID of its nearest neighbor, and the cosine similarity score between them.
The computation is performed in batches to avoid materializing the full NxN similarity matrix in GPU memory. For a cluster of size N and batch size B, the algorithm processes B rows at a time: it computes a B x N similarity sub-matrix, then extracts the maximum similarity for each of the B query documents, excluding each document's similarity with itself.
A ranking strategy determines which member of a duplicate pair to keep:
- Hard: Keep the document that is farthest from the cluster centroid (the "outlier" or more unique document).
- Easy: Keep the document that is closest to the cluster centroid (the more typical document).
- Random: Randomly select which document to keep.
Usage
Pairwise Similarity Computation is the second stage of the Semantic Deduplication pipeline, executed after KMeans clustering has partitioned the embeddings. It processes each cluster independently, making it embarrassingly parallel across clusters. The output is consumed by the Semantic Duplicate Identification stage, which applies an epsilon threshold to determine which documents to remove.
The pairwise_batch_size parameter controls the memory/speed tradeoff: larger batches are faster but require more GPU memory. The sim_metric parameter selects between cosine similarity and L2 distance.
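To make the tradeoff concrete, the peak transient allocation per step is one B x N float32 sub-matrix. A back-of-the-envelope helper (the function name and cluster sizes here are illustrative, not part of NeMo Curator):

```python
def sub_matrix_bytes(batch_size, cluster_size, dtype_bytes=4):
    """Memory for one (B, N) similarity sub-matrix of float32 values."""
    return batch_size * cluster_size * dtype_bytes

# Hypothetical cluster of 100,000 embeddings with pairwise_batch_size=1024:
batched = sub_matrix_bytes(1024, 100_000)      # 409,600,000 bytes, ~410 MB
full = sub_matrix_bytes(100_000, 100_000)      # 40,000,000,000 bytes, ~40 GB
```

Batching thus reduces peak similarity-matrix memory from the full N^2 footprint to B x N, at the cost of more matrix-multiply launches.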
Theoretical Basis
Cosine similarity between two vectors a and b is defined as:
sim(a, b) = (a . b) / (|a| * |b|)
where:
a . b = sum of element-wise products (dot product)
|a| = sqrt(sum of squared elements) (L2 norm)
For normalized embeddings (unit vectors), cosine similarity simplifies to the dot product:
If |a| = |b| = 1, then sim(a, b) = a . b
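This equivalence is easy to check numerically. A small standalone NumPy sketch with arbitrary example vectors:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# General cosine similarity: dot product over the product of L2 norms
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product of the unit-normalized vectors gives the same value
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

assert np.isclose(cos, a_hat @ b_hat)
```

This is why the algorithm normalizes embeddings once up front: each batched similarity step then reduces to a single matrix multiplication.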
Batched computation avoids O(n^2) memory by processing batch_size rows at a time:
# Batched pairwise similarity within a cluster (runnable NumPy version
# of the pseudocode; NeMo Curator performs the same steps on GPU)
import numpy as np

def compute_pairwise_similarity(embeddings, ids, centroid, batch_size):
    n = len(embeddings)
    results = []
    # Normalize embeddings so cosine similarity reduces to a dot product
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    # L2 distance from the cluster centroid, recorded for ranking
    centroid_distances = np.linalg.norm(embeddings - centroid, axis=1)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        batch = normalized[start:end]  # shape: (B, dim)
        # Compute similarity sub-matrix: (B, dim) @ (dim, N) -> (B, N)
        sim_matrix = batch @ normalized.T
        # Mask self-similarity by setting the diagonal entries to -inf
        sim_matrix[np.arange(end - start), np.arange(start, end)] = -np.inf
        # Find the most similar other document for each query in the batch
        max_sim_indices = np.argmax(sim_matrix, axis=1)
        max_sim_scores = np.max(sim_matrix, axis=1)
        for i in range(end - start):
            results.append({
                'id': ids[start + i],
                'max_id': ids[max_sim_indices[i]],
                'cosine_sim_score': float(max_sim_scores[i]),
                'centroid_distance': float(centroid_distances[start + i]),
            })
    return results
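As a sanity check, the batched nearest-neighbor result can be verified against the full-matrix computation on random vectors. This is a standalone NumPy sketch, independent of the function above:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 8)).astype(np.float32)
normalized = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Reference: full N x N similarity matrix with the diagonal masked
full = normalized @ normalized.T
np.fill_diagonal(full, -np.inf)
ref_idx = np.argmax(full, axis=1)

# Batched computation with batch_size = 4
batch_size = 4
idx = []
for start in range(0, len(emb), batch_size):
    end = min(start + batch_size, len(emb))
    sim = normalized[start:end] @ normalized.T
    sim[np.arange(end - start), np.arange(start, end)] = -np.inf
    idx.extend(np.argmax(sim, axis=1))

# Batched and full-matrix nearest neighbors agree
assert list(ref_idx) == list(idx)
```

The batched path touches every pairwise similarity exactly once, so it produces identical nearest neighbors to the full matrix while holding only B rows at a time.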
Ranking strategies determine which document in a near-duplicate pair to keep:
Given a duplicate pair (doc_a, doc_b) with sim >= threshold:
- "hard": keep the document with LARGER centroid distance (outlier)
- "easy": keep the document with SMALLER centroid distance (typical)
- "random": keep a randomly selected document
The "hard" strategy is recommended by the SemDeDup paper as it tends to preserve more informative, unique examples while removing redundant near-copies of common patterns.