
Principle:NVIDIA NeMo Curator Pairwise Similarity Computation

From Leeroopedia
Last Updated 2026-02-14 17:00 GMT

Overview

Pairwise Similarity Computation is a technique that computes the cosine similarity between all pairs of embedding vectors within each cluster in order to identify near-duplicate documents.

Description

Within each KMeans cluster, this technique computes an N x N cosine similarity matrix in batches. For each document in the cluster, it identifies the most similar other document and records the similarity score. This produces a per-document record containing the document ID, the ID of its nearest neighbor, and the cosine similarity score between them.

The computation is performed in batches to avoid materializing the full N x N similarity matrix in GPU memory. For a cluster of size N and batch size B, the algorithm processes B rows at a time, computing a B x N similarity sub-matrix and extracting the maximum similarity for each of the B query documents (excluding self-similarity), so peak memory scales with B x N rather than N x N.

A ranking strategy determines which member of a duplicate pair to keep:

  • Hard: Keep the document that is farthest from the cluster centroid (the "outlier" or more unique document).
  • Easy: Keep the document that is closest to the cluster centroid (the more typical document).
  • Random: Randomly select which document to keep.
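The three strategies above can be sketched as a small selector. This is an illustrative helper (choose_keeper is a hypothetical name, not part of the NeMo Curator API), assuming each document's distance from its cluster centroid is already known:

```python
import random

def choose_keeper(dist_a, dist_b, strategy, rng=random):
    """Pick which member of a near-duplicate pair to keep.

    dist_a and dist_b are the two documents' distances from the
    cluster centroid. Returns 'a' or 'b'.
    """
    if strategy == "hard":
        # Keep the outlier: larger centroid distance
        return "a" if dist_a >= dist_b else "b"
    if strategy == "easy":
        # Keep the typical document: smaller centroid distance
        return "a" if dist_a <= dist_b else "b"
    if strategy == "random":
        return rng.choice(["a", "b"])
    raise ValueError(f"unknown strategy: {strategy}")
```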

Usage

Pairwise Similarity Computation is the second stage of the Semantic Deduplication pipeline, executed after KMeans clustering has partitioned the embeddings. It processes each cluster independently, making it embarrassingly parallel across clusters. The output is consumed by the Semantic Duplicate Identification stage, which applies an epsilon threshold to determine which documents to remove.

The pairwise_batch_size parameter controls the memory/speed tradeoff: larger batches are faster but require more GPU memory. The sim_metric parameter selects between cosine similarity and L2 distance.
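As a back-of-the-envelope check on that tradeoff (plain arithmetic, not a NeMo Curator call), the per-batch similarity sub-matrix for a cluster of N embeddings costs B * N * dtype_bytes of GPU memory:

```python
def sim_submatrix_bytes(batch_size, cluster_size, dtype_bytes=4):
    """Memory for one (B, N) float32 similarity sub-matrix."""
    return batch_size * cluster_size * dtype_bytes

# A cluster of 100,000 embeddings with batch size 1,024:
mb = sim_submatrix_bytes(1024, 100_000) / (1024 ** 2)  # 390.625 MiB
```

Doubling the batch size doubles this figure, while halving the number of kernel launches per cluster.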

Theoretical Basis

Cosine similarity between two vectors a and b is defined as:

sim(a, b) = (a . b) / (|a| * |b|)

where:
  a . b   = sum of element-wise products (dot product)
  |a|     = sqrt(sum of squared elements) (L2 norm)

For normalized embeddings (unit vectors), cosine similarity simplifies to the dot product:

If |a| = |b| = 1, then sim(a, b) = a . b
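A quick NumPy check of this equivalence (illustrative only), comparing the full cosine formula against a plain dot product of pre-normalized vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Full cosine similarity: (a . b) / (|a| * |b|)
cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize first, then take only the dot product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = a_unit @ b_unit

assert np.isclose(cos, dot)
```

This is why implementations normalize embeddings once up front: each batched sub-matrix then reduces to a single matrix multiplication.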

Batched computation avoids materializing the full O(n^2) similarity matrix by processing batch_size rows at a time:

# Batched pairwise similarity within a cluster (NumPy)
import numpy as np

def compute_pairwise_similarity(embeddings, ids, centroid, batch_size):
    n = len(embeddings)
    results = []

    # Normalize embeddings so cosine similarity reduces to a dot product
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # L2 distance from the cluster centroid, used by the ranking strategy
    centroid_distances = np.linalg.norm(embeddings - centroid, axis=1)

    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        batch = normalized[start:end]        # shape: (B, dim)

        # Compute similarity sub-matrix: (B, dim) @ (dim, N) -> (B, N)
        sim_matrix = batch @ normalized.T

        # Mask self-similarity so argmax finds the nearest *other* document
        rows = np.arange(end - start)
        sim_matrix[rows, start + rows] = -np.inf

        # Find the most similar document for each query in the batch
        max_sim_indices = sim_matrix.argmax(axis=1)
        max_sim_scores = sim_matrix[rows, max_sim_indices]

        for i in range(end - start):
            results.append({
                'id': ids[start + i],
                'max_id': ids[max_sim_indices[i]],
                'cosine_sim_score': max_sim_scores[i],
                'centroid_distance': centroid_distances[start + i],
            })

    return results

Ranking strategies determine which document in a near-duplicate pair to keep:

Given a duplicate pair (doc_a, doc_b) with sim >= threshold:
  - "hard":   keep the document with LARGER centroid distance (outlier)
  - "easy":   keep the document with SMALLER centroid distance (typical)
  - "random": keep a randomly selected document

The "hard" strategy is recommended by the SemDeDup paper as it tends to preserve more informative, unique examples while removing redundant near-copies of common patterns.
