
Principle:NVIDIA NeMo Curator Semantic Duplicate Identification

From Leeroopedia
Principle Metadata
Last Updated 2026-02-14 17:00 GMT

Overview

Semantic Duplicate Identification is a technique for selecting the documents to remove from a corpus: any document whose pairwise cosine similarity score meets or exceeds the threshold 1.0 - eps is flagged as a semantic duplicate.

Description

This technique applies an epsilon threshold to pairwise similarity scores produced by the Pairwise Similarity Computation stage. Documents with cosine_sim_score >= (1.0 - eps) are marked as semantic duplicates. The stage selects which document in each near-duplicate pair to remove based on the ranking strategy established in the pairwise stage (keep the harder, easier, or randomly chosen document).

The epsilon parameter provides a tunable knob for controlling the precision-recall tradeoff in duplicate detection:

  • Small epsilon (e.g., 0.01): Strict threshold, only very similar documents are flagged. Fewer duplicates removed, lower risk of false positives.
  • Large epsilon (e.g., 0.1): Permissive threshold, moderately similar documents are also flagged. More duplicates removed, higher risk of removing distinct content.

This is a CPU-only stage: the computation is I/O bound, reading pairwise similarity scores from parquet files and applying a simple threshold filter, so no GPU acceleration is required.
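As a minimal sketch of this threshold filter, assuming the pairwise stage produced a table with the columns named above (id, max_id, cosine_sim_score) already loaded as a pandas DataFrame; the function name is illustrative, not NeMo Curator's API:

```python
import pandas as pd

def flag_duplicates(pairwise_df: pd.DataFrame, eps: float) -> set:
    """Return the IDs of documents flagged as semantic duplicates.

    pairwise_df: columns 'id', 'max_id', 'cosine_sim_score'
    eps: epsilon threshold; similarity >= 1.0 - eps marks a duplicate.
    """
    threshold = 1.0 - eps
    # A simple boolean filter: no GPU work, just a column comparison.
    dupes = pairwise_df[pairwise_df["cosine_sim_score"] >= threshold]
    return set(dupes["id"])
```

In a real pipeline the DataFrame would come from `pd.read_parquet` over the pairwise stage's output files; the filtering logic is the same.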

Usage

Semantic Duplicate Identification is the third and final stage of the Semantic Deduplication pipeline. It consumes the pairwise similarity results and produces a list of document IDs to remove. The output can then be used to filter the original dataset, producing a deduplicated corpus for downstream model training.

The eps parameter should be tuned based on the specific dataset and use case. The SemDeDup paper recommends evaluating downstream model performance at several epsilon values to find the optimal balance between data reduction and training quality.
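One cheap first step in such a tuning loop is to count how many documents each candidate epsilon would remove, since the threshold filter itself is inexpensive. A sketch, assuming the same pairwise DataFrame as above (downstream model evaluation is dataset-specific and omitted):

```python
import pandas as pd

def removal_counts(pairwise_df: pd.DataFrame, eps_values) -> dict:
    """Map each candidate epsilon to the number of documents it would flag."""
    scores = pairwise_df["cosine_sim_score"]
    # Smaller eps -> higher threshold -> fewer documents flagged.
    return {eps: int((scores >= 1.0 - eps).sum()) for eps in eps_values}
```

Comparing these counts against corpus size gives a quick sense of the data-reduction side of the tradeoff before committing to a full training run at each epsilon.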

Theoretical Basis

Given pairwise similarity scores, the duplicate identification decision is:

For each document pair (a, b) with similarity score s:
    duplicate if s >= (1.0 - eps)

Equivalently:
    duplicate if (1.0 - s) <= eps

where:
    s    = cosine similarity between a and b (from pairwise stage)
    eps  = epsilon threshold (user-configured, typically 0.01 to 0.1)

The epsilon controls the precision/recall tradeoff:

eps -> 0:  threshold -> 1.0  (only exact duplicates, high precision, low recall)
eps -> 1:  threshold -> 0.0  (everything is a duplicate, low precision, high recall)

Typical values:
  eps = 0.01  =>  threshold = 0.99  (very strict)
  eps = 0.05  =>  threshold = 0.95  (moderate)
  eps = 0.10  =>  threshold = 0.90  (permissive)

The identification algorithm processes the pairwise results as follows:

# Pseudocode for semantic duplicate identification
def identify_duplicates(pairwise_results, eps):
    """
    pairwise_results: list of dicts with keys
        'id', 'max_id', 'cosine_sim_score'
    eps: epsilon threshold (float)
    """
    duplicates_to_remove = set()
    threshold = 1.0 - eps

    for record in pairwise_results:
        if record['cosine_sim_score'] >= threshold:
            # This document has a near-duplicate with similarity above threshold.
            # The ranking strategy (hard/easy/random) already determined which
            # document in the pair is the "keeper" during the pairwise stage.
            # The 'id' field here is the document to be REMOVED.
            duplicates_to_remove.add(record['id'])

    return duplicates_to_remove

Because the ranking decision (which document in a pair to keep) was already made during the pairwise stage, this stage only needs to apply the threshold and collect the IDs of documents to remove. This makes it computationally inexpensive and suitable for CPU-only execution.
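Applying the collected removal set back to the original dataset can be sketched as follows, assuming the dataset carries the same document ID column (here called "id"; the column name is an assumption for illustration):

```python
import pandas as pd

def apply_removals(dataset: pd.DataFrame, duplicates_to_remove: set) -> pd.DataFrame:
    """Drop every document whose ID was flagged as a semantic duplicate."""
    # Keep only rows whose 'id' is NOT in the removal set.
    return dataset[~dataset["id"].isin(duplicates_to_remove)].reset_index(drop=True)
```

The surviving rows form the deduplicated corpus used for downstream model training.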
