Workflow:NVIDIA NeMo Curator Semantic Deduplication

Knowledge Sources	NeMo Curator, NeMo Curator Docs
Domains	Data_Engineering, NLP, Deduplication, Machine_Learning
Last Updated	2026-02-14 17:00 GMT

Overview

End-to-end process for detecting semantically similar documents and marking them for removal, using GPU-accelerated embedding clustering and pairwise similarity computation.

Description

This workflow implements the SemanticDeduplicationWorkflow, which identifies semantically duplicate documents from their embedding vectors rather than from surface-level text similarity. The process clusters document embeddings using GPU-accelerated KMeans, computes pairwise cosine similarity within each cluster, ranks documents using a configurable strategy, and optionally identifies duplicates above a similarity threshold. Unlike fuzzy deduplication, which detects near-identical text, semantic deduplication detects documents that express the same meaning in different words. The workflow is a WorkflowBase subclass that orchestrates two internal sub-pipelines: a KMeans clustering pipeline and a pairwise similarity pipeline. The KMeans stage always uses RayActorPoolExecutor, while the pairwise stage can use either XennaExecutor or RayActorPoolExecutor.

Usage

Execute this workflow when you have pre-computed document embeddings and need to identify semantically redundant content in your dataset. This is typically used after exact and fuzzy deduplication to catch documents that convey the same information but are written differently. It requires pre-computed embedding vectors (e.g., from a sentence transformer or vLLM embedding model) stored as Parquet files with an ID field and an embeddings field.
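The input requirements above can be checked before launching the workflow. The following is a minimal sketch in plain Python, not part of NeMo Curator: it works on in-memory dictionaries rather than Parquet files, the helper name validate_records and the default field names "id" and "embeddings" are hypothetical, and the uniqueness and equal-dimension checks are assumptions about what well-formed input looks like.

```python
def validate_records(records, id_field="id", embedding_field="embeddings"):
    """Return the shared embedding dimension, or raise ValueError."""
    dim = None
    seen_ids = set()
    for n, rec in enumerate(records):
        # Every record needs both an ID and an embedding vector.
        if id_field not in rec or embedding_field not in rec:
            raise ValueError(f"record {n} lacks {id_field!r} or {embedding_field!r}")
        # Assumption: IDs are unique so removal IDs are unambiguous.
        if rec[id_field] in seen_ids:
            raise ValueError(f"duplicate ID: {rec[id_field]!r}")
        seen_ids.add(rec[id_field])
        # Assumption: all embeddings come from one model and share a dimension.
        vec = rec[embedding_field]
        if dim is None:
            dim = len(vec)
        elif len(vec) != dim:
            raise ValueError(f"record {n} has dimension {len(vec)}, expected {dim}")
    return dim


sample = [
    {"id": "doc-1", "embeddings": [0.1, 0.2, 0.3]},
    {"id": "doc-2", "embeddings": [0.3, 0.2, 0.1]},
]
dim = validate_records(sample)
```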

Execution Steps

Step 1: KMeans Clustering

Cluster document embeddings into groups using GPU-accelerated KMeans. The KMeansStage reads Parquet files containing document IDs and embedding vectors from the input path, performs distributed KMeans clustering using RAPIDS cuML, and writes the data partitioned by centroid assignment to the cache directory. Each cluster contains documents whose embeddings are geometrically close in the embedding space. The number of clusters (n_clusters) is a critical parameter that controls both the granularity of deduplication and memory requirements.

Key considerations:

A minimum of 1000 clusters is recommended for large datasets to avoid out-of-memory errors
KMeans uses GPU-accelerated cuML with configurable max iterations, tolerance, and initialization
The k-means|| initialization method is used by default for better convergence
Input files must contain an ID field and an embeddings field in Parquet format
Output is partitioned by centroid: each partition contains documents assigned to that cluster
Always executes on RayActorPoolExecutor regardless of the pairwise executor choice
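
The cluster-then-partition idea in this step can be illustrated with a small CPU sketch in plain Python. It is not the KMeansStage or cuML code: the function names, the record layout, and the naive seeding with the first n_clusters vectors (instead of k-means||) are assumptions chosen to keep the example short.

```python
import math
from collections import defaultdict


def nearest(vec, centroids):
    """Return the index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))


def kmeans(vectors, n_clusters, max_iter=20, tol=1e-4):
    """Plain Lloyd's algorithm, seeded with the first n_clusters vectors."""
    centroids = [list(v) for v in vectors[:n_clusters]]
    for _ in range(max_iter):
        groups = defaultdict(list)
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        shift = 0.0
        for c, members in groups.items():
            # New centroid is the per-dimension mean of the assigned vectors.
            new = [sum(col) / len(members) for col in zip(*members)]
            shift = max(shift, math.dist(new, centroids[c]))
            centroids[c] = new
        if shift <= tol:  # stop once no centroid moved more than tol
            break
    return centroids


def partition_by_centroid(records, centroids):
    """Group document IDs by the index of their nearest centroid."""
    parts = defaultdict(list)
    for rec in records:
        parts[nearest(rec["embeddings"], centroids)].append(rec["id"])
    return dict(parts)


records = [
    {"id": "a", "embeddings": [0.0, 0.1]},
    {"id": "b", "embeddings": [5.0, 5.1]},
    {"id": "c", "embeddings": [0.1, 0.0]},
    {"id": "d", "embeddings": [5.1, 5.0]},
]
centroids = kmeans([r["embeddings"] for r in records], n_clusters=2)
partitions = partition_by_centroid(records, centroids)
```

Each entry in partitions plays the role of one centroid partition in the cache directory: later stages only compare documents that share a partition.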

Step 2: Pairwise Similarity Computation

Compute pairwise cosine similarity between all document embeddings within each cluster and rank documents. The PairwiseStage reads the cluster-partitioned data from the KMeans output, computes pairwise distance matrices within each cluster, and ranks documents according to a configurable strategy. The ranking determines which document to keep when duplicates are found: hard keeps the document furthest from the centroid, easy keeps the one closest to the centroid, and random keeps an arbitrarily chosen one. A custom RankingStrategy can be provided for domain-specific ranking logic that incorporates metadata fields.
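
The three built-in rankings can be sketched as a sort over distance to the centroid. This is an illustration in plain Python, not the RankingStrategy API: the function rank_cluster, its parameters, and the use of cosine distance to the centroid as the sort key are assumptions.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def rank_cluster(records, centroid, strategy="hard", seed=0):
    """Return document IDs ordered from first-to-keep to first-to-drop."""
    if strategy == "random":
        ids = [r["id"] for r in records]
        random.Random(seed).shuffle(ids)
        return ids
    if strategy not in ("hard", "easy"):
        raise ValueError(f"unknown strategy: {strategy}")
    ordered = sorted(
        records,
        key=lambda r: cosine_distance(r["embeddings"], centroid),
        reverse=(strategy == "hard"),  # hard: furthest from the centroid first
    )
    return [r["id"] for r in ordered]


cluster = [
    {"id": "near", "embeddings": [1.0, 0.05]},
    {"id": "mid", "embeddings": [1.0, 0.5]},
    {"id": "far", "embeddings": [0.2, 1.0]},
]
centroid = [1.0, 0.0]
hard_order = rank_cluster(cluster, centroid, "hard")
easy_order = rank_cluster(cluster, centroid, "easy")
```

With this toy cluster, hard ranks the outlier "far" first and easy ranks the central "near" first, so the same duplicate pair can resolve to a different survivor depending on the strategy.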

Key considerations:

Pairwise computation is O(n^2) within each cluster, so each cluster's embeddings and similarity scores must fit in GPU memory
Cosine and L2 distance metrics are supported
Batch size for pairwise computation is configurable (default: 1024)
The ranking strategy determines which document survives in each duplicate pair
Custom ranking strategies can use metadata fields for domain-specific selection
Output similarity scores are written to cache_path/pairwise_results/
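
A minimal sketch of batched pairwise scoring follows, written in plain Python rather than on the GPU. It assumes the rows are already sorted by rank and that, for each document, only the best match among higher-ranked documents is kept. That output shape, the function names, and the 4-byte score size are assumptions for illustration, not the PairwiseStage implementation.

```python
import math


def normalize(vec):
    """Scale vec to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def best_higher_ranked_match(ranked_vectors, batch_size=1024):
    """For each row i, return (j, score) for the most similar row j < i.

    Scores are computed one batch of rows at a time, so at most
    batch_size x n scores are held at once. Row 0 gets (None, 0.0).
    """
    unit = [normalize(v) for v in ranked_vectors]
    results = []
    for start in range(0, len(unit), batch_size):
        batch = unit[start:start + batch_size]
        # One block of scores: len(batch) rows by len(unit) columns.
        block = [[sum(x * y for x, y in zip(u, v)) for v in unit] for u in batch]
        for offset, row in enumerate(block):
            i = start + offset
            if i == 0:
                results.append((None, 0.0))
                continue
            # Only compare against rows ranked above row i.
            j = max(range(i), key=row.__getitem__)
            results.append((j, row[j]))
    return results


def full_matrix_bytes(n_docs, bytes_per_score=4):
    """Memory needed to hold a full n x n score matrix for one cluster."""
    return n_docs * n_docs * bytes_per_score


ranked = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
matches = best_higher_ranked_match(ranked, batch_size=2)
matrix_bytes = full_matrix_bytes(100_000)
```

The arithmetic in full_matrix_bytes shows why cluster size matters: a single cluster of 100,000 documents would need 40 GB for a full matrix of 4-byte scores, which is the kind of blow-up that a larger n_clusters and batched computation are meant to avoid.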

Step 3: Duplicate Identification

Identify document pairs whose similarity exceeds the threshold implied by the configured epsilon (similarity above 1 - epsilon) and generate removal IDs. The IdentifyDuplicatesStage reads the pairwise similarity scores and applies the epsilon threshold to determine which document pairs are duplicates. For each duplicate pair, the lower-ranked document (per the ranking strategy) is marked for removal. The final removal IDs are written to the output directory.

Key considerations:

This step only executes if an epsilon value is provided during workflow configuration
Epsilon represents the maximum distance (1 - similarity) for two documents to be considered duplicates
A smaller epsilon means stricter duplicate detection (higher similarity required)
Output removal IDs are written to output_path/duplicates/ as Parquet files
If no epsilon is provided, the workflow outputs pairwise scores without identifying duplicates
The TextDuplicatesRemovalWorkflow can be used to physically remove identified duplicates
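
The relationship between epsilon and the similarity cutoff can be sketched in a few lines of plain Python. The function name, the input layout (each document paired with its best similarity to a higher-ranked document), and the strict greater-than comparison are assumptions for illustration, not the IdentifyDuplicatesStage code.

```python
def identify_duplicates(ranked_ids, best_scores, eps):
    """Return IDs whose best higher-ranked match is within distance eps."""
    threshold = 1.0 - eps  # epsilon is a distance, so convert to similarity
    return [
        doc_id
        for doc_id, score in zip(ranked_ids, best_scores)
        if score > threshold
    ]


ranked_ids = ["doc_a", "doc_b", "doc_c"]
# Best cosine similarity of each document to any higher-ranked document.
best_scores = [0.0, 0.995, 0.10]

strict = identify_duplicates(ranked_ids, best_scores, eps=0.001)
moderate = identify_duplicates(ranked_ids, best_scores, eps=0.01)
loose = identify_duplicates(ranked_ids, best_scores, eps=0.95)
```

With these toy scores, eps=0.001 removes nothing, eps=0.01 removes only doc_b, and eps=0.95 removes both doc_b and doc_c. The top-ranked document is never removed because it has no higher-ranked match.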

GitHub URL

Workflow Repository