Workflow:NVIDIA NeMo Curator Semantic Deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Deduplication, Machine_Learning |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
End-to-end process for detecting semantically similar documents and flagging them for removal, using embedding-based clustering and GPU-accelerated pairwise similarity computation.
Description
This workflow implements the SemanticDeduplicationWorkflow, which identifies semantically duplicate documents from their embedding vectors rather than from surface-level text similarity. The process clusters document embeddings using GPU-accelerated KMeans, computes pairwise cosine similarity within each cluster, ranks documents using a configurable strategy, and optionally identifies duplicates above a similarity threshold. Unlike fuzzy deduplication, which detects near-identical text, semantic deduplication detects documents that express the same meaning in different words. The workflow is implemented as a WorkflowBase subclass that orchestrates two internal sub-pipelines: a KMeans clustering pipeline and a pairwise similarity pipeline. The KMeans stage always uses RayActorPoolExecutor, while the pairwise stage can use either XennaExecutor or RayActorPoolExecutor.
Usage
Execute this workflow when you need to identify semantically redundant content in your dataset. It is typically run after exact and fuzzy deduplication to catch documents that convey the same information but are written differently. It requires pre-computed embedding vectors (e.g., from a sentence transformer or vLLM embedding model) stored as Parquet files with an ID field and an embeddings field.
Execution Steps
Step 1: KMeans Clustering
Cluster document embeddings into groups using GPU-accelerated KMeans. The KMeansStage reads Parquet files containing document IDs and embedding vectors from the input path, performs distributed KMeans clustering using RAPIDS cuML, and writes the data partitioned by centroid assignment to the cache directory. Each cluster contains documents whose embeddings are geometrically close in the embedding space. The number of clusters (n_clusters) is a critical parameter that controls both the granularity of deduplication and memory requirements.
Key considerations:
- A minimum of 1000 clusters is recommended for large datasets to avoid out-of-memory errors
- KMeans uses GPU-accelerated cuML with configurable max iterations, tolerance, and initialization
- The k-means|| initialization method is used by default for better convergence
- Input files must contain an ID field and an embeddings field in Parquet format
- Output is partitioned by centroid: each partition contains documents assigned to that cluster
- Always executes on RayActorPoolExecutor regardless of the pairwise executor choice
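The partition-by-centroid output described above can be illustrated with a minimal pure-Python sketch. It covers only the assignment step (grouping documents by nearest centroid) and assumes the centroids are already fitted; it does not use cuML or Ray, and the function names and sample data are hypothetical, chosen for illustration rather than taken from NeMo Curator.

```python
from collections import defaultdict


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_centroids(records, centroids):
    """Group document IDs by the index of their nearest centroid.

    records: iterable of (doc_id, embedding) pairs.
    centroids: list of already-fitted centroid vectors (assumed given).
    """
    partitions = defaultdict(list)
    for doc_id, embedding in records:
        nearest = min(
            range(len(centroids)),
            key=lambda c: squared_l2(embedding, centroids[c]),
        )
        partitions[nearest].append(doc_id)
    return dict(partitions)


# Toy data: two centroids, three documents with 2-dimensional embeddings.
records = [
    ("doc-a", [1.0, 0.0]),
    ("doc-b", [0.9, 0.1]),
    ("doc-c", [0.0, 1.0]),
]
centroids = [[1.0, 0.0], [0.0, 1.0]]

# Each key plays the role of one centroid partition in the cache directory.
partitions = assign_to_centroids(records, centroids)
```

Because later stages only compare documents that share a partition, a larger n_clusters yields smaller partitions and smaller pairwise matrices, at the cost of missing duplicates that land in different clusters.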
Step 2: Pairwise Similarity Computation
Compute pairwise cosine similarity between all document embeddings within each cluster and rank documents. The PairwiseStage reads the cluster-partitioned data from the KMeans output, computes pairwise distance matrices within each cluster, and ranks documents according to a configurable strategy. The ranking determines which documents to keep when duplicates are found: hard keeps the document furthest from the centroid, easy keeps the one closest, and random selects randomly. A custom RankingStrategy can be provided for domain-specific ranking logic that incorporates metadata fields.
Key considerations:
- Pairwise computation is O(n^2) within each cluster, so cluster size must fit in GPU memory
- Cosine and L2 distance metrics are supported
- Batch size for pairwise computation is configurable (default: 1024)
- The ranking strategy determines which document survives in each duplicate pair
- Custom ranking strategies can use metadata fields for domain-specific selection
- Output similarity scores are written to cache_path/pairwise_results/
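As a rough sketch of the ranking and pairwise steps within a single cluster, the following pure-Python example orders documents by cosine distance to the centroid and then records, for each document, its highest similarity to any document ranked above it. The helper names and the `strategy` parameter are hypothetical, the random strategy is omitted, and the real stage runs batched on GPU; this only illustrates the logic under those assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_cluster(docs, centroid, strategy="hard"):
    """Order (doc_id, embedding) pairs so preferred survivors come first.

    "hard" ranks documents furthest from the centroid first;
    "easy" ranks documents closest to the centroid first.
    """
    if strategy not in ("hard", "easy"):
        raise ValueError(f"unsupported strategy in this sketch: {strategy}")

    def distance_to_centroid(doc):
        return 1.0 - cosine_similarity(doc[1], centroid)

    return sorted(docs, key=distance_to_centroid, reverse=(strategy == "hard"))


def max_similarity_to_higher_ranked(ranked):
    """Map each doc ID to its highest similarity with any higher-ranked doc.

    The top-ranked document has nothing above it; 0.0 is used as a
    placeholder score (an assumption of this sketch).
    """
    scores = {}
    for i, (doc_id, embedding) in enumerate(ranked):
        sims = [cosine_similarity(embedding, other) for _, other in ranked[:i]]
        scores[doc_id] = max(sims, default=0.0)
    return scores


centroid = [1.0, 0.0]
docs = [("a", [1.0, 0.0]), ("b", [0.8, 0.6]), ("c", [0.6, 0.8])]

ranked_hard = rank_cluster(docs, centroid, strategy="hard")
ranked_easy = rank_cluster(docs, centroid, strategy="easy")
scores_hard = max_similarity_to_higher_ranked(ranked_hard)
```

Comparing each document only against higher-ranked ones means the top-ranked document in a group of mutual duplicates is never flagged, which is how the ranking strategy decides the survivor.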
Step 3: Duplicate Identification
Identify document pairs whose distance falls within the configured epsilon threshold (equivalently, whose similarity exceeds 1 - epsilon) and generate removal IDs. The IdentifyDuplicatesStage reads the pairwise similarity scores and applies the epsilon threshold to determine which document pairs are duplicates. For each duplicate pair, the lower-ranked document (per the ranking strategy) is marked for removal. The final removal IDs are written to the output directory.
Key considerations:
- This step only executes if an epsilon value is provided during workflow configuration
- Epsilon represents the maximum distance (1 - similarity) for two documents to be considered duplicates
- A smaller epsilon means stricter duplicate detection (higher similarity required)
- Output removal IDs are written to output_path/duplicates/ as Parquet files
- If no epsilon is provided, the workflow outputs pairwise scores without identifying duplicates
- The TextDuplicatesRemovalWorkflow can be used to physically remove identified duplicates
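The relationship between epsilon and similarity can be made concrete with a small sketch. It assumes each document already carries its highest cosine similarity to a higher-ranked document in the same cluster, and it treats a document as a duplicate when that similarity is strictly greater than 1 - epsilon. The function name, the strict comparison, and the sample scores are assumptions for illustration, not the library's confirmed behavior.

```python
def identify_duplicates(max_similarity, eps):
    """Return sorted doc IDs whose best match is within eps cosine distance.

    max_similarity: dict mapping doc_id to its highest cosine similarity
    with any higher-ranked document in the same cluster.
    eps: maximum distance (1 - similarity) for two docs to count as duplicates.
    """
    threshold = 1.0 - eps
    return sorted(
        doc_id for doc_id, sim in max_similarity.items() if sim > threshold
    )


# Hypothetical per-document scores for one cluster.
scores = {"doc-a": 0.0, "doc-b": 0.995, "doc-c": 0.93}

strict = identify_duplicates(scores, eps=0.01)  # similarity must exceed 0.99
loose = identify_duplicates(scores, eps=0.1)    # similarity must exceed 0.9
```

With the smaller epsilon only doc-b is flagged, while the larger epsilon also flags doc-c, matching the note that a smaller epsilon requires higher similarity before a document is marked for removal.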