

Heuristic:NVIDIA NeMo Curator Semantic Dedup Cluster Sizing

From Leeroopedia
Domains: Deduplication, Optimization, Memory_Management
Last Updated: 2026-02-14 16:45 GMT

Overview

Use at least 1000 clusters for semantic deduplication to prevent out-of-memory errors, since each cluster must fit entirely in GPU memory during pairwise similarity computation.

Description

Semantic deduplication in NeMo Curator works by first clustering embeddings via KMeans, then computing pairwise cosine similarity within each cluster. Since the pairwise computation loads all embeddings in a cluster into GPU memory simultaneously, the cluster size directly determines peak memory usage. With too few clusters on a large dataset, individual clusters become too large to fit in memory, causing OOM failures.
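The two-stage shape described above can be sketched in plain Python. This is an illustrative sketch, not NeMo Curator's implementation — the real pipeline runs the pairwise stage on GPU — but it shows why per-cluster size, not total dataset size, drives peak memory:

```python
import math
from collections import defaultdict

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def within_cluster_pairs(embeddings, labels, threshold=0.95):
    """Group embeddings by cluster label, then compare pairs only
    within each cluster -- the cluster is the unit that must fit in
    memory during the pairwise stage."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    duplicates = []
    for members in clusters.values():
        # Peak cost here scales with len(members)**2 similarity values,
        # which is why oversized clusters cause OOM on GPU.
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                if cosine(embeddings[a], embeddings[b]) >= threshold:
                    duplicates.append((a, b))
    return duplicates
```

Documents in different clusters are never compared, so splitting the data into more clusters shrinks the quadratic term without changing the total document count.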

Usage

Apply this heuristic when configuring the `SemanticDeduplicationWorkflow` or `KMeansStage` for large datasets (millions of documents or more). The `n_clusters` parameter controls how many clusters KMeans produces. If you see CUDA OOM errors during the pairwise similarity stage, increasing `n_clusters` is the primary mitigation.
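That mitigation can be wrapped in a simple retry loop. In this sketch, the `run_dedup` callable and the use of `MemoryError` as a stand-in for a CUDA OOM are illustrative assumptions, not NeMo Curator API:

```python
def run_with_cluster_scaling(run_dedup, n_clusters=1000, max_retries=3):
    """Retry a dedup run with doubled n_clusters after an OOM.
    `run_dedup` stands in for whatever launches the workflow; it is
    assumed to raise MemoryError (here a proxy for a CUDA OOM) when a
    cluster no longer fits in GPU memory."""
    for _ in range(max_retries + 1):
        try:
            return run_dedup(n_clusters)
        except MemoryError:
            # Primary mitigation: more clusters -> smaller clusters.
            n_clusters *= 2
    raise RuntimeError(f"still out of memory at n_clusters={n_clusters}")
```

Doubling is a reasonable step size because average cluster size, and hence the quadratic pairwise cost, halves with each retry.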

The Insight (Rule of Thumb)

  • Action: Set `n_clusters >= 1000` when running semantic deduplication on large datasets.
  • Value: `MIN_RECOMMENDED_N_CLUSTERS = 1000` (defined in source code).
  • Trade-off: More clusters means smaller clusters (lower OOM risk) but slightly slower KMeans convergence and potentially less meaningful cluster boundaries.
  • Scaling rule: For very large datasets (100M+ documents), consider scaling `n_clusters` proportionally. Each cluster should contain at most a few thousand documents to stay within GPU memory bounds.
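A hypothetical helper combining the floor and the scaling rule above — `suggest_n_clusters` and its default cap of 10,000 documents per cluster are illustrative choices, not part of NeMo Curator:

```python
import math

MIN_RECOMMENDED_N_CLUSTERS = 1000  # floor from the source's runtime warning

def suggest_n_clusters(n_documents, max_docs_per_cluster=10_000):
    """Pick n_clusters so an *average* cluster stays under
    max_docs_per_cluster, never dropping below the recommended floor.
    Real KMeans clusters are uneven, so leave headroom when choosing
    max_docs_per_cluster."""
    needed = math.ceil(n_documents / max_docs_per_cluster)
    return max(MIN_RECOMMENDED_N_CLUSTERS, needed)
```

For example, 10M documents lands exactly at the 1000-cluster floor, while 100M documents scales up to 10,000 clusters.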

Reasoning

The pairwise similarity stage computes an N x N cosine similarity matrix for each cluster, where N is the number of embeddings in the cluster. The memory complexity is O(N^2) per cluster. With `n_clusters=100` on a 10M document dataset, each cluster averages 100K documents — the pairwise matrix alone would require ~37GB of GPU memory (100K x 100K x float32). With `n_clusters=1000`, clusters average 10K documents, requiring only ~0.37GB per pairwise matrix.
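The arithmetic above can be checked directly, assuming documents spread evenly across clusters and float32 matrix entries (real usage also includes the embeddings themselves and framework overhead):

```python
def pairwise_matrix_gib(n_documents, n_clusters, bytes_per_elem=4):
    """Approximate memory for one cluster's full N x N cosine-similarity
    matrix, where N is the average cluster size. bytes_per_elem=4
    corresponds to float32."""
    avg_cluster = n_documents / n_clusters
    return avg_cluster ** 2 * bytes_per_elem / 2 ** 30

# 10M docs, 100 clusters  -> ~37.25 GiB per cluster (OOM on most GPUs)
# 10M docs, 1000 clusters -> ~0.37 GiB per cluster
```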

The NeMo Curator source code surfaces this recommendation as a runtime warning (it does not hard-fail) when `n_clusters < 1000`:

# From nemo_curator/stages/deduplication/semantic/workflow.py:44-45, 202-207
MIN_RECOMMENDED_N_CLUSTERS = 1000

if self.n_clusters < MIN_RECOMMENDED_N_CLUSTERS:
    logger.warning(
        f"n_clusters={self.n_clusters} is less than {MIN_RECOMMENDED_N_CLUSTERS}. "
        "For large datasets, this may result in out-of-memory errors since "
        f"each cluster must fit in memory. Consider using n_clusters >= {MIN_RECOMMENDED_N_CLUSTERS} for large datasets."
    )

Additionally, the KMeans stage uses `max_samples_per_batch = 1 << 15` (32,768) to limit batch sizes during clustering itself (`nemo_curator/stages/deduplication/semantic/kmeans.py:86`).
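A minimal sketch of that batch limit — only the `1 << 15` constant comes from the source; the slicing loop is illustrative, not the KMeans stage's actual code:

```python
MAX_SAMPLES_PER_BATCH = 1 << 15  # 32,768, matching the KMeans stage default

def iter_batches(n_samples, batch_size=MAX_SAMPLES_PER_BATCH):
    """Yield (start, end) index ranges so that no single batch of
    embeddings exceeds batch_size samples during clustering."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)
```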
