Principle: NVIDIA NeMo Curator KMeans Clustering for Embeddings
| Principle Metadata | |
|---|---|
| Knowledge Sources | |
| Domains | |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
KMeans Clustering for Embeddings is a technique for partitioning embedding vectors into clusters using GPU-accelerated KMeans to enable efficient within-cluster pairwise similarity computation.
Description
KMeans clustering groups similar embeddings together, reducing the pairwise comparison space from O(n^2) over the full corpus to roughly O(n*s), where s is the average cluster size (equivalently O(n^2/k) for k balanced clusters). Instead of comparing every document against every other document in the corpus, documents are first assigned to their nearest centroid, and pairwise similarity is then computed only within each cluster. This dramatically reduces computational cost while preserving the ability to identify near-duplicate pairs.
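To make the savings concrete, here is a back-of-the-envelope sketch. The corpus and cluster counts are illustrative assumptions, not figures from NeMo Curator:

```python
# Illustrative sizes (assumed): 1M documents, 1,000 balanced clusters.
n_docs = 1_000_000
n_clusters = 1_000
cluster_size = n_docs // n_clusters  # 1,000 docs per cluster

# All-pairs comparisons over the whole corpus
full_pairs = n_docs * (n_docs - 1) // 2

# Comparisons restricted to within-cluster pairs
within_pairs = n_clusters * (cluster_size * (cluster_size - 1) // 2)

reduction = full_pairs / within_pairs  # -> 1001.0, i.e. ~1000x fewer comparisons
```

With balanced clusters, the reduction factor is (n - 1) / (s - 1), which for equal-sized clusters is close to the number of clusters.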
The implementation uses RAPIDS cuML with RAFT for multi-GPU distributed KMeans, enabling the clustering step to scale to billions of embeddings across a GPU cluster. Embeddings are read from parquet files, clustered, and then written out partitioned by centroid assignment (e.g., output_path/centroid=0/, output_path/centroid=1/, etc.), facilitating downstream per-cluster processing.
Usage
KMeans Clustering for Embeddings is used as the first major stage in the Semantic Deduplication (SemDeDup) pipeline. After embeddings have been computed for all documents, this stage partitions them into clusters so that pairwise similarity computation in the next stage only needs to compare documents within the same cluster. The number of clusters (n_clusters) is a key hyperparameter: too few clusters leads to large clusters and expensive pairwise computation; too many clusters may miss duplicates that land in different clusters.
Typical usage involves choosing n_clusters proportional to the square root of the dataset size, using k-means|| (scalable k-means++) initialization, and tuning max_iter and tol for convergence quality.
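The square-root heuristic above can be expressed as a tiny helper. This is a hypothetical convenience function for illustration, not a NeMo Curator API:

```python
import math

def suggest_n_clusters(n_docs: int) -> int:
    """Sqrt heuristic for n_clusters (illustrative, not a NeMo Curator API)."""
    return max(1, int(math.sqrt(n_docs)))

suggest_n_clusters(1_000_000)  # -> 1000
```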
Theoretical Basis
KMeans minimizes the within-cluster sum of squares (WCSS), also known as inertia:

    minimize  sum_{i=1..k}  sum_{x in C_i}  || x - mu_i ||^2

where mu_i is the centroid (mean) of cluster C_i and k is the number of clusters.
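The objective can be evaluated directly with NumPy. A minimal sketch with a hand-built 1-D example (values chosen for easy mental arithmetic, not taken from the source):

```python
import numpy as np

def inertia(embeddings, assignments, centroids):
    """Within-cluster sum of squares: sum_i sum_{x in C_i} ||x - mu_i||^2."""
    diffs = embeddings - centroids[assignments]
    return float(np.sum(diffs ** 2))

# Tiny worked example: two 1-D clusters with centroids at 1.0 and 11.0.
# Each point sits at squared distance 1 from its centroid, so WCSS = 4.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
mus = np.array([[1.0], [11.0]])
inertia(X, labels, mus)  # -> 4.0
```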
The algorithm alternates between two steps:
- Assignment step: Each document embedding is assigned to the nearest centroid based on Euclidean distance.
- Update step: Each centroid is recomputed as the mean of all embeddings assigned to it.
This process repeats until convergence (change in centroids falls below tol) or max_iter iterations are reached.
Initialization uses k-means++ (specifically the scalable variant k-means||) which provides provably better initial centroids than random initialization. The oversampling_factor controls how many candidate centroids are sampled per round during initialization, with higher values improving initialization quality at the cost of additional computation.
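For intuition, here is the classic sequential k-means++ seeding in NumPy. Note this is the simpler variant: k-means|| differs by oversampling many candidates per round so the seeding parallelizes across GPUs, but the distance-weighted sampling idea is the same:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Sequential k-means++ seeding: sample each new centroid with
    probability proportional to squared distance from the nearest
    already-chosen centroid."""
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centroid
        d2 = np.min(((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

# Two well-separated blobs; seeding should tend to pick one center per blob
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
init = kmeans_pp_init(X, 2, rng)
```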
```python
# Pseudocode for the KMeans clustering stage
def kmeans_cluster(embeddings, n_clusters, max_iter, tol):
    # 1. Initialize centroids using k-means|| (scalable k-means++)
    centroids = kmeans_parallel_init(embeddings, n_clusters, oversampling_factor=2.0)
    for iteration in range(max_iter):
        # 2. Assignment: assign each embedding to the nearest centroid
        assignments = argmin_distance(embeddings, centroids)  # GPU-accelerated
        # 3. Update: recompute centroids as per-cluster means
        new_centroids = compute_means(embeddings, assignments, n_clusters)
        # 4. Check convergence on the largest centroid shift
        shift = max_centroid_shift(centroids, new_centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # 5. Write output partitioned by centroid assignment
    for cluster_id in range(n_clusters):
        cluster_docs = embeddings[assignments == cluster_id]
        write_parquet(cluster_docs, path=f"output/centroid={cluster_id}/")
    return assignments, centroids
```
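The iterate-until-convergence loop above can be exercised end to end with a NumPy-only sketch, assuming small in-memory data and omitting the GPU kernels and parquet write:

```python
import numpy as np

def lloyd(X, init_centroids, max_iter=100, tol=1e-4):
    """Plain Lloyd's iterations mirroring steps 2-4 of the pseudocode."""
    centroids = init_centroids.copy()
    for _ in range(max_iter):
        # Assignment: nearest centroid by squared Euclidean distance
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assignments = d2.argmin(axis=1)
        # Update: mean of each cluster's members (keep old centroid if empty)
        new_centroids = np.array([
            X[assignments == j].mean(axis=0) if np.any(assignments == j) else centroids[j]
            for j in range(len(centroids))
        ])
        # Convergence: stop when the largest centroid shift falls below tol
        shift = np.max(np.linalg.norm(new_centroids - centroids, axis=1))
        centroids = new_centroids
        if shift < tol:
            break
    return assignments, centroids

# Two obvious groups; converges in two iterations
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
assignments, centroids = lloyd(X, init_centroids=np.array([[0.0, 0.0], [10.0, 10.0]]))
# assignments groups points 0-1 together and points 2-3 together
```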
The key insight from the SemDeDup paper is that documents that are semantically similar will tend to be assigned to the same cluster, so restricting pairwise comparison to within-cluster pairs captures most true duplicates while achieving massive computational savings.
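The downstream step this stage enables can be sketched as follows: cosine similarity computed only among one cluster's members, where near-duplicate pairs show up as entries close to 1.0. The toy vectors are assumptions for illustration:

```python
import numpy as np

def within_cluster_similarity(embeddings, assignments, cluster_id):
    """Cosine-similarity matrix restricted to one cluster's embeddings."""
    members = embeddings[assignments == cluster_id]
    normed = members / np.linalg.norm(members, axis=1, keepdims=True)
    return normed @ normed.T  # (m, m) matrix for an m-member cluster

# Two near-duplicate vectors in cluster 0, an unrelated vector in cluster 1
X = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
labels = np.array([0, 0, 1])
sim = within_cluster_similarity(X, labels, 0)
# sim[0, 1] is close to 1.0: the two cluster-0 vectors are near-duplicates
```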