Implementation:NVIDIA NeMo Curator KMeansStage
| Implementation Metadata | |
|---|---|
| Knowledge Sources | |
| Domains | |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
KMeansStage is a composite pipeline stage that performs GPU-accelerated KMeans clustering on document embeddings, partitioning them by centroid assignment to enable efficient within-cluster pairwise similarity computation.
Description
KMeansStage is implemented as a CompositeStage[_EmptyTask, _EmptyTask] that orchestrates the full KMeans clustering workflow. It reads embedding vectors from parquet or JSONL files, performs distributed KMeans clustering using RAPIDS cuML with RAFT across multiple GPUs, and writes the results as centroid-partitioned parquet files. Each output partition directory (output_path/centroid=N/) contains the documents assigned to that centroid, including their IDs, embeddings, and optional metadata fields.
The stage supports configurable initialization strategies (k-means|| or random), convergence parameters (max_iter, tol), and batching controls (max_samples_per_batch) to manage GPU memory usage for large datasets.
Usage
KMeansStage is used as the first stage in the Semantic Deduplication pipeline, after embeddings have been precomputed. It is typically followed by PairwiseStage which computes pairwise similarity within each cluster.
Code Reference
Source Location
nemo_curator/stages/deduplication/semantic/kmeans.py, lines 293-379.
Signature
@dataclass
class KMeansStage(CompositeStage[_EmptyTask, _EmptyTask]):
n_clusters: int
id_field: str
embedding_field: str
input_path: str | list[str]
output_path: str
metadata_fields: list[str] | None = None
verbose: bool = False
embedding_dim: int | None = None
input_filetype: Literal["jsonl", "parquet"] = "parquet"
max_iter: int = 300
tol: float = 1e-4
random_state: int = 42
init: Literal["k-means||", "random"] = "k-means||"
oversampling_factor: float = 2.0
max_samples_per_batch: int = 32768
Import
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | _EmptyTask |
No explicit input task; reads embeddings from input_path (parquet or JSONL files containing id_field and embedding_field columns)
|
| Output | _EmptyTask |
No explicit output task; side effect writes centroid-partitioned parquet to output_path/centroid=N/ directories
|
| Side Effects | Disk I/O | Creates partitioned parquet files at output_path, one directory per centroid, each containing documents assigned to that cluster with their IDs, embeddings, and optional metadata fields
|
Usage Examples
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
# Configure the KMeans clustering stage
kmeans_stage = KMeansStage(
n_clusters=1000,
id_field="doc_id",
embedding_field="embedding",
input_path="/data/embeddings/",
output_path="/data/clustered/",
metadata_fields=["text", "source"],
verbose=True,
embedding_dim=768,
input_filetype="parquet",
max_iter=300,
tol=1e-4,
random_state=42,
init="k-means||",
oversampling_factor=2.0,
max_samples_per_batch=32768,
)
# Execute the stage within a pipeline
# The stage reads from input_path and writes partitioned output to output_path
kmeans_stage.run()
# Minimal configuration example
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
kmeans_stage = KMeansStage(
n_clusters=500,
id_field="id",
embedding_field="emb",
input_path="/data/docs.parquet",
output_path="/data/kmeans_output/",
)
Related Pages
- Principle:NVIDIA_NeMo_Curator_KMeans_Clustering_for_Embeddings
- Implementation:NVIDIA_NeMo_Curator_PairwiseStage
- Implementation:NVIDIA_NeMo_Curator_Semantic_IdentifyDuplicatesStage
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- Environment:NVIDIA_NeMo_Curator_RAPIDS_GPU_Stack
- Environment:NVIDIA_NeMo_Curator_Ray_Cluster
- Heuristic:NVIDIA_NeMo_Curator_Semantic_Dedup_Cluster_Sizing