Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Implementation:NVIDIA NeMo Curator KMeansStage

From Leeroopedia
Implementation Metadata
Knowledge Sources
Domains
Last Updated 2026-02-14 17:00 GMT

Overview

KMeansStage is a composite pipeline stage that performs GPU-accelerated KMeans clustering on document embeddings, partitioning them by centroid assignment to enable efficient within-cluster pairwise similarity computation.

Description

KMeansStage is implemented as a CompositeStage[_EmptyTask, _EmptyTask] that orchestrates the full KMeans clustering workflow. It reads embedding vectors from parquet or JSONL files, performs distributed KMeans clustering using RAPIDS cuML with RAFT across multiple GPUs, and writes the results as centroid-partitioned parquet files. Each output partition directory (output_path/centroid=N/) contains the documents assigned to that centroid, including their IDs, embeddings, and optional metadata fields.

The stage supports configurable initialization strategies (k-means|| or random), convergence parameters (max_iter, tol), and batching controls (max_samples_per_batch) to manage GPU memory usage for large datasets.

Usage

KMeansStage is used as the first stage in the Semantic Deduplication pipeline, after embeddings have been precomputed. It is typically followed by PairwiseStage which computes pairwise similarity within each cluster.

Code Reference

Source Location

nemo_curator/stages/deduplication/semantic/kmeans.py, lines 293-379.

Signature

@dataclass
class KMeansStage(CompositeStage[_EmptyTask, _EmptyTask]):
    n_clusters: int
    id_field: str
    embedding_field: str
    input_path: str | list[str]
    output_path: str
    metadata_fields: list[str] | None = None
    verbose: bool = False
    embedding_dim: int | None = None
    input_filetype: Literal["jsonl", "parquet"] = "parquet"
    max_iter: int = 300
    tol: float = 1e-4
    random_state: int = 42
    init: Literal["k-means||", "random"] = "k-means||"
    oversampling_factor: float = 2.0
    max_samples_per_batch: int = 32768

Import

from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage

I/O Contract

Direction Type Description
Input _EmptyTask No explicit input task; reads embeddings from input_path (parquet or JSONL files containing id_field and embedding_field columns)
Output _EmptyTask No explicit output task; side effect writes centroid-partitioned parquet to output_path/centroid=N/ directories
Side Effects Disk I/O Creates partitioned parquet files at output_path, one directory per centroid, each containing documents assigned to that cluster with their IDs, embeddings, and optional metadata fields

Usage Examples

from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage

# Configure the KMeans clustering stage
kmeans_stage = KMeansStage(
    n_clusters=1000,
    id_field="doc_id",
    embedding_field="embedding",
    input_path="/data/embeddings/",
    output_path="/data/clustered/",
    metadata_fields=["text", "source"],
    verbose=True,
    embedding_dim=768,
    input_filetype="parquet",
    max_iter=300,
    tol=1e-4,
    random_state=42,
    init="k-means||",
    oversampling_factor=2.0,
    max_samples_per_batch=32768,
)

# Execute the stage within a pipeline
# The stage reads from input_path and writes partitioned output to output_path
kmeans_stage.run()
# Minimal configuration example
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage

kmeans_stage = KMeansStage(
    n_clusters=500,
    id_field="id",
    embedding_field="emb",
    input_path="/data/docs.parquet",
    output_path="/data/kmeans_output/",
)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment