Principle: RAPIDS cuML Cluster Label Assignment
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Clustering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Cluster label assignment maps each data point to a discrete cluster identifier (and optionally a confidence score). The assignment strategy varies fundamentally across algorithm families: hard inductive assignment (KMeans), transductive-only labeling (DBSCAN), and soft probabilistic assignment with approximate inference (HDBSCAN).
Description
After a clustering model has been fitted, the next critical step is assigning cluster labels to data points. Different algorithms support different modes of label assignment:
Hard Inductive Assignment (KMeans): KMeans supports true out-of-sample prediction. Given a fitted model with k centroids, any new data point is assigned to the cluster whose centroid is nearest:
label(x) = argmin_j ||x - mu_j||^2
This is a hard assignment (each point belongs to exactly one cluster) and it is inductive (it generalizes to new, unseen data without refitting). The predict operation is independent of the training data and depends only on the learned centroids.
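A minimal sketch of this behavior, using scikit-learn's CPU KMeans as a stand-in (cuml.KMeans exposes the same fit/predict interface; the data and parameters here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans  # cuml.KMeans mirrors this interface

rng = np.random.default_rng(0)
# Two well-separated blobs as training data
X_train = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# Inductive hard assignment: new points are labeled using only the centroids
X_new = np.array([[0.1, -0.2], [5.2, 4.9]])
labels = km.predict(X_new)

# Equivalent manual computation: label(x) = argmin_j ||x - mu_j||^2
dists = ((X_new[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual = dists.argmin(axis=1)
assert (labels == manual).all()
```

Note that `predict` never touches `X_train` again; the two points land in different clusters purely by centroid proximity.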
Transductive-Only Labeling (DBSCAN): DBSCAN does not have a separate predict method. Labels are produced only during fitting through density connectivity analysis. The fit_predict operation is the standard interface: it fits the model and returns labels in a single step. Points that are not density-reachable from any core point receive the noise label (-1). Since DBSCAN does not learn a parametric model (there are no stored centroids or decision boundaries), applying it to new data requires refitting on the combined dataset.
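The transductive pattern can be sketched with scikit-learn's CPU DBSCAN, which cuml.DBSCAN follows (the synthetic data and eps/min_samples values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN  # cuml.DBSCAN follows the same interface

rng = np.random.default_rng(1)
blob = rng.normal(0, 0.2, (40, 2))
outlier = np.array([[10.0, 10.0]])  # isolated point, not density-reachable
X = np.vstack([blob, outlier])

# fit_predict is the standard interface: fit and label in a single step
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
assert labels[-1] == -1  # the isolated point receives the noise label

# There is no predict(): labeling new data means refitting on the union
X_new = rng.normal(0, 0.2, (5, 2))
labels_combined = DBSCAN(eps=0.5, min_samples=5).fit_predict(np.vstack([X, X_new]))
```

The refit on the combined dataset may also change the labels of the original points, since density connectivity depends on every point present.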
Soft Probabilistic + Approximate Inference (HDBSCAN): HDBSCAN produces both hard labels and soft membership probabilities during fitting. For new unseen data, it supports approximate prediction through the approximate_predict function. This uses cached prediction data (the condensed tree, core distances, and cluster exemplars) to assign new points to existing clusters by:
- Computing distances from the new point to cluster exemplars.
- Determining the best matching cluster based on the lambda value at which the point would fall out of each cluster.
- Returning both a predicted label and a confidence probability.
This is approximate because it assigns new points to the existing cluster structure without rebuilding the hierarchy. The fit_predict interface provides the standard transductive pathway.
Usage
Label assignment is used whenever cluster memberships are needed:
- KMeans.predict: Assign new data points to existing clusters after fitting. Suitable for streaming, batch inference, or train/test splits.
- DBSCAN.fit_predict: Obtain cluster labels for a dataset in one step. Use when the goal is to label the training data itself and no out-of-sample prediction is needed.
- HDBSCAN.fit_predict: Obtain cluster labels and probabilities for the training data.
- HDBSCAN.approximate_predict: Assign new points to an existing HDBSCAN clustering without refitting. Requires that prediction_data=True was set during fitting or that generate_prediction_data() was called post-fit.
Theoretical Basis
KMeans Hard Assignment: The prediction function computes a distance matrix between the new points and all k centroids, then takes the argmin along the centroid axis. The computational cost is O(n_new * k * d). On GPU, this is implemented as a batched matrix operation.
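A NumPy sketch of the batched formulation: expanding ||x - mu||^2 = ||x||^2 - 2 x.mu + ||mu||^2 shows the ||x||^2 term is constant per row, so the argmin needs only one matrix product plus the centroid norms (the helper name is illustrative, not a cuML API):

```python
import numpy as np

def assign_labels(X_new, centroids):
    """Nearest-centroid assignment via a single matrix multiply.

    The cross term X_new @ centroids.T dominates the cost at
    O(n_new * k * d); the per-row ||x||^2 term is dropped because it
    does not affect the argmin over centroids.
    """
    cross = X_new @ centroids.T                  # shape (n_new, k)
    cnorm = (centroids ** 2).sum(axis=1)         # shape (k,)
    return (cnorm[None, :] - 2.0 * cross).argmin(axis=1)

centroids = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X_new = np.array([[0.5, -0.3], [9.7, 10.2], [0.1, 9.8]])
labels = assign_labels(X_new, centroids)
assert labels.tolist() == [0, 1, 2]
```

Casting the distance computation as one matrix product is what makes the GPU implementation a batched (GEMM-style) operation.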
DBSCAN Transductive Labeling: DBSCAN labels are a by-product of the density connectivity graph traversal during fitting. The label of a point is determined by which connected component of core points it is density-reachable from. There is no closed-form assignment function for new points because the graph structure depends on the entire dataset.
HDBSCAN Approximate Prediction: The approximate_predict algorithm works as follows:
For each new point x_new:
1. Compute mutual reachability distances to all cluster exemplars.
2. For each candidate cluster c:
lambda_c(x_new) = 1 / d_mreach(x_new, nearest_exemplar_in_c)
3. Find the cluster c* with the highest lambda at which x_new
would still be a member (based on the condensed tree structure).
4. label(x_new) = c*
5. prob(x_new) = min(lambda_c*(x_new) / lambda_max(c*), 1.0)
Points that do not fit into any cluster at any persistence level are assigned the noise label (-1) with probability 0.
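The steps above can be sketched in NumPy. This is a deliberately simplified model: plain Euclidean distance stands in for the mutual reachability distance, a flat noise threshold stands in for the condensed-tree membership check, and the function name and inputs are hypothetical, not the hdbscan/cuML API:

```python
import numpy as np

def approximate_predict_sketch(x_new, exemplars, lambda_max, noise_lambda=0.05):
    """Exemplar-based sketch of approximate prediction.

    exemplars:  dict mapping cluster id -> (n_i, d) array of exemplar points
    lambda_max: dict mapping cluster id -> maximum lambda of that cluster
    Returns (label, probability); (-1, 0.0) for noise.
    """
    best_label, best_lambda = -1, 0.0
    for c, pts in exemplars.items():
        d = np.linalg.norm(pts - x_new, axis=1).min()  # nearest exemplar in c
        lam = 1.0 / max(d, 1e-12)                      # lambda_c(x_new)
        if lam > best_lambda:
            best_label, best_lambda = c, lam
    if best_lambda < noise_lambda:                     # too far from every cluster
        return -1, 0.0
    prob = min(best_lambda / lambda_max[best_label], 1.0)
    return best_label, prob

exemplars = {0: np.array([[0.0, 0.0], [0.2, 0.1]]),
             1: np.array([[5.0, 5.0], [5.1, 4.9]])}
lambda_max = {0: 10.0, 1: 10.0}

label, prob = approximate_predict_sketch(np.array([5.05, 5.0]), exemplars, lambda_max)
assert label == 1 and 0.0 < prob <= 1.0
noise = approximate_predict_sketch(np.array([100.0, 100.0]), exemplars, lambda_max)
assert noise == (-1, 0.0)
```

The key property the sketch preserves is that assignment consults only cached per-cluster data (exemplars and lambda values), never the full training set, which is what makes the real approximate_predict cheap relative to refitting.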