Implementation:Rapidsai Cuml KMeans DBSCAN HDBSCAN Fit
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Clustering, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for fitting KMeans, DBSCAN, and HDBSCAN clustering models on GPU using the cuML library.
Description
These `fit()` methods execute the core clustering algorithms on GPU:
- KMeans.fit runs Lloyd's iterative algorithm on GPU via RAFT C++ kernels. Each iteration assigns every sample to its nearest centroid and then recomputes each centroid as the mean of its assigned samples, repeating until convergence (governed by `tol`) or until `max_iter` iterations are reached.
- DBSCAN.fit computes pairwise distances in batches (batch memory is capped by `max_mbytes_per_batch`), identifies core samples (samples with at least `min_samples` neighbors within distance `eps`), and expands clusters from them via GPU-parallel neighborhood queries.
- HDBSCAN.fit builds a k-nearest-neighbor (KNN) graph (brute force or NN Descent), constructs the mutual reachability graph, builds its minimum spanning tree, extracts the condensed tree, and selects clusters with either the Excess of Mass (EOM) or the leaf method.
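The assignment and update steps described for KMeans.fit can be sketched on the CPU with NumPy. This is only an illustration of Lloyd's iteration, not cuML's GPU implementation: the function name `lloyd_kmeans`, the `init` argument, the random fallback initialization, and the stopping rule based on total centroid shift are all assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(X, n_clusters, init=None, max_iter=300, tol=1e-4, seed=0):
    """Illustrative Lloyd's algorithm (hypothetical helper, not cuML code)."""
    X = np.asarray(X, dtype=np.float64)
    if init is None:
        # Assumption: start from randomly chosen distinct samples.
        rng = np.random.default_rng(seed)
        init = X[rng.choice(X.shape[0], size=n_clusters, replace=False)]
    centers = np.array(init, dtype=np.float64)

    for n_iter in range(1, max_iter + 1):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)

        # Update step: each centroid becomes the mean of its assigned samples.
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if members.shape[0] > 0:  # keep the old centroid if a cluster is empty
                new_centers[k] = members.mean(axis=0)

        # Assumed stopping rule: total centroid movement at or below tol.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift <= tol:
            break

    # Final assignment and inertia (sum of squared distances to nearest centroid).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = float(d2.min(axis=1).sum())
    return centers, labels, inertia, n_iter
```

The four returned values mirror the roles of the fitted attributes `cluster_centers_`, `labels_`, `inertia_`, and `n_iter_`, although the exact convergence test and iteration count in cuML may differ from this sketch.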
Usage
Call `estimator.fit(X)` after constructing and configuring the clustering estimator. The input `X` should be a 2D array-like of shape (n_samples, n_features), for example a CuPy or NumPy array or a cuDF DataFrame.
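As a small illustration of the expected input shape, the sketch below coerces raw data into a 2D float32 matrix before it would be handed to `fit`. It uses NumPy so that it runs without a GPU. The helper name `as_feature_matrix` is hypothetical, and treating a 1D input as a single feature column is an assumption of this example, not cuML behavior.

```python
import numpy as np

def as_feature_matrix(data, dtype=np.float32):
    """Hypothetical helper: coerce input to a 2D (n_samples, n_features) array."""
    X = np.asarray(data, dtype=dtype)
    if X.ndim == 1:
        # Assumption: a 1D input is one feature observed for many samples.
        X = X.reshape(-1, 1)
    if X.ndim != 2:
        raise ValueError(f"expected a 2D array, got {X.ndim} dimensions")
    # A contiguous layout avoids an extra copy in many array consumers.
    return np.ascontiguousarray(X)
```

Passing float32 data up front keeps the dtype handling in `fit` (see `convert_dtype` in the I/O Contract) from having to do any work.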
Code Reference
KMeans.fit
Source Location
- Repository: cuML
- File: python/cuml/cuml/cluster/kmeans.pyx
- Lines: 512-592
Signature
def fit(self, X, y=None, sample_weight=None, *, convert_dtype=True):
DBSCAN.fit
Source Location
- Repository: cuML
- File: python/cuml/cuml/cluster/dbscan.pyx
- Lines: 301-487
Signature
def fit(self, X, y=None, sample_weight=None, *, out_dtype='int32', convert_dtype=True):
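The batched distance computation described for DBSCAN.fit can be pictured with a CPU-only NumPy sketch of core-sample detection. It is not cuML's kernel code: the function name `core_sample_mask` is hypothetical, the `batch_size` row count stands in for the memory cap that `max_mbytes_per_batch` expresses in megabytes, and counting a point as its own neighbor follows the scikit-learn convention, assumed here.

```python
import numpy as np

def core_sample_mask(X, eps, min_samples, batch_size=1024):
    """Illustrative batched core-sample detection (not cuML's kernel code)."""
    X = np.asarray(X, dtype=np.float64)
    n = X.shape[0]
    counts = np.zeros(n, dtype=np.int64)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Distances from this batch of rows to every sample: shape (stop - start, n).
        diff = X[start:stop, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=2))
        # Neighborhood size, counting the point itself (assumed convention).
        counts[start:stop] = (dist <= eps).sum(axis=1)
    # A sample is a core sample if its eps-neighborhood is large enough.
    return counts >= min_samples
```

Only one `(batch, n_samples)` block of distances exists at a time, so peak memory scales with the batch size, not with the full n_samples × n_samples matrix. The result does not depend on the batch size chosen.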
HDBSCAN.fit
Source Location
- Repository: cuML
- File: python/cuml/cuml/cluster/hdbscan/hdbscan.pyx
- Lines: 918-1068
Signature
def fit(self, X, y=None, *, convert_dtype=True):
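The mutual reachability graph that HDBSCAN.fit constructs rests on a simple formula: the mutual reachability distance between samples a and b is the maximum of the core distance of a, the core distance of b, and the plain distance d(a, b). The NumPy sketch below computes it densely for a tiny dataset. It is illustrative only: the function name `mutual_reachability` is hypothetical, and defining the core distance as the distance to the k-th nearest other sample is an assumption (implementations differ on whether the point itself is counted).

```python
import numpy as np

def mutual_reachability(X, k):
    """Illustrative dense mutual reachability matrix (not cuML code)."""
    X = np.asarray(X, dtype=np.float64)
    # Full pairwise Euclidean distance matrix, shape (n, n).
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # After sorting each row, column 0 is the zero self-distance, so
    # column k is the distance to the k-th nearest other sample (assumed convention).
    core = np.sort(dist, axis=1)[:, k]
    # mreach(a, b) = max(core(a), core(b), d(a, b)).
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return core, mreach
```

Samples in sparse regions have large core distances, so their mutual reachability distances to everything else are inflated. This is what makes the subsequent minimum spanning tree less sensitive to noise than one built on raw distances.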
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| X | array-like | Yes | Feature matrix of shape (n_samples, n_features), float32 or float64. |
| y | array-like | No (ignored) | Not used. Present for API compatibility. |
| sample_weight | array-like | No (KMeans/DBSCAN only) | Per-sample weights for weighted clustering. |
| convert_dtype | bool | No (default True) | Auto-convert input to float32 if needed. |
| out_dtype | str | No (DBSCAN only, default 'int32') | Label dtype: 'int32' or 'int64'. |
Outputs
| Name | Type | Description |
|---|---|---|
| self | estimator | Returns fitted estimator with computed attributes. |
| KMeans attributes | — | `cluster_centers_`, `labels_`, `inertia_`, `n_iter_` |
| DBSCAN attributes | — | `labels_`, `core_sample_indices_`, `components_` |
| HDBSCAN attributes | — | `labels_`, `probabilities_`, `cluster_persistence_`, `n_clusters_` |
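
Once a model is fitted, `labels_` is usually the first attribute inspected. For DBSCAN and HDBSCAN, the label -1 marks noise samples that belong to no cluster. The sketch below summarizes a label array on the host with NumPy. The helper name `summarize_labels` is hypothetical, and it assumes the labels have already been copied to host memory (for a CuPy array, for instance, with `cupy.asnumpy`).

```python
import numpy as np

def summarize_labels(labels):
    """Hypothetical helper: count clusters and noise in a host label array."""
    labels = np.asarray(labels)
    # -1 is the noise label used by DBSCAN and HDBSCAN.
    n_noise = int((labels == -1).sum())
    ids, sizes = np.unique(labels[labels != -1], return_counts=True)
    return {
        "n_clusters": int(ids.size),
        "n_noise": n_noise,
        "sizes": dict(zip(ids.tolist(), sizes.tolist())),
    }
```

KMeans assigns every sample to a cluster, so for its labels the noise count from this helper is always zero.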
Usage Examples
```python
import cupy as cp
from cuml.cluster import KMeans, DBSCAN, HDBSCAN

# Synthetic data: 10,000 samples with 50 features, allocated on the GPU
X = cp.random.rand(10000, 50, dtype=cp.float32)

# KMeans fitting
kmeans = KMeans(n_clusters=10)
kmeans.fit(X)
print(kmeans.inertia_, kmeans.n_iter_)

# DBSCAN fitting
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X)
print(dbscan.labels_)  # -1 marks noise samples

# HDBSCAN fitting
hdbscan = HDBSCAN(min_cluster_size=25)
hdbscan.fit(X)
print(hdbscan.n_clusters_, hdbscan.probabilities_)
```
Related Pages
Implements Principle
Requires Environment
Uses Heuristic