
Principle:Rapidsai Cuml Cluster Model Fitting

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Clustering, GPU_Computing
Last Updated 2026-02-08 00:00 GMT

Overview

Cluster model fitting is the computational process of discovering the latent grouping structure in data by iteratively optimizing cluster assignments on GPU hardware using algorithm-specific optimization strategies.

Description

Model fitting is the core computational step in any clustering workflow. Each algorithm family employs a distinct optimization strategy executed entirely on the GPU:

KMeans (Lloyd's Algorithm): Fitting proceeds through an iterative expectation-maximization loop:

  1. Assignment step (E-step): Every data point is assigned to the nearest centroid using pairwise distance computation.
  2. Update step (M-step): Each centroid is recomputed as the mean of all points assigned to it.
  3. Convergence check: If the maximum centroid displacement falls below the tolerance threshold, or the maximum number of iterations is reached, the algorithm terminates.

On GPU, the assignment step is a highly parallel batched distance computation, and the update step uses parallel reduction to compute means. Sample weights can optionally scale each point's contribution to centroid updates.
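The E/M loop above can be sketched on CPU with NumPy (a minimal illustration, not cuML's implementation; the function name, random initialization, and defaults here are assumptions of this sketch):

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-4, sample_weight=None, seed=0):
    # Minimal CPU sketch of Lloyd's algorithm; cuML runs the same
    # E/M steps as batched GPU kernels.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = np.ones(n) if sample_weight is None else np.asarray(sample_weight, dtype=float)
    centroids = X[rng.choice(n, size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # E-step: squared distance to every centroid, nearest wins
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # M-step: (optionally weighted) mean of each centroid's points
        new_centroids = centroids.copy()
        for j in range(k):
            mask = labels == j
            if mask.any():
                new_centroids[j] = np.average(X[mask], axis=0, weights=w[mask])
        # Convergence: largest centroid displacement below tolerance
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids
```

On GPU the double loop over points and centroids becomes a single batched distance kernel, but the control flow is the same.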

DBSCAN (Density Connectivity): Fitting is a single-pass graph traversal:

  1. Neighborhood computation: For each point, compute the eps-neighborhood using the chosen distance metric. This is the most computationally expensive step and benefits from batched GPU pairwise distance computation.
  2. Core point identification: Points with at least min_samples neighbors are marked as core points.
  3. Cluster expansion: Starting from each unvisited core point, perform breadth-first expansion through density-connected points. All density-reachable points receive the same cluster label. Points not reachable from any core point are labeled as noise (-1).
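The three phases can be sketched in NumPy (an illustrative CPU version; cuML computes the distance matrix in GPU batches rather than all at once, and the function name is invented here):

```python
import numpy as np
from collections import deque

def dbscan_sketch(X, eps, min_samples):
    n = len(X)
    # 1) eps-neighborhoods from the pairwise distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    # 2) core points have at least min_samples neighbors (self included)
    core = np.array([len(nb) >= min_samples for nb in neighbors])
    # 3) BFS expansion: grow a cluster from each unvisited core point;
    #    border points get labeled but are never expanded
    labels = np.full(n, -1)  # -1 = noise
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        queue = deque([i])
        labels[i] = cluster
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:
                        queue.append(q)
        cluster += 1
    return labels
```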

HDBSCAN (Mutual Reachability + Condensed Tree): Fitting involves multiple phases:

  1. Core distance computation: For each point, compute the distance to its min_samples-th nearest neighbor.
  2. Mutual reachability graph: Construct a weighted graph where edge weights are the mutual reachability distance.
  3. Minimum spanning tree: Compute the MST of the mutual reachability graph.
  4. Condensed cluster tree: Build a hierarchical tree from the MST and condense it by removing spurious splits below min_cluster_size.
  5. Cluster extraction: Apply the chosen selection method (EOM or leaf) to extract flat clusters, assigning labels and membership probabilities.
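Phases 1 and 2 can be sketched as follows (illustrative only; the function name is invented here, and the sketch assumes the convention that a point does not count as its own neighbor):

```python
import numpy as np

def mutual_reachability(X, min_samples):
    # Phase 1: core distance = distance to the min_samples-th nearest
    # neighbor (sorted row column 0 is the point itself at distance 0)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    core = np.sort(d, axis=1)[:, min_samples]
    # Phase 2: d_mreach(x, y) = max(core(x), core(y), d(x, y))
    mreach = np.maximum(d, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return core, mreach
```

The MST, condensed tree, and cluster extraction then operate on this matrix; inflating small distances toward the core distances is what makes the subsequent hierarchy robust to noise.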

Usage

Model fitting should be invoked after data preparation and algorithm configuration are complete. It is the step that actually discovers clusters in the data. Fitting is required before any prediction, evaluation, or visualization can occur.

Theoretical Basis

KMeans Convergence: Lloyd's algorithm is guaranteed to converge in a finite number of steps: the objective function (inertia) never increases between iterations, and there are only finitely many possible partitions of the data. However, it may converge to a local minimum, which is why multiple initializations (n_init) are run and the lowest-inertia result is kept.
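The monotone-decrease argument can be checked empirically with a toy NumPy version of the Lloyd iteration (the data, k, and initialization below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
k = 4
centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
inertias = []
for _ in range(20):
    # assignment step, recording the objective it achieves
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertias.append(d2[np.arange(len(X)), labels].sum())
    # update step
    for j in range(k):
        if (labels == j).any():
            centroids[j] = X[labels == j].mean(axis=0)

# each assignment/update pair can only lower (or keep) the objective
assert all(b <= a + 1e-9 for a, b in zip(inertias, inertias[1:]))
```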

The per-iteration complexity is O(n * k * d) where n is the number of samples, k is the number of clusters, and d is the feature dimensionality. On GPU, the dominant cost is the batched pairwise distance matrix of size (batch_size * k).

DBSCAN Complexity: With a brute-force neighborhood search, DBSCAN has O(n^2) complexity in the distance computation phase. The batched GPU implementation reduces wall-clock time by computing the distance matrix in tiles of configurable size (max_mbytes_per_batch). The cluster expansion phase is O(n * min_samples) in the worst case.
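Some rough arithmetic shows why tiling is necessary (the 2 GiB budget below is an assumption for illustration, not cuML's actual batching heuristic):

```python
# Illustrative tile-size arithmetic for a batched distance matrix.
# (The 2 GiB budget is an assumption, not cuML's default heuristic.)
n = 1_000_000                      # number of points
full_bytes = n * n * 4             # full float32 distance matrix
budget_bytes = 2 * 1024**3         # per-tile memory budget
bytes_per_row = n * 4              # one float32 distance per column
batch_rows = budget_bytes // bytes_per_row
print(full_bytes, batch_rows)      # full matrix is ~4 TB; tiles stay small
```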

HDBSCAN Complexity: The dominant cost is the k-nearest-neighbor graph construction, which is O(n^2 * d) for brute force or O(n * log(n) * d) for tree-based methods. The MST computation is O(E * log(V)) on the mutual reachability graph. The condensed tree extraction and cluster selection are O(n) operations.

KMeans Fitting:
    repeat:
        labels[i] = argmin_j ||x_i - mu_j||^2   for all i   (GPU parallel)
        mu_j = mean(x_i : labels[i] == j)        for all j   (GPU reduction)
    until convergence or max_iter

DBSCAN Fitting:
    for each point p:
        N_eps(p) = {q : d(p, q) <= eps}           (GPU batched distances)
    core_points = {p : |N_eps(p)| >= min_samples}
    BFS expansion from each core point             (label propagation)

HDBSCAN Fitting:
    core_dist(x) = d(x, kth-nearest-neighbor)     (GPU kNN)
    d_mreach(x,y) = max(core_dist(x), core_dist(y), d(x,y))
    MST = minimum_spanning_tree(mutual_reachability_graph)
    condensed_tree = condense(MST, min_cluster_size)
    labels, probs = extract_clusters(condensed_tree, method)
