Principle: RAPIDS cuML Clustering Algorithm Configuration
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Clustering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Clustering algorithm configuration is the process of selecting and parameterizing the appropriate clustering strategy -- centroid-based, density-based, or hierarchical density-based -- to match the geometric and statistical properties of a dataset.
Description
Different clustering algorithms embody fundamentally different assumptions about the structure of data. Proper configuration requires understanding these assumptions and choosing hyperparameters accordingly:
Centroid-Based Clustering (KMeans): Assumes clusters are convex, roughly spherical, and of similar size. The primary hyperparameters are:
- n_clusters (k): The number of clusters to find. Must be specified in advance.
- init: The initialization strategy for centroids (e.g., scalable k-means++, random). Good initialization reduces the risk of convergence to poor local optima.
- n_init: Number of independent runs with different initializations. The run with the lowest inertia is selected.
- max_iter: Maximum number of Lloyd's iterations before stopping.
- tol: Convergence tolerance on centroid movement.
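The roles of max_iter, tol, and inertia can be made concrete with a minimal NumPy sketch of Lloyd's iteration (a toy illustration for clarity, not cuML's GPU implementation):

```python
import numpy as np

def lloyd_kmeans(X, centroids, max_iter=300, tol=1e-4):
    """Minimal Lloyd's iteration: max_iter bounds the loop and tol
    stops it once centroid movement is negligible. Toy sketch only."""
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0)
                        for j in range(len(centroids))])
        # Convergence test: total centroid movement below tol.
        if np.linalg.norm(new - centroids) < tol:
            centroids = new
            break
        centroids = new
    # Inertia: within-cluster sum of squared distances (WCSS).
    inertia = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, inertia

# Two well-separated pairs of points; initialize one centroid per pair.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
init = np.array([[0.0, 0.0], [5.0, 5.0]])
centroids, labels, inertia = lloyd_kmeans(X, init.copy())
```

With this data the loop converges in two iterations, and lowering tol or raising max_iter changes only how long that convergence check runs, not the final partition.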
Density-Based Clustering (DBSCAN): Assumes clusters are regions of high point density separated by regions of low density. No assumption about cluster shape or count is needed. Key hyperparameters are:
- eps (epsilon): The maximum distance between two points for them to be considered neighbors. This defines the neighborhood radius.
- min_samples: The minimum number of points within eps-distance required for a point to be a core point and form a dense region.
- metric: The distance metric used for neighbor queries (euclidean, cosine, precomputed).
- algorithm: The nearest-neighbor search method (brute force, random ball cover).
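The interaction between eps and min_samples can be sketched in a few lines of NumPy: a point is a core point when its eps-neighborhood (conventionally including the point itself) holds at least min_samples points. This is an illustrative brute-force sketch, not cuML's GPU implementation:

```python
import numpy as np

def core_points(X, eps, min_samples):
    """Boolean mask of core points: points with at least min_samples
    neighbors (including themselves) within distance eps."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return (d <= eps).sum(axis=1) >= min_samples

# Two dense pairs plus one isolated point.
X = np.array([[0.0, 0.0], [0.2, 0.0],
              [5.0, 5.0], [5.2, 5.0],
              [10.0, 10.0]])
mask = core_points(X, eps=0.5, min_samples=2)
```

Here the four paired points are core points while the isolated point is not; shrinking eps below 0.2 or raising min_samples above 2 would demote the pairs to noise as well.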
Hierarchical Density-Based Clustering (HDBSCAN): Extends DBSCAN by building a hierarchy of clusterings across all density levels, then extracting the most persistent clusters. Key hyperparameters are:
- min_cluster_size: The smallest number of points that can constitute a cluster. Controls the granularity of detected clusters.
- min_samples: Controls how conservative the clustering is; higher values declare more points as noise and restrict clusters to progressively denser regions, at the risk of discarding small clusters.
- cluster_selection_method: Either 'eom' (Excess of Mass) for variable-density clusters or 'leaf' for fine-grained clusters.
- cluster_selection_epsilon: A distance threshold below which clusters are not split, enabling DBSCAN-like behavior at a given scale.
- alpha: A scaling parameter for the mutual reachability distance.
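Putting these hyperparameters together, a cuML HDBSCAN configuration might look like the following sketch (requires a CUDA-capable GPU and the cuml package; the specific values shown are illustrative, not recommendations):

```python
# Configuration sketch only; runs on a CUDA GPU with cuml installed.
from cuml.cluster import HDBSCAN

clusterer = HDBSCAN(
    min_cluster_size=25,             # smallest group treated as a cluster
    min_samples=10,                  # higher -> more conservative, more noise
    cluster_selection_method="eom",  # persistence-based ("Excess of Mass")
    cluster_selection_epsilon=0.0,   # 0 disables DBSCAN-like merging
    alpha=1.0,                       # mutual reachability scaling
)
# labels = clusterer.fit_predict(X)  # X: array-like of shape (n_samples, n_features)
```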
Usage
Algorithm configuration should be performed when:
- Starting a new clustering task and the data geometry is known or hypothesized.
- Tuning hyperparameters after initial exploratory runs.
- Switching between algorithms to compare results (e.g., KMeans for speed vs. HDBSCAN for robustness to noise and varying density).
Theoretical Basis
KMeans Configuration Theory: KMeans minimizes the within-cluster sum of squares (WCSS, also called inertia):
J = sum_{i=1}^{k} sum_{x in C_i} ||x - mu_i||^2
where mu_i is the centroid of cluster C_i. The choice of k directly determines the partitioning granularity. The k-means++ initialization samples initial centroids with probability proportional to squared distance from existing centroids, provably achieving an O(log k) approximation to the optimal WCSS.
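The D^2 sampling step behind k-means++ can be sketched in NumPy (a toy illustration of the selection probabilities, not cuML's scalable variant):

```python
import numpy as np

def kmeanspp_probs(X, centroids):
    """Probability of choosing each point as the next centroid:
    proportional to its squared distance to the nearest existing
    centroid (the D^2 weighting of k-means++)."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return d2 / d2.sum()

# One centroid chosen so far; the far-away point dominates the sampling.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centroids = np.array([[0.0, 0.0]])
p = kmeanspp_probs(X, centroids)
```

The point coincident with an existing centroid has probability zero, and the distant outlier at (10, 0) receives almost all of the probability mass, which is exactly the spreading behavior that yields the O(log k) guarantee.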
DBSCAN Configuration Theory: DBSCAN defines clusters through the concepts of core points, directly reachable points, and density-connected points. A point p is a core point if |N_eps(p)| >= min_samples, where N_eps(p) is the set of points within distance eps of p. The eps parameter acts as a spatial resolution dial: too small and clusters fragment, too large and distinct clusters merge. The min_samples parameter controls noise tolerance.
HDBSCAN Configuration Theory: HDBSCAN constructs a mutual reachability graph where the distance between points x and y is:
d_mreach(x, y) = max(core_dist(x), core_dist(y), d(x, y))
where core_dist(x) is the distance from x to its min_samples-th nearest neighbor. A minimum spanning tree of this graph is built, then a condensed cluster tree is extracted. The cluster_selection_method ('eom' or 'leaf') determines how flat clusters are extracted from the hierarchy.
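The mutual reachability transform above can be computed directly in NumPy (a small sketch under the common convention that a point counts as its own nearest neighbor, so core_dist is the (min_samples)-th smallest entry of its distance row; cuML's GPU implementation differs internally):

```python
import numpy as np

def mutual_reachability(X, min_samples):
    """Pairwise mutual reachability distances:
    d_mreach(x, y) = max(core_dist(x), core_dist(y), d(x, y)),
    where core_dist(x) is the distance from x to its min_samples-th
    nearest neighbor, counting x itself."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Sorted row k gives the k-th nearest neighbor distances; index
    # min_samples - 1 because each point is its own 1st neighbor.
    core = np.sort(d, axis=1)[:, min_samples - 1]
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))

# Three collinear points; with min_samples=2 the core distances are
# the nearest-neighbor distances [1, 1, 2].
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
M = mutual_reachability(X, min_samples=2)
```

Note how the sparser point at (3, 0) inflates its incident distances: d(1, 2) = 2 already exceeds both core distances, so d_mreach equals the raw distance there, whereas in denser regions the core distance acts as a floor that pushes low-density points apart before the minimum spanning tree is built.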