Principle:Rapidsai Cuml Clustering Evaluation

Knowledge Sources	Rousseeuw 1987 - Silhouettes: a graphical aid to the interpretation and validation of cluster analysis Hubert and Arabie 1985 - Comparing partitions cuML Docs
Domains	Machine_Learning, Clustering, Evaluation
Last Updated	2026-02-08 00:00 GMT

Overview

Clustering evaluation is the quantitative assessment of clustering quality using internal validity measures (which require only the data and labels) and external validity measures (which additionally require ground-truth labels) to determine how well an algorithm has partitioned the data.

Description

Unlike supervised learning where accuracy or loss provide direct quality measures, clustering evaluation requires specialized metrics because there is often no ground-truth labeling. Clustering evaluation metrics fall into two categories:

Internal Metrics (no ground truth needed):

Inertia (Within-Cluster Sum of Squares): Measures how tightly clustered the points are around their centroids. Lower inertia indicates more compact clusters. Defined as the sum of squared Euclidean distances from each point to its assigned centroid. Inertia is specific to centroid-based methods like KMeans and decreases monotonically as k increases, so it must be interpreted in context (e.g., using the elbow method).

Silhouette Coefficient: Measures how similar a point is to its own cluster compared to other clusters. For each sample i, the silhouette score is:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) is the mean intra-cluster distance (average distance from i to all other points in the same cluster) and b(i) is the mean nearest-cluster distance (average distance from i to all points in the nearest neighboring cluster). The silhouette coefficient ranges from -1 (misclassified) to +1 (well-clustered), with 0 indicating overlap. The overall silhouette score is the mean across all samples.

External Metrics (ground truth required):

Adjusted Rand Index (ARI): Measures the similarity between predicted and ground-truth cluster assignments, corrected for chance. The Rand Index counts the fraction of point pairs that are either in the same cluster in both labelings or in different clusters in both labelings. The Adjusted Rand Index normalizes this score so that random labelings yield an expected score of 0:

ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)

ARI ranges from -1.0 (worse than random) to 1.0 (perfect agreement), with 0.0 indicating random labeling.

Usage

Clustering evaluation is used when:

Comparing the quality of different clustering algorithms on the same dataset.
Selecting the optimal number of clusters (e.g., varying k in KMeans and plotting inertia or silhouette).
Assessing whether a clustering result is meaningful or merely reflects noise.
Validating clustering against known ground-truth labels in benchmark experiments.
Monitoring model quality over time as new data arrives.

Theoretical Basis

Inertia (KMeans Objective):

Inertia = sum_{i=1}^{n} ||x_i - mu_{label(x_i)}||^2

KMeans.score returns the negative of inertia (following the scikit-learn convention where higher scores are better). Inertia is always non-negative and decreases as k increases, reaching zero when k equals n (each point is its own cluster).

Silhouette Coefficient Computation:

The mean silhouette score requires computing all pairwise distances, making it O(n^2) in memory and computation. GPU implementations tile the pairwise distance matrix into chunks to keep memory usage manageable:

For each sample i:
    a(i) = mean distance to all points in the same cluster
    b(i) = min over other clusters c: mean distance to all points in c
    s(i) = (b(i) - a(i)) / max(a(i), b(i))
Silhouette Score = mean(s(i) for all i)

Adjusted Rand Index Computation:

The ARI is computed from the contingency table between the two labelings. Given the contingency table n_ij (number of points in true cluster i and predicted cluster j), the ARI is:

ARI = (sum_ij C(n_ij, 2) - [sum_i C(a_i, 2) * sum_j C(b_j, 2)] / C(n, 2))
      / (0.5 * [sum_i C(a_i, 2) + sum_j C(b_j, 2)] - [sum_i C(a_i, 2) * sum_j C(b_j, 2)] / C(n, 2))

where C(m, 2) = m*(m-1)/2, a_i = sum_j n_ij (row sums), and b_j = sum_i n_ij (column sums).

Related Pages

Implemented By

Implementation:Rapidsai_Cuml_Cluster_Evaluation_Metrics

Uses Heuristic

Heuristic:Rapidsai_Cuml_Float64_Kernel_Stability

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment