Implementation:Scikit learn Scikit learn UnsupervisedClusterMetrics
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Clustering |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Concrete tool, provided by scikit-learn, for evaluating clustering quality when no ground truth labels are available.
Description
The unsupervised clustering metrics module provides functions to evaluate clustering quality when no ground truth labels are available. It includes the Silhouette Coefficient (which compares each sample's mean distance to its own cluster with its mean distance to the nearest other cluster), the Calinski-Harabasz Index (the ratio of between-cluster to within-cluster dispersion), and the Davies-Bouldin Index (the average, over all clusters, of each cluster's similarity to its most similar cluster, where similarity is the ratio of within-cluster scatter to the distance between centroids). These metrics rely only on the data and the predicted cluster assignments.
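To make the Silhouette Coefficient definition concrete, the sketch below recomputes it by hand and compares the result with `silhouette_samples`. The helper name `manual_silhouette` is hypothetical and exists only for illustration; it assumes Euclidean distances and uses the convention that a sample in a singleton cluster scores 0. It is a teaching sketch, not how scikit-learn computes the metric internally.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

# Small synthetic dataset so the O(n^2) loop below stays cheap
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)


def manual_silhouette(X, labels):
    """Hypothetical helper: per-sample silhouette, s = (b - a) / max(a, b)."""
    D = pairwise_distances(X)  # Euclidean by default
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            # Singleton cluster: score is defined as 0
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = D[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(D[i, labels == k].mean() for k in unique if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


manual = manual_silhouette(X, labels)
reference = silhouette_samples(X, labels)
print(np.allclose(manual, reference))
```

Values near 1 mean a sample sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be assigned to the wrong cluster.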
Usage
Use these metrics when evaluating clustering results without access to ground truth labels, for comparing different numbers of clusters (k selection), or for assessing the quality of cluster separation and cohesion.
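As an illustration of k selection, the following sketch fits KMeans for several candidate cluster counts and keeps the one with the highest silhouette score. The candidate range, the synthetic dataset, and the choice of the silhouette score as the selection criterion are assumptions made for this example; the other two metrics could be substituted (remembering that a lower Davies-Bouldin Index is better).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data; the true number of blobs is not used during selection
X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)

# Candidate cluster counts (illustrative range; the metrics need at least 2 clusters)
candidate_ks = range(2, 8)

scores = {}
for k in candidate_ks:
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette is better, so take the argmax
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Best k by silhouette: {best_k}")
```

A single metric is a heuristic rather than a proof of the right k, so it is often worth checking whether the other indices agree.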
Code Reference
Source Location
- Repository: scikit-learn
- File: sklearn/metrics/cluster/_unsupervised.py
Signature
def silhouette_score(X, labels, *, metric="euclidean", sample_size=None, random_state=None, **kwds)
def silhouette_samples(X, labels, *, metric="euclidean", **kwds)
def calinski_harabasz_score(X, labels)
def davies_bouldin_score(X, labels)
Import
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
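| X | array-like or sparse matrix of shape (n_samples, n_features); for the silhouette functions with metric="precomputed", shape (n_samples, n_samples) | Yes | Feature array. The silhouette functions also accept a precomputed pairwise distance matrix when metric="precomputed"; calinski_harabasz_score and davies_bouldin_score take only a feature array |
| labels | array-like of shape (n_samples,) | Yes | Predicted cluster label for each sample; the number of distinct labels must be between 2 and n_samples - 1 (inclusive) |
| metric | str or callable | No | Distance metric for silhouette_score and silhouette_samples only (default "euclidean"); accepts any metric supported by sklearn.metrics.pairwise_distances, or "precomputed" if X is a distance matrix |
| sample_size | int or None | No | silhouette_score only: size of the random subset of samples used to compute the score (None uses all samples) |
| random_state | int, RandomState instance or None | No | silhouette_score only: seed for the random subset drawn when sample_size is not None |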
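The optional silhouette parameters can be exercised as in the sketch below, which passes a precomputed distance matrix via `metric="precomputed"` and then subsamples with `sample_size` and a fixed `random_state`. The dataset, the subsample size of 100, and the variable names are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

# Precomputed distances: X is replaced by an (n_samples, n_samples) matrix
D = pairwise_distances(X, metric="euclidean")
sil_features = silhouette_score(X, labels)
sil_precomputed = silhouette_score(D, labels, metric="precomputed")
print(f"From features:    {sil_features:.6f}")
print(f"From precomputed: {sil_precomputed:.6f}")

# Subsampling: score a random subset of 100 samples; fixing random_state
# makes the subset, and therefore the score, reproducible across calls
sil_sub_a = silhouette_score(X, labels, sample_size=100, random_state=0)
sil_sub_b = silhouette_score(X, labels, sample_size=100, random_state=0)
print(f"Subsampled (seed 0): {sil_sub_a:.6f}")
```

Precomputing the matrix can pay off when the same distances are reused across many clusterings, while subsampling trades some accuracy for speed on large datasets.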
Outputs
| Name | Type | Description |
|---|---|---|
| silhouette_score | float | Mean Silhouette Coefficient over all samples, or over the random subset if sample_size is set (range [-1, 1], higher is better; values near 0 indicate overlapping clusters) |
| silhouette_samples | ndarray of shape (n_samples,) | Silhouette Coefficient for each sample |
| calinski_harabasz_score | float | Calinski-Harabasz Index (higher is better) |
| davies_bouldin_score | float | Davies-Bouldin Index (lower is better, minimum 0) |
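The per-sample output of `silhouette_samples` can be aggregated in different ways, as the sketch below shows: its overall mean reproduces `silhouette_score`, and grouping by label gives a per-cluster view. The names `per_cluster_mean` and `n_negative` are hypothetical and used only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

# One coefficient per sample, in the same order as the rows of X
sample_sil = silhouette_samples(X, labels)

# Averaging the per-sample values gives the overall silhouette score
print(np.isclose(sample_sil.mean(), silhouette_score(X, labels)))

# Per-cluster mean highlights clusters that are less cohesive than others
per_cluster_mean = {
    int(k): float(sample_sil[labels == k].mean()) for k in np.unique(labels)
}
for k, m in per_cluster_mean.items():
    print(f"cluster {k}: mean silhouette={m:.3f}")

# Samples with a negative coefficient may be closer to another cluster
n_negative = int((sample_sil < 0).sum())
print(f"Samples with negative silhouette: {n_negative}")
```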
Usage Examples
Basic Usage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
# Generate sample data
X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)
# Cluster the data (n_init="auto" requires scikit-learn 1.2 or later)
kmeans = KMeans(n_clusters=4, random_state=42, n_init="auto")
labels = kmeans.fit_predict(X)
# Evaluate clustering quality
sil = silhouette_score(X, labels)
ch = calinski_harabasz_score(X, labels)
db = davies_bouldin_score(X, labels)
print(f"Silhouette Score: {sil:.3f}")
print(f"Calinski-Harabasz Index: {ch:.3f}")
print(f"Davies-Bouldin Index: {db:.3f}")