Principle:DistrictDataLabs Yellowbrick Silhouette Analysis

Knowledge Sources	Yellowbrick Docs Yellowbrick Rousseeuw 1987
Domains	Machine_Learning, Clustering, Model_Evaluation
Last Updated	2026-02-08 00:00 GMT

Overview

Silhouette Analysis is a method for evaluating the quality of a clustering by measuring how similar each data point is to its own cluster compared to the nearest neighboring cluster.

Description

Silhouette Analysis, introduced by Peter Rousseeuw in 1987, provides both a per-sample and an aggregate measure of clustering quality. The silhouette coefficient for each sample quantifies two properties: cohesion (how tightly a point is bound to its own cluster) and separation (how distinct a point is from the nearest alternative cluster). The resulting score ranges from -1 to +1, where values near +1 indicate that a sample is well-matched to its own cluster and poorly matched to neighboring clusters, values near 0 indicate the sample lies on the boundary between two clusters, and negative values suggest the sample may have been assigned to the wrong cluster.

The per-sample nature of the silhouette coefficient is what makes silhouette analysis particularly powerful as a visual diagnostic. When the silhouette values of each cluster are sorted and plotted as horizontal bars (a "silhouette plot"), the shape of each cluster's silhouette reveals important structural properties. Silhouettes whose bars are long and fairly uniform indicate cohesive, well-separated clusters. Silhouettes with short bars or many negative values indicate problematic assignments. The mean silhouette score across all samples serves as an overall quality summary.

Silhouette analysis is especially useful for comparing different values of k. By generating silhouette plots for several cluster counts, an analyst can assess not just the aggregate score but also the uniformity and balance of the resulting clusters, something that aggregate metrics like distortion or the Calinski-Harabasz index cannot reveal.
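As a minimal sketch of such a comparison, the snippet below scores two candidate partitions of a small one-dimensional dataset and reports both the mean score and a per-cluster breakdown. The data, the hand-written label lists, and the helper name silhouette_1d are illustrative assumptions; in practice the labels would come from a clustering algorithm fitted once per value of k.

```python
from statistics import mean

def silhouette_1d(values, labels):
    """Per-sample silhouette for 1-D data with d(i, j) = |x_i - x_j|."""
    groups = {}
    for x, lab in zip(values, labels):
        groups.setdefault(lab, []).append(x)
    scores = []
    for x, lab in zip(values, labels):
        own = groups[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # The distance from x to itself is 0, so summing over the whole
        # cluster and dividing by (size - 1) gives the mean over the others.
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        b = min(mean([abs(x - y) for y in g])
                for other, g in groups.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

# Hypothetical data: three visually obvious groups on the real line.
values = [0, 1, 2, 10, 11, 12, 20, 21, 22]
# Hand-written candidate partitions, keyed by the number of clusters k.
candidates = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
}

summary = {}
for k, labels in candidates.items():
    s = silhouette_1d(values, labels)
    per_cluster = {}
    for c in sorted(set(labels)):
        members = [si for si, lab in zip(s, labels) if lab == c]
        # (size, mean score, worst score) for each cluster
        per_cluster[c] = (len(members), mean(members), min(members))
    summary[k] = {"mean": mean(s), "per_cluster": per_cluster}
    print(f"k={k}: mean silhouette = {summary[k]['mean']:.3f}")
    for c, (size, avg, worst) in per_cluster.items():
        print(f"  cluster {c}: n={size}, mean={avg:.3f}, min={worst:.3f}")
```

The per-cluster lines are the point of the exercise: for k=2 the merged cluster has a much lower minimum score than the other cluster, which a single aggregate number would hide.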

Usage

Use Silhouette Analysis when:

You want to evaluate whether a specific number of clusters produces well-separated, cohesive groups.
You need to diagnose cluster imbalance or poor cluster assignments at the individual sample level.
You are comparing multiple values of k and want richer diagnostic information than a single aggregate score.
You want to identify which specific clusters are problematic (e.g., too small, poorly cohesive, or overlapping with neighbors).

Limitations:

Computing all pairwise distances requires on the order of $n^2$ distance evaluations for $n$ samples, which can be expensive for very large datasets.
Silhouette analysis implicitly favors convex, roughly equally-sized clusters and may give misleading results for clusters with complex shapes or widely varying densities.

Theoretical Basis

Per-Sample Silhouette Coefficient

For a data point $i$ belonging to cluster $C_{I}$, define:

Intra-cluster distance $a(i)$: the mean distance from $i$ to all other points in the same cluster:

$a(i) = \frac{1}{|C_{I}| - 1} \sum_{j \in C_{I}, j \neq i} d(i, j)$

Nearest-cluster distance $b(i)$: the smallest, over all other clusters, of the mean distance from $i$ to the points of that cluster:

$b(i) = \min_{J \neq I} \frac{1}{|C_{J}|} \sum_{j \in C_{J}} d(i, j)$

The silhouette coefficient for sample $i$ is then:

$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$

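This yields values in the range $[-1, 1]$. When $i$ is the only member of its cluster ($|C_{I}| = 1$), $a(i)$ is undefined and $s(i)$ is set to 0 by convention.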
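The definitions above can be sketched directly in pure Python. This is an illustrative implementation, not Yellowbrick's code: the function name silhouette_samples, the choice of Euclidean distance for $d$, and returning 0 when $a(i) = b(i) = 0$ are assumptions made for the example.

```python
import math

def silhouette_samples(points, labels):
    """Return s(i) for every point, using Euclidean distance as d(i, j)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, own in enumerate(labels):
        members = clusters[own]
        if len(members) == 1:
            # a(i) is undefined for a singleton cluster; by convention s(i) = 0.
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j])
                for j in members if j != i) / (len(members) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for label, other in clusters.items() if label != own
        )
        denom = max(a, b)
        # Assumption: coincident points (a = b = 0) are scored as 0.
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

# Two compact, well-separated groups in the plane (made-up coordinates).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels = [0, 0, 0, 1, 1, 1]
scores = silhouette_samples(points, labels)
print([round(s, 3) for s in scores])
```

Relabeling the third point into the far cluster, for example with labels [0, 0, 1, 1, 1, 1], makes its coefficient negative, which is the "wrong cluster" signal described earlier.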

Mean Silhouette Score

The overall silhouette score for a clustering with $n$ samples is:

$\bar{s} = \frac{1}{n} \sum_{i = 1}^{n} s (i)$

This serves as a global quality indicator. In a silhouette plot, the mean score is typically shown as a vertical dashed line; clusters whose silhouettes extend beyond this line are above-average in quality, while those that fall short indicate weaker clustering.
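
As a small worked example (the numbers are chosen purely for illustration), take four points on the real line with $d(i, j) = |x_i - x_j|$, grouped into clusters $\{0, 1\}$ and $\{5, 6\}$. For the point at 0:

$a = 1, \quad b = \frac{5 + 6}{2} = 5.5, \quad s = \frac{5.5 - 1}{5.5} = \frac{9}{11} \approx 0.818$

For the point at 1:

$a = 1, \quad b = \frac{4 + 5}{2} = 4.5, \quad s = \frac{4.5 - 1}{4.5} = \frac{7}{9} \approx 0.778$

By symmetry, the points at 5 and 6 score $\frac{7}{9}$ and $\frac{9}{11}$ respectively, so the mean silhouette score is:

$\bar{s} = \frac{1}{4} \left( 2 \cdot \frac{9}{11} + 2 \cdot \frac{7}{9} \right) = \frac{79}{99} \approx 0.798$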

Visual Interpretation

In a silhouette plot:

Each cluster is represented as a horizontal bar chart of sorted per-sample silhouette values.
The length of each bar is that sample's silhouette coefficient, so the horizontal span of a cluster's silhouette shows the range of scores within the cluster.
The height of each band reflects the number of samples in that cluster.
The vertical red dashed line marks the mean silhouette score across all samples.
Bands of roughly equal height indicate balanced cluster sizes.
Clusters whose silhouettes fall entirely to the right of zero and extend past the mean line are well-formed.
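
The layout described above can be imitated in plain text to make the reading rules concrete. The sketch below is not how Yellowbrick renders its plot; the function name text_silhouette_plot, the character choices, and the sample scores are all hypothetical. Bars are sorted within each cluster, "|" marks zero, and ":" marks the mean score (drawn only when the mean is non-negative).

```python
def text_silhouette_plot(scores, labels, width=40):
    """Build a text-mode silhouette plot; return (lines, mean_score)."""
    mean_score = sum(scores) / len(scores)
    mean_col = round(mean_score * width)  # column of the mean marker
    lines = []
    for cluster in sorted(set(labels)):
        # One band per cluster, values sorted from highest to lowest.
        band = sorted((s for s, lab in zip(scores, labels) if lab == cluster),
                      reverse=True)
        lines.append(f"cluster {cluster} (n={len(band)})")
        for s in band:
            # Negative values are drawn to the left of the zero axis "|".
            left = ("#" * round(-s * width) if s < 0 else "").rjust(width)
            right = list(("#" * round(s * width) if s > 0 else "").ljust(width + 1))
            if mean_score >= 0:
                right[mean_col] = ":"  # mean marker, drawn over the bar
            lines.append(f"{left}|{''.join(right)} {s:+.2f}")
    return lines, mean_score

# Hypothetical per-sample silhouette values for three clusters.
scores = [0.82, 0.78, 0.75, 0.40, 0.10, -0.15, 0.66, 0.61]
labels = [0, 0, 0, 1, 1, 1, 2, 2]
lines, mean_score = text_silhouette_plot(scores, labels)
print("\n".join(lines))
print(f"mean silhouette score: {mean_score:.3f}")
```

In the printed output, the number of rows under each cluster heading plays the role of band height, and bars that reach past the ":" column correspond to samples scoring above the mean.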

Related Pages

Implemented By

Implementation:DistrictDataLabs_Yellowbrick_SilhouetteVisualizer

Related Principles

Principle:DistrictDataLabs_Yellowbrick_Elbow_Method_Cluster_Selection
