Principle:DistrictDataLabs Yellowbrick Silhouette Analysis
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Clustering, Model_Evaluation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Silhouette Analysis is a method for evaluating the quality of a clustering by measuring how similar each data point is to its own cluster compared to the nearest neighboring cluster.
Description
Silhouette Analysis, introduced by Peter Rousseeuw in 1987, provides both a per-sample and an aggregate measure of clustering quality. The silhouette coefficient for each sample quantifies two properties: cohesion (how tightly a point is bound to its own cluster) and separation (how distinct a point is from the nearest alternative cluster). The resulting score ranges from -1 to +1, where values near +1 indicate that a sample is well-matched to its own cluster and poorly matched to neighboring clusters, values near 0 indicate the sample lies on the boundary between two clusters, and negative values suggest the sample may have been assigned to the wrong cluster.
The per-sample nature of the silhouette coefficient is what makes silhouette analysis a powerful visual diagnostic. When the silhouette values within each cluster are sorted and plotted as horizontal bars (a "silhouette plot"), the shape and width of each cluster's silhouette reveal structural properties of the clustering. Wide, uniform silhouettes indicate cohesive, well-separated clusters. Narrow silhouettes, or silhouettes with many negative values, indicate problematic assignments. The mean silhouette score across all samples serves as an overall quality summary.
Silhouette analysis is especially useful for comparing different values of k, the number of clusters. By generating silhouette plots for several cluster counts, an analyst can assess not just the aggregate score but also the uniformity and balance of the resulting clusters, which single-number metrics such as distortion or the Calinski-Harabasz index cannot reveal.
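The comparison across cluster counts can be sketched as follows. This is a minimal illustration, not a prescribed workflow: it assumes scikit-learn's `KMeans` as the clustering model and synthetic data from `make_blobs` with four hand-picked, well-separated centers. The `results` dictionary and its keys are names chosen for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic data (an assumption for illustration): four tight blobs
# placed on the corners of a square.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    values = silhouette_samples(X, labels)  # one coefficient per sample
    results[k] = {
        "mean": float(values.mean()),            # aggregate score
        "sizes": np.bincount(labels).tolist(),   # cluster balance
        "negative": int((values < 0).sum()),     # likely misassigned samples
    }

# The k with the highest mean score; sizes and negative counts add the
# per-cluster detail that the mean alone hides.
best_k = max(results, key=lambda k: results[k]["mean"])

for k, r in results.items():
    print(f"k={k}: mean={r['mean']:.3f} sizes={r['sizes']} negative={r['negative']}")
print("best k by mean silhouette:", best_k)
```

Reading the cluster sizes and negative counts alongside the mean is the programmatic counterpart of inspecting one silhouette plot per value of k.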
Usage
Use Silhouette Analysis when:
- You want to evaluate whether a specific number of clusters produces well-separated, cohesive groups.
- You need to diagnose cluster imbalance or poor cluster assignments at the individual sample level.
- You are comparing multiple values of k and want richer diagnostic information than a single aggregate score.
- You want to identify which specific clusters are problematic (e.g., too small, poorly cohesive, or overlapping with neighbors).
Limitations:
- Computing all pairwise distances scales quadratically with the number of samples, which can be expensive for very large datasets.
- Silhouette analysis implicitly favors convex, roughly equally-sized clusters and may give misleading results for clusters with complex shapes or widely varying densities.
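The shape limitation can be illustrated with a small experiment. The sketch below assumes scikit-learn's `make_moons` dataset (two interleaved crescents) and compares the silhouette score of the generating labels with that of a two-cluster `KMeans` partition. The dataset and parameter values are illustrative choices, not part of the method.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Two interleaved, non-convex crescents; true_labels marks which crescent
# each point was drawn from.
X, true_labels = make_moons(n_samples=400, noise=0.05, random_state=0)

# KMeans cuts the data into two compact halves, ignoring the crescent shape.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

score_true = silhouette_score(X, true_labels)
score_kmeans = silhouette_score(X, kmeans_labels)

print(f"silhouette of the generating labels: {score_true:.3f}")
print(f"silhouette of the KMeans partition:  {score_kmeans:.3f}")
```

On data like this the compact KMeans partition tends to score higher than the labeling that follows the crescents, so a higher silhouette score does not by itself mean the more meaningful grouping.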
Theoretical Basis
Per-Sample Silhouette Coefficient
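For a data point $i$ belonging to cluster $C_I$, and with $d(i, j)$ denoting the distance between points $i$ and $j$, define:
- Intra-cluster distance $a(i)$: the mean distance from $i$ to all other points in the same cluster:
$$a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j)$$
- Nearest-cluster distance $b(i)$: the minimum mean distance from $i$ to all points in any other cluster:
$$b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)$$
The silhouette coefficient for sample $i$ is then:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
This yields values in the range $[-1, 1]$.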
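The definitions above can be turned into a short from-scratch computation. The sketch below assumes Euclidean distance and uses NumPy only. The function name `silhouette_values` is hypothetical, and setting the coefficient to 0 for a point that is alone in its cluster is a common convention adopted here rather than something stated above.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette coefficients, assuming Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    # Full pairwise distance matrix, shape (N, N).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention (assumed): singleton clusters score 0
        # Intra-cluster distance: dist[i, i] is 0, so dividing by
        # n_own - 1 averages over the *other* members only.
        a = dist[i, own].sum() / (n_own - 1)
        # Nearest-cluster distance: smallest mean distance to any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s

# Toy data: two tight groups on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
s_good = silhouette_values(X, np.array([0, 0, 1, 1]))

# Same data, but the point at 10 is grouped with the far cluster;
# its coefficient becomes negative.
s_bad = silhouette_values(X, np.array([0, 0, 0, 1]))

print(s_good)
print(s_bad)
```

The explicit distance matrix keeps the code close to the definitions but needs memory quadratic in the number of samples, so it is suitable only for small inputs.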
Mean Silhouette Score
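The overall silhouette score for a clustering with $N$ samples, where $s(i)$ is the silhouette coefficient of sample $i$, is:
$$\bar{s} = \frac{1}{N} \sum_{i=1}^{N} s(i)$$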
This serves as a global quality indicator. In a silhouette plot, the mean score is typically shown as a vertical dashed line; clusters whose silhouettes extend beyond this line are above-average in quality, while those that fall short indicate weaker clustering.
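As a quick check of this relationship, the sketch below uses scikit-learn's `silhouette_samples` and `silhouette_score` on synthetic data. The dataset, the choice of `KMeans`, and the variable names are assumptions made for illustration. It also compares each cluster against the global mean, which is what the dashed line in the plot supports visually.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data and model chosen only for illustration.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

per_sample = silhouette_samples(X, labels)  # s(i) for every sample
mean_score = silhouette_score(X, labels)    # mean of s(i) over all samples

# Per-cluster view: the cluster's own mean, and the share of its samples
# that score above the global mean.
cluster_means = {}
share_above_mean = {}
for c in np.unique(labels):
    values = per_sample[labels == c]
    cluster_means[int(c)] = float(values.mean())
    share_above_mean[int(c)] = float((values > mean_score).mean())

print(f"mean silhouette score: {mean_score:.3f}")
for c in cluster_means:
    print(f"cluster {c}: mean={cluster_means[c]:.3f}, "
          f"share above global mean={share_above_mean[c]:.2f}")
```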
Visual Interpretation
In a silhouette plot:
- Each cluster is represented as a horizontal bar chart of sorted per-sample silhouette values.
- The horizontal length of each bar is that sample's silhouette coefficient, so the width of a cluster's silhouette shows the range of scores within the cluster.
- The height of each band reflects the number of samples in that cluster.
- The vertical red dashed line marks the mean silhouette score across all samples.
- Clusters with roughly equal heights indicate balanced cluster sizes.
- Clusters whose silhouettes fall entirely to the right of zero and extend past the mean line are well-formed.
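The layout described in this list can be sketched without any plotting library. The helper below, `silhouette_bands`, is a hypothetical name introduced for this example. It takes precomputed per-sample coefficients and integer cluster labels, stacks one band per cluster, and records the two checks from the last bullet. The `gap` between bands is an arbitrary spacing choice.

```python
import numpy as np

def silhouette_bands(values, labels, gap=10):
    """Compute the geometry of a silhouette plot: one band per cluster."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    mean_score = float(values.mean())  # position of the vertical mean line

    bands = []
    y_lower = gap
    for c in np.unique(labels):
        v = np.sort(values[labels == c])  # sorted bars within the cluster
        y_upper = y_lower + len(v)        # band height = cluster size
        bands.append({
            "cluster": int(c),
            "y_range": (y_lower, y_upper),
            "values": v,
            "all_positive": bool(v.min() > 0),         # entirely right of zero
            "passes_mean": bool(v.max() > mean_score), # extends past the mean line
        })
        y_lower = y_upper + gap  # leave a gap before the next band
    return mean_score, bands

# Toy coefficients (made up for illustration): cluster 0 is well formed,
# cluster 1 is weak and contains a negative value.
values = [0.8, 0.7, 0.9, 0.2, -0.1, 0.3]
labels = [0, 0, 0, 1, 1, 1]
mean_score, bands = silhouette_bands(values, labels)

for band in bands:
    print(band["cluster"], band["y_range"], band["all_positive"], band["passes_mean"])
```

To draw the result, each band can be filled with matplotlib's `Axes.fill_betweenx` over its `y_range` and the mean marked with `Axes.axvline`. Yellowbrick's `SilhouetteVisualizer` in `yellowbrick.cluster` produces this kind of plot directly from a fitted clustering model.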