Principle:Online ml River Streaming Silhouette
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River; River Docs; Silhouettes: a graphical aid to the interpretation and validation of cluster analysis (Rousseeuw, 1987); Machine Learning for Data Streams (Bifet et al., 2018) | Cluster Evaluation, Streaming Metrics, Unsupervised Learning | 2026-02-08 16:00 GMT |
Overview
Streaming Silhouette is an incremental computation of the Silhouette coefficient for evaluating cluster quality in streaming data without storing all past observations.
Description
The Silhouette coefficient is one of the most widely used metrics for evaluating the quality of clustering results. It measures how similar each point is to its own cluster (cohesion) compared to the nearest neighboring cluster (separation). A good clustering produces high cohesion (small intra-cluster distances) and high separation (large inter-cluster distances).
In a batch setting, the Silhouette coefficient requires pairwise distances between all data points, so its cost grows quadratically with the number of points and every point must be kept in memory. Both requirements are prohibitive for streaming data. The streaming version used in River takes a different approach: instead of computing pairwise distances between all points, it computes distances from each point to the cluster centroids. This makes the metric incremental: each new point contributes to running sums without needing to revisit past points.
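To make the cost difference concrete, the following minimal sketch counts how many distance evaluations each approach needs. The function names and the sizes `n` and `k` are illustrative assumptions, not part of River.

```python
def pairwise_distance_count(n_points: int) -> int:
    """Distances needed by the classical Silhouette: one per unordered pair."""
    return n_points * (n_points - 1) // 2


def centroid_distance_count(n_points: int, n_clusters: int) -> int:
    """Distances needed by a centroid-based variant: one per (point, center)."""
    return n_points * n_clusters


n, k = 10_000, 5  # illustrative sizes, not taken from the source
print(pairwise_distance_count(n))     # 49995000
print(centroid_distance_count(n, k))  # 50000
```

The centroid-based count grows linearly with the number of points, which is what allows the metric to be updated one observation at a time.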
The streaming Silhouette maintains two running sums:
- The cumulative distance from each point to its assigned cluster center (cohesion signal).
- The cumulative distance from each point to its second-closest cluster center (separation signal).
The ratio of the first sum to the second (cohesion over separation) provides a measure of cluster quality: lower values indicate better clustering, because points are closer to their assigned center than to the next-closest center.
Usage
Use Streaming Silhouette when:
- You need to evaluate cluster quality in an online/streaming setting.
- You are using a clustering algorithm that exposes cluster centers (e.g., cluster.KMeans, cluster.CluStream).
- You want a metric that does not require ground truth labels (fully unsupervised evaluation).
- You need an evaluation that updates incrementally without storing past observations.
Note: This metric requires at least 3 clusters to compute the second-closest center distance meaningfully.
Theoretical Basis
Classical Silhouette (batch):
For a single point i assigned to cluster C_i:
a(i) = mean distance from i to all other points in C_i (intra-cluster distance)
b(i) = min over all clusters C != C_i of mean distance from i to all points in C (nearest-cluster distance)
s(i) = (b(i) - a(i)) / max(a(i), b(i))
Overall Silhouette = mean of s(i) over all points
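The batch definition above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not River code: points are tuples of floats, distances are Euclidean, there are at least two clusters, and the function name `batch_silhouette` is hypothetical. Note that it needs every point in memory and evaluates distances between all pairs.

```python
import math
from statistics import mean


def batch_silhouette(points, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention from Rousseeuw (1987): s(i) = 0 for a singleton cluster.
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the point's own cluster.
        a = mean(math.dist(points[i], points[j]) for j in own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            mean(math.dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


# Two tight, well-separated clusters give a score close to 1.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = batch_silhouette(points, labels)
print(round(score, 4))
```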
Streaming Silhouette (River's implementation):
Instead of pairwise point distances, the streaming version uses distances to centroids:
FOR each new point x with predicted cluster y_pred and current centers:
    d_closest = distance(x, centers[y_pred])
    d_second = second-smallest distance from x to any center
    sum_closest += d_closest
    sum_second += d_second
Silhouette = sum_closest / sum_second
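The pseudocode above can be sketched in plain Python as follows. This is a minimal illustration, not River's class: the name `StreamingSilhouetteSketch`, the use of tuples for points, the dict of centers keyed by cluster id, and the Euclidean distance are all assumptions made for the example.

```python
import math


class StreamingSilhouetteSketch:
    """Hypothetical stand-in for illustration; not River's implementation."""

    def __init__(self):
        self.sum_closest = 0.0
        self.sum_second = 0.0

    def update(self, x, y_pred, centers):
        # Distance to the center the clusterer assigned this point to.
        self.sum_closest += math.dist(x, centers[y_pred])
        # Second-smallest distance from the point to any center.
        distances = sorted(math.dist(x, c) for c in centers.values())
        self.sum_second += distances[1]

    def get(self):
        # Return 0.0 before any update to avoid dividing by zero.
        return self.sum_closest / self.sum_second if self.sum_second else 0.0


centers = {0: (0.0, 0.0), 1: (10.0, 0.0), 2: (0.0, 10.0)}
stream = [((1.0, 0.0), 0), ((9.0, 0.0), 1), ((0.0, 8.0), 2)]

metric = StreamingSilhouetteSketch()
for x, y_pred in stream:
    metric.update(x, y_pred, centers)

print(round(metric.get(), 4))  # 0.1538, i.e. (1 + 1 + 2) / (9 + 9 + 8)
```

Only the two sums are kept between updates, so memory use does not depend on how many points have been seen.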
In River's implementation, lower values are better (contrary to the classical Silhouette where higher is better). This is because the metric computes the ratio a/b rather than (b-a)/max(a,b). A ratio close to 0 means points are very close to their assigned centers relative to other centers (excellent clustering), while a ratio close to 1 or above means poor separation.
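The link between the two conventions can be made explicit for a single point. Assuming b(i) > 0 and a(i) <= b(i), the classical score reduces to one minus the ratio a/b, so a small ratio corresponds to a classical score near 1. The streaming metric is a ratio of sums rather than a mean of per-point ratios, so this is an intuition for the direction of the scale, not an exact identity. The symbols d_closest and d_second follow the pseudocode above; the index t over stream observations is introduced here for illustration.

```latex
\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} = 1 - \frac{a(i)}{b(i)}
\quad \text{when } a(i) \le b(i),
\qquad
\text{Silhouette}_{\text{stream}}
= \frac{\sum_{t} d_{\text{closest}}(x_t)}{\sum_{t} d_{\text{second}}(x_t)}.
\]
```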
The metric supports both update and revert operations, allowing integration with evaluation frameworks that may need to undo an update.
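Because the state is just two running sums, undoing an update amounts to subtracting the same two distances. The sketch below illustrates that idea under stated assumptions: `RevertibleRatio` is a hypothetical helper, not River's class, and its method signatures are chosen for the example. It also assumes that revert receives the same centers that were used in the matching update.

```python
import math


class RevertibleRatio:
    """Hypothetical helper: two running sums with matching update and revert."""

    def __init__(self):
        self.sum_closest = 0.0
        self.sum_second = 0.0

    def _distances(self, x, y_pred, centers):
        ordered = sorted(math.dist(x, c) for c in centers.values())
        return math.dist(x, centers[y_pred]), ordered[1]

    def update(self, x, y_pred, centers):
        d_closest, d_second = self._distances(x, y_pred, centers)
        self.sum_closest += d_closest
        self.sum_second += d_second

    def revert(self, x, y_pred, centers):
        # Must receive the same centers that were used in the matching update.
        d_closest, d_second = self._distances(x, y_pred, centers)
        self.sum_closest -= d_closest
        self.sum_second -= d_second


centers = {0: (0.0, 0.0), 1: (10.0, 0.0), 2: (0.0, 10.0)}
ratio = RevertibleRatio()
ratio.update((1.0, 0.0), 0, centers)   # adds 1.0 and 9.0
ratio.update((0.0, 8.0), 2, centers)   # adds 2.0 and 8.0
ratio.revert((0.0, 8.0), 2, centers)   # removes the second contribution
print(ratio.sum_closest, ratio.sum_second)
```

If the centers have moved between the update and the revert, the subtracted distances no longer match the added ones, so the sums would drift; a caller that needs exact reverts would have to keep the centers it used.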