Principle:Online ml River CluStream Clustering
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River River Docs A Framework for Clustering Evolving Data Streams (Aggarwal et al., 2003) | Online Clustering, Temporal Micro-Clusters, Streaming Algorithms | 2026-02-08 16:00 GMT |
Overview
CluStream is an online clustering framework that maintains temporal micro-clusters for streaming data, combining an online micro-cluster maintenance phase with a periodic offline macro-clustering phase using incremental K-Means.
Description
CluStream was introduced by Aggarwal et al. (2003) as a general framework for clustering evolving data streams. It separates the clustering problem into two phases:
Online Phase: The algorithm maintains a fixed budget of max_micro_clusters micro-cluster summaries. Each micro-cluster is a temporal extension of a cluster feature vector (CF), tracking the number of observations, weighted mean, weighted variance (via Welford's algorithm), and a timestamp distribution. When a new point arrives, the algorithm finds the closest micro-cluster. If the point falls within that micro-cluster's radius (defined as micro_cluster_r_factor times the RMS deviation), it is absorbed. Otherwise, space must be freed: either an old micro-cluster (whose relevance stamp falls below the time window threshold) is deleted, or the two closest micro-clusters are merged.
Offline Phase: Every time_gap time steps, the algorithm extracts the centers of all current micro-clusters and applies an incremental K-Means algorithm on those centers to produce n_macro_clusters final clusters. This periodic reclustering ensures the macro-cluster solution stays current without being triggered on every single observation.
The temporal aspect is key: micro-clusters track when points arrived, not just where. The relevance stamp uses the mean and standard deviation of timestamps to determine if a micro-cluster is still active within the current time_window.
Usage
Use CluStream when:
- You need a scalable online clustering algorithm with a fixed memory budget (bounded number of micro-clusters).
- The data stream has a temporal component and you want clustering to respect recency.
- You want a two-phase approach where the online phase summarizes data and the offline phase produces interpretable clusters.
- You need a fixed number of output clusters (specified by
n_macro_clusters).
CluStream provides more temporal awareness than basic online K-Means but assumes the number of macro-clusters is known in advance.
Theoretical Basis
Cluster Feature Vector (Temporal Extension):
Each micro-cluster maintains:
CF = {
n: number of absorbed points (weighted count)
var_x: incremental variance per feature dimension (Welford's algorithm)
var_time: incremental variance of timestamps
}
Center and Radius:
center(mc) = {k: var_x[k].mean for k in features}
deviation(mc) = (SUM_k sqrt(var_x[k].variance)) / num_features
radius(mc) = deviation(mc) * micro_cluster_r_factor
Relevance Stamp (for temporal pruning):
mu_time = var_time.mean
sigma_time = sqrt(var_time.variance)
IF n < 2 * max_micro_clusters:
relevance_stamp = mu_time
ELSE:
relevance_stamp = mu_time + sigma_time * quantile(max_micro_clusters / (2 * n))
Where the quantile function is computed using the inverse error function approximation.
Online Update Algorithm:
FOR each new point p at timestamp t:
closest_mc, closest_dist = find_closest_micro_cluster(p)
IF closest_mc has only 1 point:
radius = min distance to any other micro-cluster center
ELSE:
radius = closest_mc.deviation * micro_cluster_r_factor
IF closest_dist < radius:
closest_mc.insert(p, w, t) // absorb into micro-cluster
ELSE:
// Free space for new micro-cluster:
threshold = t - time_window
IF any micro-cluster has relevance_stamp < threshold:
REPLACE that micro-cluster with new one from p
ELSE:
MERGE two closest micro-clusters
CREATE new micro-cluster from p
// Periodic macro-clustering
IF t mod time_gap == time_gap - 1:
Extract micro-cluster centers
Apply KMeans(n_macro_clusters) on centers
Update macro-cluster centers