Principle:Online ml River STREAMKMeans Clustering
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River River Docs Streaming-data algorithms for high-quality clustering (O'Callaghan et al., 2002) | Online Clustering, Chunk-Based Streaming, Two-Phase Clustering | 2026-02-08 16:00 GMT |
Overview
STREAMKMeans is a two-phase streaming k-means algorithm that buffers chunks of data points and periodically applies batch k-means to merge cluster summaries, providing a trade-off between pure online and batch clustering.
Description
STREAMKMeans bridges the gap between fully online (one-point-at-a-time) and fully batch (all-data-at-once) K-Means. Rather than updating centers after every single point, it collects points in a temporary buffer of fixed size (chunk_size). When the buffer is full, a fresh incremental K-Means instance is created and applied to all points in the chunk, producing n_clusters local centroids. These local centroids are then fed into a global K-Means instance that maintains the overall cluster structure.
This approach provides higher quality cluster centers than pure online K-Means (because each chunk is clustered together) while still operating in a streaming fashion (because only one chunk needs to be in memory at a time).
The algorithm is an adaptation of the STREAM framework by O'Callaghan et al. (2002), replacing the original k-medians with LSEARCH by River's incremental K-Means.
Usage
Use STREAMKMeans when:
- You want higher clustering quality than pure online K-Means without requiring all data in memory.
- Data arrives in a stream and you can afford to buffer a small number of observations before updating.
- The number of clusters is known in advance.
- You want a simple algorithm with easily tunable parameters (chunk size and number of clusters).
The chunk_size parameter controls the quality/latency trade-off: larger chunks give better local clustering but introduce more delay before the model updates.
Theoretical Basis
The STREAMKMeans algorithm operates as follows:
INITIALIZE:
global_kmeans = KMeans(n_clusters, **kwargs)
chunk_buffer = {}
t = 0
FOR each new point p:
t = t + 1
Add p to chunk_buffer
IF chunk_buffer is full (t mod chunk_size == 0):
// Phase 1: Local clustering on the chunk
local_kmeans = KMeans(n_clusters, **kwargs)
FOR each point q in chunk_buffer:
local_kmeans.learn_one(q)
// Phase 2: Merge local centers into global model
FOR each center c in local_kmeans.centers:
global_kmeans.learn_one(c)
Clear chunk_buffer
// Global centers are always available for prediction
centers = global_kmeans.centers
Key properties:
- Memory: Only
chunk_sizepoints plus the global K-Means centers need to be stored. - Quality: Each chunk is clustered together, giving better local structure than processing one point at a time.
- Hierarchical summarization: Local centroids are treated as "super-observations" that summarize their chunk, then the global model clusters these summaries.
- Convergence: As more chunks are processed, the global centers improve because they incorporate information from all previously seen centroids.