Principle: Online ML / River Incremental KMeans Clustering
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River Docs; Sequential k-Means Clustering; Web-scale k-means clustering (Sculley, 2010) | Online Clustering, Unsupervised Learning, Streaming Algorithms | 2026-02-08 16:00 GMT |
Overview
Incremental K-Means Clustering is an online variant of the K-Means algorithm that incrementally updates cluster centroids as each new data point arrives, without requiring multiple passes over the data.
Description
Classical batch K-Means (Lloyd's algorithm) assigns all data points to cluster centers and then repositions the centers, repeating this process over multiple iterations. This requires storing the entire dataset and making multiple passes, which is infeasible in a streaming setting where data arrives continuously and cannot be stored.
Incremental K-Means solves this by processing one data point at a time. It maintains k cluster centers (centroids). When a new point arrives, the algorithm: (1) finds the nearest center using the Minkowski distance, and (2) moves that center toward the new point by a fraction controlled by the halflife parameter. This creates an exponential moving average effect where each center gradually converges toward the mean of its assigned points.
The halflife parameter (between 0 and 1) determines the learning rate: a higher value means the center moves more aggressively toward new points, while a lower value makes the center more stable and resistant to individual observations. Cluster centers are initialized randomly from a normal distribution parameterized by mu and sigma.
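The assign-then-nudge loop described above can be sketched in a few lines of pure Python. This is an illustrative sketch only, not River's implementation; the class name is hypothetical, and learn_one / predict_one merely echo River's naming conventions. Samples are dicts mapping feature names to values.

```python
import random


class IncrementalKMeans:
    """Minimal sketch of sequential (MacQueen-style) k-means."""

    def __init__(self, n_clusters=2, halflife=0.5, mu=0.0, sigma=1.0, p=2, seed=None):
        self.halflife = halflife
        self.mu, self.sigma, self.p = mu, sigma, p
        self.rng = random.Random(seed)
        self.centers = [{} for _ in range(n_clusters)]

    def _distance(self, x, center):
        # Minkowski distance raised to the power p; the 1/p root is
        # skipped because it does not change which center is closest.
        # Unseen feature coordinates are lazily drawn from N(mu, sigma).
        for j in x:
            if j not in center:
                center[j] = self.rng.gauss(self.mu, self.sigma)
        return sum(abs(x[j] - center[j]) ** self.p for j in x)

    def predict_one(self, x):
        # Assignment step: index of the nearest center.
        return min(range(len(self.centers)),
                   key=lambda i: self._distance(x, self.centers[i]))

    def learn_one(self, x):
        # Update step: nudge the closest center toward x by `halflife`.
        center = self.centers[self.predict_one(x)]
        for j, xj in x.items():
            center[j] += self.halflife * (xj - center[j])
        return self
```

Feeding an alternating stream of points from two well-separated regions drives the centers apart, one settling near the mean of each region.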
Usage
Use Incremental K-Means when:
- You need a simple, fast online clustering algorithm with a fixed number of clusters.
- Data arrives in a stream and you cannot store or re-process past observations.
- You want an algorithm with intuitive hyperparameters (number of clusters and learning rate).
- The clusters are roughly spherical (convex) in shape.
Incremental K-Means is not well-suited for discovering clusters of arbitrary shape or when the number of clusters is unknown.
Theoretical Basis
The update rule for Incremental K-Means is derived from MacQueen's sequential k-means (1967):
Given a new observation x and current cluster centers C = {c_0, c_1, ..., c_{k-1}}:
1. ASSIGNMENT:
closest = argmin_i d(x, c_i)
where d(a, b) = SUM_j |a_j - b_j|^p (Minkowski distance with power p; the 1/p root is omitted, since it does not change which center is the argmin)
2. UPDATE:
FOR each feature j in x:
c_closest[j] = c_closest[j] + halflife * (x[j] - c_closest[j])
The update rule is equivalent to an exponentially weighted moving average:
c_new = (1 - halflife) * c_old + halflife * x
This can also be written as:
c_new = c_old + halflife * (x - c_old)
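The algebraic equivalence of these two forms can be checked numerically. The toy values below (one feature dimension, halflife = 0.5) are chosen only for illustration:

```python
halflife = 0.5   # assumed learning rate for this example
c_old, x = 2.0, 6.0  # current center coordinate and incoming observation

# Exponentially weighted moving average form
ewma_form = (1 - halflife) * c_old + halflife * x

# Incremental step form
step_form = c_old + halflife * (x - c_old)

# Both forms yield the same updated coordinate: 4.0
```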
Properties:
- When halflife = 0, centers never move (degenerate case).
- When halflife = 1, each center jumps directly to the new point (no memory).
- Moderate values (e.g., 0.5) provide a balance between adaptability and stability.
- The Minkowski distance with p=2 gives standard Euclidean distance; p=1 gives Manhattan distance.
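The effect of the power p can be seen with a small standalone helper (written here in the textbook form that includes the 1/p root; this is not River's internal distance function):

```python
def minkowski(a, b, p):
    """Minkowski distance between two equal-length vectors,
    including the final 1/p root."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


# p=2: Euclidean distance of the 3-4-5 right triangle
print(minkowski([0, 0], [3, 4], 2))  # 5.0
# p=1: Manhattan distance (sum of coordinate differences)
print(minkowski([0, 0], [3, 4], 1))  # 7.0
```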
Initialization:
Each center is initialized lazily: the first time a feature dimension is encountered, its coordinate is drawn from a normal distribution N(mu, sigma). In practice this is implemented with a defaultdict whose default factory is a Gaussian random generator.
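This lazy per-feature initialization can be sketched with the standard library (an illustrative sketch, not River's exact code; the feature name and seed are arbitrary):

```python
import random
from collections import defaultdict
from functools import partial

rng = random.Random(42)  # fixed seed for reproducibility
mu, sigma = 0.0, 1.0     # assumed initialization parameters

# The first time a feature key is read, the defaultdict fills it with a
# fresh draw from N(mu, sigma); subsequent reads return the stored value.
center = defaultdict(partial(rng.gauss, mu, sigma))

first = center["temperature"]          # feature encountered for the first time
assert center["temperature"] == first  # coordinate is now fixed for this center
```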