Principle: Online ML / River Incremental KMeans Clustering
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River Docs; Sequential k-Means Clustering; Web-scale k-means clustering (Sculley, 2010) | Online Clustering, Unsupervised Learning, Streaming Algorithms | 2026-02-08 16:00 GMT |
Overview
Incremental K-Means Clustering is an online variant of the K-Means algorithm that incrementally updates cluster centroids as each new data point arrives, without requiring multiple passes over the data.
Description
Classical batch K-Means (Lloyd's algorithm) assigns all data points to cluster centers and then repositions the centers, repeating this process over multiple iterations. This requires storing the entire dataset and making multiple passes, which is infeasible in a streaming setting where data arrives continuously and cannot be stored.
Incremental K-Means solves this by processing one data point at a time. It maintains k cluster centers (centroids). When a new point arrives, the algorithm: (1) finds the nearest center using the Minkowski distance, and (2) moves that center toward the new point by a fraction controlled by the halflife parameter. This creates an exponential moving average effect where each center gradually converges toward the mean of its assigned points.
The halflife parameter (between 0 and 1) determines the learning rate: a higher value means the center moves more aggressively toward new points, while a lower value makes the center more stable and resistant to individual observations. Cluster centers are initialized randomly from a normal distribution parameterized by mu and sigma.
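The assign-then-nudge loop described above can be sketched in a few lines of pure Python. This is an illustrative sketch only, not River's implementation; the class name is hypothetical, and learn_one / predict_one merely echo River's naming conventions. Samples are dicts mapping feature names to values.

```python
import random


class IncrementalKMeans:
    """Minimal sketch of sequential (MacQueen-style) k-means."""

    def __init__(self, n_clusters=2, halflife=0.5, mu=0.0, sigma=1.0, p=2, seed=None):
        self.halflife = halflife
        self.mu, self.sigma, self.p = mu, sigma, p
        self.rng = random.Random(seed)
        self.centers = [{} for _ in range(n_clusters)]

    def _distance(self, x, center):
        # Minkowski distance raised to the power p; the 1/p root is
        # skipped because it does not change which center is closest.
        # Unseen feature coordinates are lazily drawn from N(mu, sigma).
        for j in x:
            if j not in center:
                center[j] = self.rng.gauss(self.mu, self.sigma)
        return sum(abs(x[j] - center[j]) ** self.p for j in x)

    def predict_one(self, x):
        # Assignment step: index of the nearest center.
        return min(range(len(self.centers)),
                   key=lambda i: self._distance(x, self.centers[i]))

    def learn_one(self, x):
        # Update step: nudge the closest center toward x by `halflife`.
        center = self.centers[self.predict_one(x)]
        for j, xj in x.items():
            center[j] += self.halflife * (xj - center[j])
        return self
```

Feeding an alternating stream of points from two well-separated regions drives the centers apart, one settling near the mean of each region.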
Usage
Use Incremental K-Means when:
- You need a simple, fast online clustering algorithm with a fixed number of clusters.
- Data arrives in a stream and you cannot store or re-process past observations.
- You want an algorithm with intuitive hyperparameters (number of clusters and learning rate).
- The clusters are roughly spherical (convex) in shape.
Incremental K-Means is not well-suited for discovering clusters of arbitrary shape or when the number of clusters is unknown.
Theoretical Basis
The update rule for Incremental K-Means is derived from MacQueen's sequential k-means (1967):
Given a new observation x and current cluster centers C = {c_0, c_1, ..., c_{k-1}}:
1. ASSIGNMENT:
closest = argmin_i d(x, c_i)
where d(a, b) = SUM_j |a_j - b_j|^p (Minkowski distance with power p; the 1/p root is omitted, since it does not change which center is the argmin)
2. UPDATE:
FOR each feature j in x:
c_closest[j] = c_closest[j] + halflife * (x[j] - c_closest[j])
The update rule is equivalent to an exponentially weighted moving average:
c_new = (1 - halflife) * c_old + halflife * x
This can also be written as:
c_new = c_old + halflife * (x - c_old)
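The algebraic equivalence of these two forms can be checked numerically. The toy values below (one feature dimension, halflife = 0.5) are chosen only for illustration:

```python
halflife = 0.5   # assumed learning rate for this example
c_old, x = 2.0, 6.0  # current center coordinate and incoming observation

# Exponentially weighted moving average form
ewma_form = (1 - halflife) * c_old + halflife * x

# Incremental step form
step_form = c_old + halflife * (x - c_old)

# Both forms yield the same updated coordinate: 4.0
```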
Properties:
- When halflife = 0, centers never move (degenerate case).
- When halflife = 1, each center jumps directly to the new point (no memory).
- Moderate values (e.g., 0.5) provide a balance between adaptability and stability.
- The Minkowski distance with p=2 gives standard Euclidean distance; p=1 gives Manhattan distance.
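The effect of the power p can be seen with a small standalone helper (written here in the textbook form that includes the 1/p root; this is not River's internal distance function):

```python
def minkowski(a, b, p):
    """Minkowski distance between two equal-length vectors,
    including the final 1/p root."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


# p=2: Euclidean distance of the 3-4-5 right triangle
print(minkowski([0, 0], [3, 4], 2))  # 5.0
# p=1: Manhattan distance (sum of coordinate differences)
print(minkowski([0, 0], [3, 4], 1))  # 7.0
```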
Initialization:
Each center is initialized lazily: the first time a feature dimension is encountered, its coordinate is drawn from a normal distribution N(mu, sigma). In practice this is implemented with a defaultdict whose default factory is a Gaussian random generator.
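This lazy per-feature initialization can be sketched with the standard library (an illustrative sketch, not River's exact code; the feature name and seed are arbitrary):

```python
import random
from collections import defaultdict
from functools import partial

rng = random.Random(42)  # fixed seed for reproducibility
mu, sigma = 0.0, 1.0     # assumed initialization parameters

# The first time a feature key is read, the defaultdict fills it with a
# fresh draw from N(mu, sigma); subsequent reads return the stored value.
center = defaultdict(partial(rng.gauss, mu, sigma))

first = center["temperature"]          # feature encountered for the first time
assert center["temperature"] == first  # coordinate is now fixed for this center
```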