Implementation:Online ml River Cluster KMeans

Knowledge Sources	Domains	Last Updated
River; River Docs; Sequential k-Means Clustering; Web-scale k-means clustering (Sculley, 2010)	Online Clustering, Unsupervised Learning	2026-02-08 16:00 GMT

Overview

Concrete tool for performing incremental K-Means clustering in a streaming setting, updating cluster centroids one observation at a time using an exponential moving average controlled by a halflife parameter.

Description

The cluster.KMeans class implements online K-Means clustering. It maintains n_clusters cluster centers stored in a centers dictionary. For each incoming observation, it finds the nearest center using Minkowski distance and moves that center toward the observation by a fraction determined by the halflife parameter. Centers are initialized lazily using a defaultdict that draws from a Gaussian distribution N(mu, sigma), so new feature dimensions are handled automatically.
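The update described above can be sketched in plain Python, without River. This is a minimal illustration and not River's source: the helper names make_centers, nearest and learn_one are hypothetical, and the distance here takes the p-th root, which may differ from the library's internal helper.

```python
import random
from collections import defaultdict
from functools import partial


def make_centers(n_clusters, mu=0.0, sigma=1.0, seed=None):
    """Build lazily initialized centers (hypothetical helper, not River's API)."""
    rng = random.Random(seed)
    # Each unseen feature key draws its starting coordinate from N(mu, sigma).
    return {i: defaultdict(partial(rng.gauss, mu, sigma)) for i in range(n_clusters)}


def nearest(centers, x, p=2):
    """Return the ID of the center closest to x under Minkowski distance."""
    def distance(cluster_id):
        center = centers[cluster_id]
        return sum(abs(center[k] - v) ** p for k, v in x.items()) ** (1 / p)
    return min(centers, key=distance)


def learn_one(centers, x, halflife=0.5, p=2):
    """Move the nearest center toward x by the fraction `halflife`."""
    closest = nearest(centers, x, p)
    for k, v in x.items():
        centers[closest][k] += halflife * (v - centers[closest][k])
    return closest


# Fixed starting centers keep the arithmetic easy to follow.
centers = {0: {"x": 0.0, "y": 0.0}, 1: {"x": 10.0, "y": 10.0}}
assigned = learn_one(centers, {"x": 2.0, "y": 2.0}, halflife=0.5)
print(assigned, centers[0])  # center 0 moves halfway toward (2, 2)
```

Only the winning center changes; the other center stays where it was. Calling make_centers(n_clusters, seed=...) instead of the fixed dictionary gives the lazy Gaussian initialization, where a coordinate is drawn the first time a feature key is read.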

The class provides three main methods: learn_one(x) to update the model, predict_one(x) to assign a cluster, and learn_predict_one(x), which updates the model and returns the assigned cluster in a single pass.

Usage

Import cluster.KMeans when you need a fast, simple online clustering algorithm with a predetermined number of clusters. It is suitable for streaming data where observations arrive one at a time and you need immediate cluster assignments.

Code Reference

Source Location

river/cluster/k_means.py:L12-L134

Signature

class KMeans(base.Clusterer):
    def __init__(
        self,
        n_clusters=5,
        halflife=0.5,
        mu=0,
        sigma=1,
        p=2,
        seed: int | None = None
    )

Import

from river import cluster

Key Parameters

Parameter	Default	Description
n_clusters	5	Maximum number of clusters to assign.
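halflife	0.5	Fraction of the remaining gap by which the nearest cluster center moves toward each new observation. Values between 0 and 1 are reasonable; larger values adapt faster but react more strongly to noise.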
mu	0	Mean of the normal distribution used to initialize cluster positions.
sigma	1	Standard deviation of the normal distribution used to initialize cluster positions.
p	2	Power parameter for the Minkowski distance (2 = Euclidean, 1 = Manhattan).
seed	None	Random seed for reproducible initial centroid positions.
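To make the effect of halflife concrete, the sketch below repeatedly applies the update center += halflife * (target - center) to a single one-dimensional center. It assumes the update rule as described on this page; the function name track is hypothetical and the numbers are illustrative only.

```python
def track(start, target, halflife, steps):
    """Repeatedly apply center += halflife * (target - center)."""
    center = start
    path = []
    for _ in range(steps):
        center += halflife * (target - center)
        path.append(center)
    return path


# The same observation (8.0) arrives three times, starting from a center at 0.0.
fast = track(0.0, 8.0, halflife=0.5, steps=3)
slow = track(0.0, 8.0, halflife=0.125, steps=3)
print(fast)  # [4.0, 6.0, 7.0]
print(slow)  # [1.0, 1.875, 2.640625]
```

Under this rule the gap to the observation shrinks by a factor of (1 - halflife) per update, so halflife behaves like a learning rate: 0.5 halves the gap each step, while 0.125 closes it much more slowly.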

Methods

Method	Signature	Description
learn_one	`learn_one(x: dict) -> None`	Updates the nearest cluster center toward the observation x.
predict_one	`predict_one(x: dict) -> int`	Returns the index of the nearest cluster center.
learn_predict_one	`learn_predict_one(x: dict) -> int`	Combined learn and predict in a single pass (more efficient than calling both separately).

I/O Contract

Inputs

Parameter	Type	Description
x	`dict`	A dictionary mapping feature names to numeric values.

Outputs

Output	Type	Description
predict_one return	`int`	The cluster index (0 to n_clusters-1) of the nearest center.
centers attribute	`dict[int, defaultdict]`	A dictionary mapping cluster IDs to their centroid positions (defaultdicts of feature values).

Usage Examples

from river import cluster
from river import stream

X = [
    [1, 2],
    [1, 4],
    [1, 0],
    [-4, 2],
    [-4, 4],
    [-4, 0]
]

k_means = cluster.KMeans(n_clusters=2, halflife=0.1, sigma=3, seed=42)

for i, (x, _) in enumerate(stream.iter_array(X)):
    k_means.learn_one(x)
    print(f'{X[i]} is assigned to cluster {k_means.predict_one(x)}')
# [1, 2] is assigned to cluster 1
# [1, 4] is assigned to cluster 1
# [1, 0] is assigned to cluster 0
# [-4, 2] is assigned to cluster 1
# [-4, 4] is assigned to cluster 1
# [-4, 0] is assigned to cluster 0

k_means.predict_one({0: 0, 1: 0})
# 0

k_means.predict_one({0: 4, 1: 4})
# 1

Inspecting cluster centers:

from river import cluster

k_means = cluster.KMeans(n_clusters=3, halflife=0.5, seed=0)

# Learn a couple of points
k_means.learn_one({'x': 1.0, 'y': 2.0})
k_means.learn_one({'x': 5.0, 'y': 6.0})

# Access the cluster centers
for cluster_id, center in k_means.centers.items():
    print(f'Cluster {cluster_id}: {dict(center)}')

Related Pages

Principle:Online_ml_River_Incremental_KMeans_Clustering
