Implementation:Online ml River Cluster KMeans
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River; River Docs; Sequential k-Means Clustering; Web-scale k-means clustering (Sculley, 2010) | Online Clustering, Unsupervised Learning | 2026-02-08 16:00 GMT |
Overview
Concrete tool for performing incremental K-Means clustering in a streaming setting, updating cluster centroids one observation at a time using an exponential moving average controlled by a halflife parameter.
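To make the "exponential moving average" wording concrete, the recurrence below unrolls the update for a single center. The notation is an assumption introduced here, not taken from the library: \alpha stands for halflife, c_t for the center after its t-th update, and x_t for the observation that triggered that update. It covers only the updates in which this particular center was the nearest one.

```latex
\begin{aligned}
c_t &= c_{t-1} + \alpha\,(x_t - c_{t-1}) = (1-\alpha)\,c_{t-1} + \alpha\,x_t \\
    &= (1-\alpha)^{t}\,c_0 + \alpha \sum_{j=1}^{t} (1-\alpha)^{\,t-j}\,x_j
\end{aligned}
```

Under this reading, the weight of an older observation shrinks by a factor of (1 - \alpha) with every later update, so larger halflife values forget the past faster.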
Description
The cluster.KMeans class implements online K-Means clustering. It maintains n_clusters cluster centers stored in the centers dictionary. For each incoming observation, it finds the nearest center using the Minkowski distance of order p and moves that center toward the observation by the fraction given by halflife. Centers are initialized lazily: each center is a defaultdict whose coordinates are drawn from a Gaussian distribution N(mu, sigma) on first access, so new feature dimensions are handled automatically.
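The update step described above can be sketched in plain Python. This is a minimal illustration and not River's source: minkowski_distance and update_nearest are hypothetical helpers written for this page, centers are plain dicts, and a feature missing from a center is treated as 0.0 instead of being drawn from a Gaussian.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) ** p for k in keys) ** (1 / p)


def update_nearest(centers, x, halflife=0.5, p=2):
    """Move the center closest to x toward x and return that center's id."""
    closest = min(centers, key=lambda c: minkowski_distance(centers[c], x, p))
    for name, value in x.items():
        old = centers[closest].get(name, 0.0)  # assumption: missing coordinate = 0.0
        centers[closest][name] = old + halflife * (value - old)
    return closest


centers = {0: {"x": 0.0, "y": 0.0}, 1: {"x": 10.0, "y": 10.0}}
winner = update_nearest(centers, {"x": 2.0, "y": 4.0}, halflife=0.5)
print(winner, centers[winner])
# 0 {'x': 1.0, 'y': 2.0}
```

With halflife=0.5 the winning center lands exactly halfway between its old position and the observation. The other center is left untouched.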
The class provides three main methods: learn_one(x) to update the model, predict_one(x) to assign a cluster, and learn_predict_one(x), which combines both operations in a single pass.
Usage
Import cluster.KMeans when you need a fast, simple online clustering algorithm with a predetermined number of clusters. It is suitable for streaming data where observations arrive one at a time and you need immediate cluster assignments.
Code Reference
Source Location
river/cluster/k_means.py:L12-L134
Signature
class KMeans(base.Clusterer):
    def __init__(
        self,
        n_clusters=5,
        halflife=0.5,
        mu=0,
        sigma=1,
        p=2,
        seed: int | None = None
    )
Import
from river import cluster
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| n_clusters | 5 | Maximum number of clusters to assign. |
| halflife | 0.5 | Fraction of the remaining distance by which the nearest cluster center moves toward each new observation. A value between 0 and 1. |
| mu | 0 | Mean of the normal distribution used to initialize cluster positions. |
| sigma | 1 | Standard deviation of the normal distribution used to initialize cluster positions. |
| p | 2 | Power parameter for the Minkowski distance (2 = Euclidean, 1 = Manhattan). |
| seed | None | Random seed for reproducible initial centroid positions. |
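How mu, sigma, and seed interact with lazy initialization can be sketched with the standard library alone. make_centers is a hypothetical helper for illustration. It assumes each center is a defaultdict whose factory draws from a seeded Gaussian, as the Description states, and River's actual constructor may differ in detail.

```python
import collections
import functools
import random


def make_centers(n_clusters, mu=0.0, sigma=1.0, seed=None):
    """Build empty centers whose coordinates are drawn from N(mu, sigma) on first access."""
    rng = random.Random(seed)                       # one seeded generator shared by all centers
    draw = functools.partial(rng.gauss, mu, sigma)  # zero-argument factory for defaultdict
    return {i: collections.defaultdict(draw) for i in range(n_clusters)}


centers = make_centers(n_clusters=2, mu=0.0, sigma=3.0, seed=42)
print(dict(centers[0]))               # {} since nothing has been drawn yet
first = centers[0]["height"]          # first access draws and stores a coordinate
print(centers[0]["height"] == first)  # True: later reads return the stored value
```

Because coordinates are created only when a feature name is first looked up, the set of features does not need to be known in advance. Fixing seed makes the drawn starting positions repeatable for a given order of accesses.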
Methods
| Method | Signature | Description |
|---|---|---|
| learn_one | learn_one(x: dict) -> None | Updates the nearest cluster center toward the observation x. |
| predict_one | predict_one(x: dict) -> int | Returns the index of the nearest cluster center. |
| learn_predict_one | learn_predict_one(x: dict) -> int | Combined learn and predict in a single pass (more efficient than calling both separately). |
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| x | dict | A dictionary mapping feature names to numeric values. |
Outputs
| Output | Type | Description |
|---|---|---|
| predict_one return | int | The cluster index (0 to n_clusters-1) of the nearest center. |
| centers attribute | dict[int, defaultdict] | A dictionary mapping cluster IDs to their centroid positions (defaultdicts of feature values). |
Usage Examples
from river import cluster
from river import stream
X = [
    [1, 2],
    [1, 4],
    [1, 0],
    [-4, 2],
    [-4, 4],
    [-4, 0]
]
k_means = cluster.KMeans(n_clusters=2, halflife=0.1, sigma=3, seed=42)
# iter_array yields (features, target) pairs; features are dicts keyed by column index
for i, (x, _) in enumerate(stream.iter_array(X)):
    k_means.learn_one(x)
    print(f'{X[i]} is assigned to cluster {k_means.predict_one(x)}')
# [1, 2] is assigned to cluster 1
# [1, 4] is assigned to cluster 1
# [1, 0] is assigned to cluster 0
# [-4, 2] is assigned to cluster 1
# [-4, 4] is assigned to cluster 1
# [-4, 0] is assigned to cluster 0
k_means.predict_one({0: 0, 1: 0})
# 0
k_means.predict_one({0: 4, 1: 4})
# 1
Inspecting cluster centers:
from river import cluster
k_means = cluster.KMeans(n_clusters=3, halflife=0.5, seed=0)
# Learn from a couple of points
k_means.learn_one({'x': 1.0, 'y': 2.0})
k_means.learn_one({'x': 5.0, 'y': 6.0})
# Access the cluster centers
for cluster_id, center in k_means.centers.items():
    print(f'Cluster {cluster_id}: {dict(center)}')