Workflow:Online ml River Online Clustering

Knowledge Sources	River, River Documentation, Online Clustering KDD22
Domains	Online_ML, Clustering, Unsupervised_Learning, Streaming_Data
Last Updated	2026-02-08 16:00 GMT

Overview

End-to-end process for performing unsupervised clustering on streaming data using incremental clustering algorithms that maintain and evolve cluster assignments in real time.

Description

This workflow covers the procedure for grouping observations into clusters as they arrive one at a time, without storing the full dataset. River provides multiple online clustering algorithms: KMeans (incremental centroid updates), DBSTREAM (density-based micro-cluster maintenance), DenStream (density-based with outlier handling), CluStream (micro-clustering with macro-clustering), and STREAMKMeans (streaming variant with merge operations). The process includes feature preparation, algorithm selection and configuration, incremental cluster assignment, and evaluation using streaming cluster quality metrics.

Usage

Execute this workflow when you need to discover groups or segments in streaming data without predefined labels. Typical applications include customer segmentation on live event streams, real-time network traffic grouping, text stream topic discovery (with TextClust), and sensor data partitioning. The workflow is appropriate when the number or nature of clusters may evolve over time.

Execution Steps

Step 1: Load or Connect to an Unlabeled Data Stream

Obtain a stream of unlabeled observations for clustering. Each observation is a dictionary of numeric features. River does not provide dedicated clustering datasets, but any dataset can be used by ignoring labels. Synthetic generators or custom stream utilities (iter_csv, iter_pandas) provide data sources.

Key considerations:

Each observation is a Python dict of numeric features
Labels, if available, are only used for external evaluation (not training)
Use stream.iter_csv or stream.iter_pandas for custom data sources
Synthetic datasets with known cluster structure help validate the approach
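
As a minimal sketch of the stream shape described above, the generator below yields (features, label) pairs drawn from two Gaussian blobs, using only the standard library. The name make_blob_stream, the feature names f1 and f2, and the blob centers are illustrative assumptions and not part of River; the (x, y) tuple layout mirrors what River's stream utilities yield.

```python
import random
from typing import Dict, Iterator, Tuple


def make_blob_stream(n: int, seed: int = 0) -> Iterator[Tuple[Dict[str, float], int]]:
    """Yield (features, label) pairs drawn from two 2-D Gaussian blobs."""
    rng = random.Random(seed)
    blob_centers = {0: (0.0, 0.0), 1: (5.0, 5.0)}  # known cluster structure
    for _ in range(n):
        label = rng.choice([0, 1])
        cx, cy = blob_centers[label]
        # Each observation is a plain dict of numeric features.
        x = {"f1": rng.gauss(cx, 0.5), "f2": rng.gauss(cy, 0.5)}
        yield x, label


for x, y in make_blob_stream(3):
    print(x, y)
```

Because the labels are known, they can be held back for external evaluation while only x is passed to the clusterer.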

Step 2: Preprocess Features

Apply feature scaling so that all dimensions contribute comparably to distance calculations. StandardScaler and MinMaxScaler normalize features incrementally, updating their statistics one observation at a time. For text data, tokenize with BagOfWords before clustering with TextClust, which maintains TF-IDF weights inside its micro-clusters.

Key considerations:

Distance-based clustering (KMeans, DBSTREAM) is sensitive to feature scales
StandardScaler rescales each feature to zero mean and unit variance
Preprocessing is chained as a pipeline: scaler | clusterer
TextClust computes TF-IDF weights internally, so it needs tokenized input (e.g., from BagOfWords) but no separate TF-IDF or scaling step

Step 3: Select and Configure a Clustering Algorithm

Choose an algorithm based on cluster shape assumptions and data characteristics. KMeans suits spherical clusters with a known cluster count. DBSTREAM discovers arbitrary-shaped clusters based on density. DenStream handles noise and outliers. CluStream uses temporal micro-clusters for evolving streams.

What each algorithm provides:

KMeans: Fixed number of clusters, incremental centroid updates in which the nearest center moves toward each new observation by a fraction set by the halflife parameter
DBSTREAM: Density-based, discovers cluster count automatically, uses micro-cluster merging
DenStream: Online DBSCAN variant with core, potential, and outlier micro-clusters
CluStream: Two-phase approach with online micro-clustering and offline macro-clustering
STREAMKMeans: Streaming k-means with chunk-based processing and centroid merging
TextClust: Specialized for text streams with TF-IDF micro-cluster summarization

Step 4: Perform Incremental Clustering

Process each observation through the clustering pipeline. For each one, call predict_one(x) to get the current cluster assignment, then learn_one(x) to update the cluster model. This predict-then-learn order means each observation is assigned before the model has been updated with it, which keeps evaluation honest. Cluster centers and memberships evolve continuously.

Key considerations:

learn_one(x) updates cluster centers or micro-clusters
predict_one(x) returns the assigned cluster label (integer)
Predictions are available immediately, even on the first observation
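KMeans draws its initial centers at random from a normal distribution (mu, sigma, and seed parameters), so early assignments are unreliable until the centers have moved toward the data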

Step 5: Evaluate Cluster Quality

When ground truth labels are available, use external cluster evaluation metrics: Rand Index and Adjusted Rand Index compare predicted clusters to ground truth, while V-Measure, Homogeneity, and Completeness provide entropy-based evaluation. Without labels, the Silhouette coefficient serves as an internal metric that weighs intra-cluster cohesion against inter-cluster separation. All of these metrics update incrementally.

Key considerations:

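The classical Silhouette score ranges from -1 to 1, with higher being better; River's streaming Silhouette is a simplified distance-ratio variant, so check the metric's bigger_is_better property before comparing values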
Rand Index measures agreement between two clusterings
Use progressive evaluation: predict first, evaluate, then learn
Internal metrics (Silhouette) work without ground truth

Step 6: Monitor Cluster Evolution

Track how clusters change over time. Inspect the number of active clusters, cluster centers, and cluster sizes. For DBSTREAM and DenStream, monitor the micro-cluster lifecycle (creation, merging, and removal). Adjust algorithm parameters (e.g., DBSTREAM's clustering_threshold or DenStream's epsilon) based on observed behavior.

Key considerations:

KMeans cluster centers are accessible via model.centers
DBSTREAM automatically manages cluster count based on density
Decaying algorithms naturally forget old patterns
Time-windowed evaluation shows clustering quality trends
