Workflow:Online ml River Online Clustering

From Leeroopedia


Knowledge Sources
Domains Online_ML, Clustering, Unsupervised_Learning, Streaming_Data
Last Updated 2026-02-08 16:00 GMT

Overview

End-to-end process for performing unsupervised clustering on streaming data using incremental clustering algorithms that maintain and evolve cluster assignments in real time.

Description

This workflow covers the procedure for grouping observations into clusters as they arrive one at a time, without storing the full dataset. River provides multiple online clustering algorithms: KMeans (incremental centroid updates), DBSTREAM (density-based micro-cluster maintenance), DenStream (density-based with outlier handling), CluStream (micro-clustering with macro-clustering), and STREAMKMeans (streaming variant with merge operations). The process includes feature preparation, algorithm selection and configuration, incremental cluster assignment, and evaluation using streaming cluster quality metrics.

Usage

Execute this workflow when you need to discover groups or segments in streaming data without predefined labels. Typical applications include customer segmentation on live event streams, real-time network traffic grouping, text stream topic discovery (with TextClust), and sensor data partitioning. The workflow is appropriate when the number or nature of clusters may evolve over time.

Execution Steps

Step 1: Load or Connect to an Unlabeled Data Stream

Obtain a stream of unlabeled observations for clustering. Each observation is a dictionary of numeric features. River does not provide dedicated clustering datasets, but any dataset can be used by ignoring labels. Synthetic generators or custom stream utilities (iter_csv, iter_pandas) provide data sources.

Key considerations:

  • Each observation is a Python dict of numeric features
  • Labels, if available, are only used for external evaluation (not training)
  • Use stream.iter_csv or stream.iter_pandas for custom data sources
  • Synthetic datasets with known cluster structure help validate the approach
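The stream shape described above can be sketched with the standard library alone. This is an illustration of the idea, not River's stream.iter_csv: the helper name `iter_rows`, the column names, and the `label` column are hypothetical.

```python
import csv
import io

# Hypothetical in-memory CSV; in practice this would be a file or a live source.
RAW = """x1,x2,label
1.0,2.0,a
1.2,1.8,a
8.0,9.0,b
"""

def iter_rows(text, target="label"):
    """Yield (features, label) pairs one row at a time."""
    for row in csv.DictReader(io.StringIO(text)):
        y = row.pop(target)  # set aside; used only for external evaluation
        x = {name: float(value) for name, value in row.items()}
        yield x, y

# Materialized here only to inspect the result; a real loop would consume lazily.
rows = list(iter_rows(RAW))
first_x, first_y = rows[0]
```

Each `x` is a plain dict of numeric features, which is the input shape the clustering steps below expect; the label never reaches the clusterer.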

Step 2: Preprocess Features

Apply feature scaling so that all dimensions contribute comparably to distance calculations. StandardScaler and MinMaxScaler normalize features incrementally, one observation at a time. For text data, convert each document to token counts (for example with BagOfWords) before clustering with TextClust.

Key considerations:

  • Distance-based clustering (KMeans, DBSTREAM) is sensitive to feature scales
  • StandardScaler rescales each feature to zero mean and unit variance using running statistics
  • Preprocessing is chained as a pipeline: scaler | clusterer
  • TextClust applies TF-IDF weighting internally, so it needs tokenized input (e.g., from BagOfWords) but no separate TF-IDF step
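To make the incremental scaling idea concrete, here is a minimal sketch of per-feature standardization with running statistics (Welford's method). The class `RunningScaler` is hypothetical and is not River's StandardScaler; it only shows how mean and variance can be maintained without storing past observations.

```python
import math
from collections import defaultdict

class RunningScaler:
    """Standardize each feature using a running mean and variance."""

    def __init__(self):
        self.n = defaultdict(int)
        self.mean = defaultdict(float)
        self.m2 = defaultdict(float)  # running sum of squared deviations

    def learn_one(self, x):
        for name, value in x.items():
            self.n[name] += 1
            delta = value - self.mean[name]
            self.mean[name] += delta / self.n[name]
            self.m2[name] += delta * (value - self.mean[name])

    def transform_one(self, x):
        out = {}
        for name, value in x.items():
            # Population variance; 0.0 until the feature has been seen.
            var = self.m2[name] / self.n[name] if self.n[name] else 0.0
            out[name] = (value - self.mean[name]) / math.sqrt(var) if var > 0 else 0.0
        return out

scaler = RunningScaler()
for x in ({"a": 1.0, "b": 100.0}, {"a": 3.0, "b": 300.0}):
    scaler.learn_one(x)
scaled = scaler.transform_one({"a": 3.0, "b": 300.0})
```

After scaling, features `a` and `b` land on the same scale even though their raw ranges differ by a factor of 100, which is why distance-based clusterers benefit from this step.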

Step 3: Select and Configure a Clustering Algorithm

Choose an algorithm based on cluster shape assumptions and data characteristics. KMeans suits spherical clusters with a known cluster count. DBSTREAM discovers arbitrary-shaped clusters based on density. DenStream handles noise and outliers. CluStream uses temporal micro-clusters for evolving streams.

What each algorithm provides:

  • KMeans: Fixed number of clusters; each observation pulls its closest center toward itself by a fraction set by the halflife parameter
  • DBSTREAM: Density-based, discovers cluster count automatically, uses micro-cluster merging
  • DenStream: Online DBSCAN variant with core, potential, and outlier micro-clusters
  • CluStream: Two-phase approach with online micro-clustering and offline macro-clustering
  • STREAMKMeans: Streaming k-means with chunk-based processing and centroid merging
  • TextClust: Specialized for text streams with TF-IDF micro-cluster summarization

Step 4: Perform Incremental Clustering

Process each observation through the clustering pipeline: call predict_one(x) to get the cluster assignment, then learn_one(x) to update the cluster model. Predicting before learning means each assignment comes from a model that has not yet seen that observation, which keeps evaluation honest. Cluster centers and memberships evolve continuously.

Key considerations:

  • learn_one(x) updates cluster centers or micro-clusters
  • predict_one(x) returns the assigned cluster label (integer)
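  • KMeans can predict immediately, even on the first observation, because its centers exist before any data arrives
  • KMeans draws its initial centers at random (governed by its mu, sigma, and seed parameters) rather than from the first k observations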
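The predict-then-learn loop can be sketched with a toy online k-means written in plain Python. This is not River's KMeans: the class `TinyOnlineKMeans` is hypothetical, its centers are passed in explicitly so the run is deterministic, and its `step` argument only plays a role comparable to the halflife parameter mentioned in Step 3.

```python
class TinyOnlineKMeans:
    """Toy online k-means: the closest center moves a fraction `step` toward each point."""

    def __init__(self, centers, step=0.5):
        self.centers = {i: dict(c) for i, c in enumerate(centers)}
        self.step = step

    def _sq_dist(self, x, c):
        return sum((x[k] - c.get(k, 0.0)) ** 2 for k in x)

    def predict_one(self, x):
        # Label of the nearest center (squared Euclidean distance).
        return min(self.centers, key=lambda i: self._sq_dist(x, self.centers[i]))

    def learn_one(self, x):
        c = self.centers[self.predict_one(x)]
        for k, v in x.items():
            c[k] = c.get(k, 0.0) + self.step * (v - c.get(k, 0.0))

model = TinyOnlineKMeans(centers=[{"x": 0.0}, {"x": 10.0}], step=0.5)
assignments = []
for x in ({"x": 1.0}, {"x": 9.0}, {"x": 2.0}):
    assignments.append(model.predict_one(x))  # predict with the pre-update model
    model.learn_one(x)                        # then move the closest center
```

With River, the same loop shape applies to any clusterer or `scaler | clusterer` pipeline, since both expose predict_one and learn_one.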

Step 5: Evaluate Cluster Quality

When ground truth labels are available, use external cluster evaluation metrics: Rand Index and Adjusted Rand Index compare predicted clusters to ground truth, while V-Measure, Homogeneity, and Completeness provide entropy-based evaluation. When no labels exist, use an internal metric such as the Silhouette coefficient, which measures intra-cluster cohesion versus inter-cluster separation. All of these metrics update incrementally.

Key considerations:

  • Silhouette score ranges from -1 to 1; higher is better
  • Rand Index is the fraction of observation pairs on which two clusterings agree; it ranges from 0 to 1
  • Use progressive evaluation: predict first, evaluate, then learn
  • Internal metrics (Silhouette) work without ground truth
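As an illustration of how an external metric can update incrementally, the sketch below maintains a Rand Index from running contingency counts. The class `StreamingRand` is hypothetical and is not River's implementation; it only shows that each update needs constant work and no stored observations.

```python
from collections import Counter

class StreamingRand:
    """Rand Index computed from running contingency counts."""

    def __init__(self):
        self.n = 0
        self.pairs = Counter()  # (true label, predicted label) counts
        self.true = Counter()
        self.pred = Counter()

    def update(self, y_true, y_pred):
        self.n += 1
        self.pairs[(y_true, y_pred)] += 1
        self.true[y_true] += 1
        self.pred[y_pred] += 1

    @staticmethod
    def _choose2(counts):
        return sum(c * (c - 1) // 2 for c in counts)

    def get(self):
        total = self.n * (self.n - 1) // 2
        if total == 0:
            return 0.0
        same_both = self._choose2(self.pairs.values())
        same_true = self._choose2(self.true.values())
        same_pred = self._choose2(self.pred.values())
        # Agreements = pairs grouped together in both + pairs separated in both.
        agreements = total + 2 * same_both - same_true - same_pred
        return agreements / total

rand = StreamingRand()
for y_true, y_pred in [("a", 0), ("a", 0), ("b", 1), ("b", 0)]:
    rand.update(y_true, y_pred)
score = rand.get()
```

In a progressive evaluation loop, `update` would be called with the prediction made before the corresponding learn_one call.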

Step 6: Monitor Cluster Evolution

Track how clusters change over time. Inspect the number of active clusters, cluster centers, and cluster sizes. For DBSTREAM and DenStream, monitor the micro-cluster lifecycle (creation, merging, and removal). Adjust algorithm parameters (e.g., DBSTREAM's clustering_threshold or DenStream's epsilon) based on observed behavior.

Key considerations:

  • KMeans cluster centers are accessible via model.centers
  • DBSTREAM automatically manages cluster count based on density
  • Algorithms with decay or fading (e.g., DBSTREAM, DenStream) down-weight old observations, so stale clusters fade out over time
  • Time-windowed evaluation shows clustering quality trends
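One simple way to watch cluster evolution is to count assignments over a sliding window of recent observations. The sketch below uses only the standard library; the class `WindowedClusterSizes` and the window size are hypothetical choices for illustration.

```python
from collections import Counter, deque

class WindowedClusterSizes:
    """Count cluster assignments over the most recent `size` observations."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.counts = Counter()

    def update(self, label):
        self.window.append(label)
        self.counts[label] += 1
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]  # cluster no longer active in the window

    @property
    def n_active(self):
        return len(self.counts)

monitor = WindowedClusterSizes(size=3)
history = []
for label in [0, 0, 1, 1, 1, 2]:
    monitor.update(label)
    history.append(monitor.n_active)
```

Feeding the monitor with the labels returned by predict_one gives a running view of how many clusters are active and how large each is, which can guide threshold adjustments.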
