Workflow:Online ml River Online Clustering
| Knowledge Sources | |
|---|---|
| Domains | Online_ML, Clustering, Unsupervised_Learning, Streaming_Data |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
End-to-end process for performing unsupervised clustering on streaming data using incremental clustering algorithms that maintain and evolve cluster assignments in real time.
Description
This workflow covers the procedure for grouping observations into clusters as they arrive one at a time, without storing the full dataset. River provides multiple online clustering algorithms: KMeans (incremental centroid updates), DBSTREAM (density-based micro-cluster maintenance), DenStream (density-based with outlier handling), CluStream (micro-clustering with macro-clustering), and STREAMKMeans (streaming variant with merge operations). The process includes feature preparation, algorithm selection and configuration, incremental cluster assignment, and evaluation using streaming cluster quality metrics.
Usage
Execute this workflow when you need to discover groups or segments in streaming data without predefined labels. Typical applications include customer segmentation on live event streams, real-time network traffic grouping, text stream topic discovery (with TextClust), and sensor data partitioning. The workflow is appropriate when the number or nature of clusters may evolve over time.
Execution Steps
Step 1: Load or Connect to an Unlabeled Data Stream
Obtain a stream of unlabeled observations for clustering. Each observation is a dictionary of numeric features. River does not provide dedicated clustering datasets, but any dataset can be used by ignoring its labels. Data can come from synthetic generators or from the stream utilities (stream.iter_csv, stream.iter_pandas).
Key considerations:
- Each observation is a Python dict of numeric features
- Labels, if available, are only used for external evaluation (not training)
- Use stream.iter_csv or stream.iter_pandas for custom data sources
- Synthetic datasets with known cluster structure help validate the approach
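As a minimal sketch of what such a stream looks like, the following uses only the standard library to turn CSV rows into (features, label) pairs. The inline CSV text, the column names, and the helper iter_unlabeled are illustrative assumptions, not part of River; stream.iter_csv plays a comparable role for real files.

```python
import csv
import io

# Hypothetical inline data; a real stream would come from a file, socket, or queue.
CSV_TEXT = """x1,x2,label
1.0,2.0,a
1.2,1.9,a
8.0,9.1,b
"""


def iter_unlabeled(text, label_column="label"):
    """Yield (features, label) pairs one row at a time.

    features is a dict of floats. The label is kept separately so it can be
    used for external evaluation but is never passed to the clusterer.
    """
    for row in csv.DictReader(io.StringIO(text)):
        label = row.pop(label_column, None)
        features = {name: float(value) for name, value in row.items()}
        yield features, label


stream = list(iter_unlabeled(CSV_TEXT))
for x, y in stream:
    print(x, y)
```

Keeping the label out of the feature dict at the source makes it hard to leak ground truth into training by accident.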
Step 2: Preprocess Features
Apply feature scaling so that all dimensions contribute equally to distance calculations. StandardScaler and MinMaxScaler normalize features incrementally, updating their statistics one observation at a time. For text data, tokenize with BagOfWords before clustering with TextClust, which applies TF-IDF weighting to its micro-clusters internally.
Key considerations:
- Distance-based clustering (KMeans, DBSTREAM) is sensitive to feature scales
- StandardScaler centers each feature to zero mean and scales it to unit variance
- Preprocessing is chained as a pipeline: scaler | clusterer
- TextClust takes tokenized term counts (for example from BagOfWords) and handles TF-IDF weighting itself, so no numeric scaling step is needed for text
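To illustrate how a scaler can normalize features without storing the stream, here is a rough sketch of per-feature running statistics using Welford's method. The class RunningStandardScaler is a hypothetical stand-in written for this page; River's StandardScaler may differ in its details.

```python
import math


class RunningStandardScaler:
    """Per-feature running mean and variance (Welford's method); a sketch only."""

    def __init__(self):
        self.count = {}
        self.mean = {}
        self.m2 = {}  # running sum of squared deviations from the mean

    def learn_one(self, x):
        for name, value in x.items():
            n = self.count.get(name, 0) + 1
            mean = self.mean.get(name, 0.0)
            delta = value - mean
            mean += delta / n
            self.count[name] = n
            self.mean[name] = mean
            self.m2[name] = self.m2.get(name, 0.0) + delta * (value - mean)

    def transform_one(self, x):
        scaled = {}
        for name, value in x.items():
            n = self.count.get(name, 0)
            std = math.sqrt(self.m2[name] / n) if n else 0.0
            # Constant or unseen features have no spread; map them to 0.0.
            scaled[name] = (value - self.mean[name]) / std if std > 0 else 0.0
        return scaled


scaler = RunningStandardScaler()
for x in [{"a": 1.0, "b": 100.0}, {"a": 3.0, "b": 300.0}]:
    scaler.learn_one(x)

# Both features land on the same scale despite a 100x difference in raw units.
scaled = scaler.transform_one({"a": 3.0, "b": 300.0})
print(scaled)
```

Without this step, feature b would dominate any Euclidean distance simply because of its units.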
Step 3: Select and Configure a Clustering Algorithm
Choose an algorithm based on cluster shape assumptions and data characteristics. KMeans suits spherical clusters with a known cluster count. DBSTREAM discovers arbitrary-shaped clusters based on density. DenStream handles noise and outliers. CluStream uses temporal micro-clusters for evolving streams.
What each algorithm provides:
- KMeans: Fixed number of clusters; each update moves the closest centroid toward the new observation by a fraction set by the halflife parameter, so older observations are down-weighted exponentially
- DBSTREAM: Density-based, discovers cluster count automatically, uses micro-cluster merging
- DenStream: Online DBSCAN variant with core, potential, and outlier micro-clusters
- CluStream: Two-phase approach with online micro-clustering and offline macro-clustering
- STREAMKMeans: Streaming k-means with chunk-based processing and centroid merging
- TextClust: Specialized for text streams with TF-IDF micro-cluster summarization
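The practical difference between a fixed cluster count and a discovered one can be sketched with a toy threshold-based clusterer: a point joins the nearest micro-cluster if it lies within a radius, and otherwise starts a new one. This is a deliberately simplified illustration, not DBSTREAM or DenStream (it has no fading, merging, or outlier handling), and the class name and radius parameter are assumptions made for the example.

```python
import math


class ThresholdMicroClusters:
    """Toy density-style clusterer: the number of clusters is not fixed in advance."""

    def __init__(self, radius):
        self.radius = radius  # hypothetical: max distance at which a point joins a cluster
        self.centers = []     # one dict of feature means per micro-cluster
        self.weights = []     # number of points absorbed by each micro-cluster

    @staticmethod
    def _distance(a, b):
        keys = set(a) | set(b)
        return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

    def _nearest(self, x):
        if not self.centers:
            return None, math.inf
        distances = [self._distance(x, c) for c in self.centers]
        index = min(range(len(distances)), key=distances.__getitem__)
        return index, distances[index]

    def learn_one(self, x):
        index, distance = self._nearest(x)
        if index is None or distance > self.radius:
            # Too far from everything seen so far: open a new micro-cluster.
            self.centers.append(dict(x))
            self.weights.append(1)
            return
        # Otherwise move the nearest center toward x (running mean).
        self.weights[index] += 1
        step = 1.0 / self.weights[index]
        center = self.centers[index]
        for k, v in x.items():
            center[k] = center.get(k, 0.0) + step * (v - center.get(k, 0.0))

    def predict_one(self, x):
        index, _ = self._nearest(x)
        return 0 if index is None else index


model = ThresholdMicroClusters(radius=2.0)
points = [
    {"x": 0.0, "y": 0.0},
    {"x": 0.5, "y": 0.5},
    {"x": 10.0, "y": 10.0},
    {"x": 10.5, "y": 9.5},
]
for p in points:
    model.learn_one(p)

print(len(model.centers))  # two clusters found without specifying a count
```

The radius plays a role loosely comparable to a density threshold: too small and every point becomes its own cluster, too large and distinct groups collapse into one.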
Step 4: Perform Incremental Clustering
Process each observation through the clustering pipeline. For each one, call predict_one(x) to get the cluster assignment, then learn_one(x) to update the cluster model. Predicting before learning means each assignment is evaluated against a model that has not yet seen that observation. Cluster centers and memberships evolve continuously.
Key considerations:
- learn_one(x) updates cluster centers or micro-clusters
- predict_one(x) returns the assigned cluster label (integer)
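- predict_one can be called from the first observation onward, although assignments made before the model has seen much data carry little meaning
- KMeans initializes its centers randomly (Gaussian draws controlled by the mu and sigma parameters) rather than from the first k observations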
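The predict-then-learn loop can be sketched with a small stand-in model that exposes the same two methods. TinyOnlineKMeans below is a hypothetical class written for this example; its random initialization and its update rule (move the closest center toward x by a fraction halflife) are assumptions of the sketch rather than a copy of River's KMeans.

```python
import random
from collections import defaultdict


class TinyOnlineKMeans:
    """Sketch of an online k-means with learn_one/predict_one."""

    def __init__(self, n_clusters=2, halflife=0.5, seed=42):
        rng = random.Random(seed)
        self.halflife = halflife  # fraction of the gap closed on each update
        # Each center is filled lazily with a random value per feature name.
        self.centers = {
            i: defaultdict(lambda: rng.gauss(0.0, 1.0)) for i in range(n_clusters)
        }

    def _sq_distance(self, x, center):
        return sum((value - center[name]) ** 2 for name, value in x.items())

    def predict_one(self, x):
        return min(self.centers, key=lambda i: self._sq_distance(x, self.centers[i]))

    def learn_one(self, x):
        center = self.centers[self.predict_one(x)]
        for name, value in x.items():
            center[name] += self.halflife * (value - center[name])


stream = [{"x": 0.0}, {"x": 0.2}, {"x": 10.0}, {"x": 9.8}, {"x": 0.1}, {"x": 10.1}]
model = TinyOnlineKMeans(n_clusters=2, halflife=0.5, seed=42)

assignments = []
for x in stream:
    label = model.predict_one(x)  # predict first: the model has not seen x yet
    assignments.append(label)
    model.learn_one(x)            # then update the model with x

print(assignments)
print({i: dict(c) for i, c in model.centers.items()})
```

Early labels reflect the random starting centers, which is why the first few assignments are usually excluded from any quality assessment.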
Step 5: Evaluate Cluster Quality
When ground truth labels are available, use external cluster evaluation metrics: Rand Index and Adjusted Rand Index compare predicted clusters to ground truth, while V-Measure, Homogeneity, and Completeness provide entropy-based evaluation. Without labels, an internal metric such as the Silhouette coefficient measures intra-cluster cohesion versus inter-cluster separation. All metrics update incrementally.
Key considerations:
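- The classical Silhouette coefficient ranges from -1 to 1, with higher being better; streaming implementations may use a simplified form with a different range or direction, so check the metric's bigger_is_better property before comparing runs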
- Rand Index measures agreement between two clusterings
- Use progressive evaluation: predict first, evaluate, then learn
- Internal metrics (Silhouette) work without ground truth
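As a sketch of how an external metric can update incrementally, the Rand Index can be computed from running label counts alone, without storing any observations. The class StreamingRandIndex is a hypothetical illustration; River ships its own metric classes, whose internals may differ.

```python
from collections import Counter


def pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


class StreamingRandIndex:
    """Rand Index maintained from running contingency counts."""

    def __init__(self):
        self.n = 0
        self.true_counts = Counter()
        self.pred_counts = Counter()
        self.joint_counts = Counter()

    def update(self, y_true, y_pred):
        self.n += 1
        self.true_counts[y_true] += 1
        self.pred_counts[y_pred] += 1
        self.joint_counts[(y_true, y_pred)] += 1

    def get(self):
        total = pairs(self.n)
        if total == 0:
            return 0.0
        same_both = sum(pairs(c) for c in self.joint_counts.values())
        same_true = sum(pairs(c) for c in self.true_counts.values())
        same_pred = sum(pairs(c) for c in self.pred_counts.values())
        # Pairs placed in different groups by both clusterings.
        different_both = total - same_true - same_pred + same_both
        return (same_both + different_both) / total


metric = StreamingRandIndex()
for y_true, y_pred in [("a", 1), ("a", 1), ("b", 0), ("b", 0)]:
    metric.update(y_true, y_pred)

print(metric.get())  # 1.0: cluster ids differ from the labels, but the grouping matches
```

Because the metric compares pairs rather than label values, it is unaffected by how the clusterer happens to number its clusters.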
Step 6: Monitor Cluster Evolution
Track how clusters change over time. Inspect the number of active clusters, cluster centers, and cluster sizes. For DBSTREAM and DenStream, monitor the micro-cluster lifecycle (creation, merging, and removal). Adjust algorithm parameters (e.g., DBSTREAM's clustering_threshold or DenStream's epsilon) based on observed behavior.
Key considerations:
- KMeans cluster centers are accessible via model.centers
- DBSTREAM automatically manages cluster count based on density
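- Algorithms with decay or fading (for example KMeans halflife, DBSTREAM fading_factor, DenStream decaying_factor) down-weight old observations, so stale clusters fade out over time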
- Time-windowed evaluation shows clustering quality trends
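One lightweight way to watch clusters evolve is to keep only the most recent assignments and summarize them. The sketch below assumes the labels come from predict_one; the class ClusterWindowMonitor and the window size are hypothetical choices for illustration.

```python
from collections import Counter, deque


class ClusterWindowMonitor:
    """Tracks cluster sizes and the active cluster count over a sliding window."""

    def __init__(self, window_size):
        # deque with maxlen drops the oldest label automatically.
        self.window = deque(maxlen=window_size)

    def update(self, label):
        self.window.append(label)

    def sizes(self):
        return Counter(self.window)

    def n_active(self):
        return len(set(self.window))


# Hypothetical sequence of predicted labels: cluster 2 gradually takes over.
labels = [0, 0, 1, 0, 1, 2, 2, 2, 2, 2]

monitor = ClusterWindowMonitor(window_size=4)
history = []
for label in labels:
    monitor.update(label)
    history.append(monitor.n_active())

print(history)
print(monitor.sizes())
```

A rising then falling active-cluster count, as in this example, is the kind of pattern that suggests a shift in the stream and may prompt a review of the algorithm's thresholds.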