Principle:Online ml River Incremental Clustering Interface
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River River Docs | Online Clustering, API Design, Abstract Interfaces | 2026-02-08 16:00 GMT |
Overview
The Incremental Clustering Interface is the abstract interface that all online clustering algorithms in River must implement, defining the learn_one / predict_one contract for streaming unsupervised learning.
Description
River's design philosophy centers on a uniform API for all estimators. For clustering, this means every algorithm -- whether it is K-Means, DBSTREAM, DenStream, CluStream, STREAMKMeans, or TextClust -- must implement the same two core methods:
- learn_one(x: dict): Update the clustering model's internal state using one observation represented as a feature dictionary. This method processes the observation immediately and returns, without storing it permanently.
- predict_one(x: dict) -> int: Assign a cluster index (integer) to the given observation based on the model's current state. This does not modify the model.
This interface is defined by the base.Clusterer abstract base class, which inherits from base.Estimator. The Clusterer class marks itself as unsupervised (_supervised = False) and declares both methods as abstract, forcing all subclasses to provide concrete implementations.
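The effect of declaring both methods abstract can be sketched with the standard library alone. The class below is a hypothetical stand-in (ClustererSketch, Incomplete, and SingleCluster are illustrative names, not River's source code); it only mirrors the contract described above to show that a subclass missing one of the two methods cannot be instantiated.

```python
import abc


class ClustererSketch(abc.ABC):
    """Hypothetical stand-in mirroring the contract; not River's actual class."""

    _supervised = False  # clustering needs no target value

    @abc.abstractmethod
    def learn_one(self, x: dict) -> None:
        """Update internal state from one observation."""

    @abc.abstractmethod
    def predict_one(self, x: dict) -> int:
        """Return a cluster index without modifying the model."""


class Incomplete(ClustererSketch):
    # predict_one is deliberately left out.
    def learn_one(self, x):
        pass


class SingleCluster(ClustererSketch):
    """Trivial concrete subclass: every observation belongs to cluster 0."""

    def __init__(self):
        self.n_seen = 0

    def learn_one(self, x):
        self.n_seen += 1

    def predict_one(self, x):
        return 0


# Instantiating a subclass with a missing abstract method raises TypeError.
try:
    Incomplete()
    instantiation_failed = False
except TypeError:
    instantiation_failed = True

model = SingleCluster()
model.learn_one({"a": 1.0})
```

In River itself the equivalent step would be subclassing base.Clusterer rather than a hand-written abstract class.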
This uniform interface enables:
- Pipeline composition: Any clusterer can be placed in a River pipeline alongside transformers and other components.
- Interchangeable algorithms: Users can swap one clustering algorithm for another with zero code changes beyond the constructor.
- Consistent evaluation: Metrics such as AdjustedRand work identically with any clusterer because they depend only on the predict_one output. Silhouette additionally takes the model's cluster centers, so it applies to clusterers that expose them.
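Interchangeability can be illustrated with a small standalone sketch. Everything here is hypothetical (SupportsClustering, SignClusterer, FeatureCountClusterer, and label_stream are illustrative names, and the two toy models are not real clustering algorithms); the point is only that generic code written against learn_one / predict_one does not change when the model is swapped.

```python
from typing import Iterable, List, Protocol


class SupportsClustering(Protocol):
    """Structural type: anything with these two methods qualifies."""

    def learn_one(self, x: dict) -> None: ...

    def predict_one(self, x: dict) -> int: ...


class SignClusterer:
    """Toy model: cluster 0 if the feature sum is negative, else cluster 1."""

    def learn_one(self, x):
        pass

    def predict_one(self, x):
        return 0 if sum(x.values()) < 0 else 1


class FeatureCountClusterer:
    """Toy model: cluster index is the number of features modulo 2."""

    def learn_one(self, x):
        pass

    def predict_one(self, x):
        return len(x) % 2


def label_stream(model: SupportsClustering, stream: Iterable[dict]) -> List[int]:
    """Generic code: relies only on the two-method contract."""
    labels = []
    for x in stream:
        model.learn_one(x)
        labels.append(model.predict_one(x))
    return labels


stream = [{"a": -1.0}, {"a": 2.0, "b": 1.0}, {"a": -3.0, "b": 1.0}]
sign_labels = label_stream(SignClusterer(), stream)           # [0, 1, 0]
count_labels = label_stream(FeatureCountClusterer(), stream)  # [1, 0, 0]
```

Only the constructor call differs between the two runs; label_stream itself is untouched.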
Usage
The Incremental Clustering Interface is a pattern that applies whenever you are:
- Implementing a new clustering algorithm for River -- you must subclass base.Clusterer and implement both methods.
- Writing generic code that should work with any River clustering algorithm -- program to the Clusterer interface.
- Building evaluation pipelines that loop over a stream, calling learn_one and predict_one in sequence.
Theoretical Basis
The interface embodies the core pattern of online unsupervised learning:
INTERFACE Clusterer:
    METHOD learn_one(x: dict[FeatureName, Any]) -> None
        """Process one observation and update internal cluster model."""
        ABSTRACT
    METHOD predict_one(x: dict[FeatureName, Any]) -> int
        """Return the cluster index for one observation without modifying the model."""
        ABSTRACT
    PROPERTY _supervised = False
The learn-then-predict loop:
model = SomeClusterer(...)
FOR each (x, _) in data_stream:
    model.learn_one(x)            // Update the model with the observation
    label = model.predict_one(x)  // Get the cluster assignment
    // Optionally: evaluate, log, or act on the label
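A runnable version of this loop, as a minimal sketch: LeaderClusterer is a hypothetical toy written for illustration (it is not one of River's algorithms), and the radius parameter and the four-point stream are assumptions. It opens a new cluster whenever an observation is farther than radius from every existing center.

```python
import math


class LeaderClusterer:
    """Hypothetical toy: open a new cluster when no center is within `radius`."""

    def __init__(self, radius):
        self.radius = radius
        self.centers = []  # one feature dict per cluster

    def _distance(self, a, b):
        # Euclidean distance over the union of keys; missing features count as 0.
        keys = a.keys() | b.keys()
        return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

    def _nearest(self, x):
        # Returns (index, distance) of the closest center, or (None, inf) if none exist.
        best, best_d = None, math.inf
        for i, center in enumerate(self.centers):
            d = self._distance(x, center)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def learn_one(self, x):
        i, d = self._nearest(x)
        if i is None or d > self.radius:
            self.centers.append(dict(x))

    def predict_one(self, x):
        # Read-only: looks up the nearest center without changing the model.
        i, _ = self._nearest(x)
        return 0 if i is None else i


data_stream = [({"x": 0.0}, None), ({"x": 0.2}, None),
               ({"x": 5.0}, None), ({"x": 5.1}, None)]

model = LeaderClusterer(radius=1.0)
labels = []
for x, _ in data_stream:
    model.learn_one(x)                   # update the model with the observation
    labels.append(model.predict_one(x))  # get the cluster assignment
```

With this stream the loop yields labels [0, 0, 1, 1] and two centers.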
Key design decisions:
- Dictionaries, not arrays: Observations are dict objects, allowing dynamic features (new keys can appear at any time). This is essential for streaming data where the feature space may evolve.
- Integer labels: Cluster assignments are integers, making them compatible with standard metrics and straightforward to compare.
- No one-off fit step: Unlike batch APIs (e.g., scikit-learn's fit/predict), learn_one updates the model incrementally and is called for every observation rather than once over the full dataset.
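The first and last decisions can be seen together in a small sketch. RunningCentroid is a hypothetical helper, and treating a missing feature as 0 is an assumption made here for illustration, not a documented River convention. Each call updates the mean in place, and a key that first appears mid-stream is absorbed without any schema change.

```python
from collections import defaultdict


class RunningCentroid:
    """Hypothetical helper: incremental per-feature mean over dict observations."""

    def __init__(self):
        self.n = 0
        self.mean = defaultdict(float)

    def update(self, x):
        self.n += 1
        # The union is a new set, so adding keys to self.mean inside the loop is safe.
        for key in self.mean.keys() | x.keys():
            # Assumption: a feature absent from x contributes 0.
            self.mean[key] += (x.get(key, 0.0) - self.mean[key]) / self.n


centroid = RunningCentroid()
centroid.update({"a": 2.0})
centroid.update({"a": 4.0, "b": 6.0})  # "b" appears for the first time
```

After the two updates the mean is 3.0 for both "a" and "b", since "b" is averaged over an implicit 0 and 6.0.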