Principle:Online ml River Incremental Clustering Interface

Knowledge Sources: River, River Docs
Domains: Online Clustering, API Design, Abstract Interfaces
Last Updated: 2026-02-08 16:00 GMT

Overview

The Incremental Clustering Interface is the abstract interface that all online clustering algorithms in River must implement, defining the learn_one / predict_one contract for streaming unsupervised learning.

Description

River's design philosophy centers on a uniform API for all estimators. For clustering, this means every algorithm -- whether it is K-Means, DBSTREAM, DenStream, CluStream, STREAMKMeans, or TextClust -- must implement the same two core methods:

learn_one(x: dict): Update the clustering model's internal state using one observation represented as a feature dictionary. This method processes the observation immediately and returns, without storing it permanently.
predict_one(x: dict) -> int: Assign a cluster index (integer) to the given observation based on the model's current state. This does not modify the model.

This interface is defined by the base.Clusterer abstract base class, which inherits from base.Estimator. The Clusterer class marks itself as unsupervised (_supervised = False) and declares both methods as abstract, forcing all subclasses to provide concrete implementations.

This uniform interface enables:

Pipeline composition: Any clusterer can be placed in a River pipeline alongside transformers and other components.
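Interchangeable algorithms: Users can swap one clustering algorithm for another by changing only the constructor call and its hyperparameters; the surrounding learn_one / predict_one code stays the same.
Consistent evaluation: Because every clusterer returns an integer label from predict_one, the same evaluation loop works with any algorithm. External metrics such as AdjustedRand need only the true and predicted labels; internal metrics such as Silhouette additionally take the observation and the model's current cluster centers.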
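
The interchangeability point can be illustrated with a minimal, standalone sketch. It does not import River: SupportsClustering, ThresholdClusterer, SingleCluster, and assign_stream are hypothetical names invented here, and the two toy models only mimic the learn_one / predict_one contract.

```python
from typing import Any, Iterable, Protocol


class SupportsClustering(Protocol):
    """Hypothetical structural type mirroring the learn_one / predict_one contract."""

    def learn_one(self, x: dict[str, Any]) -> None: ...

    def predict_one(self, x: dict[str, Any]) -> int: ...


class ThresholdClusterer:
    """Toy model: splits observations around the running mean of feature 'v'."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def learn_one(self, x: dict[str, Any]) -> None:
        self.n += 1
        self.mean += (x["v"] - self.mean) / self.n  # incremental mean update

    def predict_one(self, x: dict[str, Any]) -> int:
        return int(x["v"] > self.mean)


class SingleCluster:
    """Toy model: assigns every observation to cluster 0."""

    def learn_one(self, x: dict[str, Any]) -> None:
        pass

    def predict_one(self, x: dict[str, Any]) -> int:
        return 0


def assign_stream(model: SupportsClustering, stream: Iterable[dict[str, Any]]) -> list[int]:
    """Generic code: depends only on the two interface methods."""
    labels = []
    for x in stream:
        model.learn_one(x)
        labels.append(model.predict_one(x))
    return labels


stream = [{"v": 1.0}, {"v": 9.0}, {"v": 2.0}, {"v": 8.0}]
labels_a = assign_stream(ThresholdClusterer(), stream)
labels_b = assign_stream(SingleCluster(), stream)  # only the constructor changed
```

Only the constructor differs between the two calls; assign_stream itself is unchanged. With River installed, the same function would be expected to accept any base.Clusterer subclass.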

Usage

The Incremental Clustering Interface is a pattern that applies whenever you are:

Implementing a new clustering algorithm for River -- you must subclass base.Clusterer and implement both methods.
Writing generic code that should work with any River clustering algorithm -- program to the Clusterer interface.
Building evaluation pipelines that loop over a stream, calling learn_one and predict_one in sequence.
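
For the first use case, the following is a rough sketch of what a new algorithm might look like. LeaderClusterer, its radius and lr parameters, and the centers attribute are all hypothetical; the class is written as plain Python so it runs without River, whereas inside River it would inherit from base.Clusterer.

```python
import math
from typing import Any


class LeaderClusterer:
    """Hypothetical toy clusterer; in River this would subclass base.Clusterer."""

    def __init__(self, radius: float = 1.0, lr: float = 0.5) -> None:
        self.radius = radius  # maximum distance for joining an existing cluster
        self.lr = lr          # how far a center moves toward a new observation
        self.centers: list[dict[Any, float]] = []

    @staticmethod
    def _distance(a: dict, b: dict) -> float:
        # Euclidean distance; a feature missing from one dict counts as 0.0
        keys = a.keys() | b.keys()
        return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

    def _nearest(self, x: dict) -> tuple:
        # Index and distance of the closest center, or (None, inf) if there is none
        best, best_d = None, math.inf
        for i, c in enumerate(self.centers):
            d = self._distance(x, c)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def learn_one(self, x: dict) -> None:
        i, d = self._nearest(x)
        if i is None or d > self.radius:
            self.centers.append(dict(x))  # open a new cluster at x
            return
        c = self.centers[i]
        for k in x.keys() | c.keys():
            c[k] = c.get(k, 0.0) + self.lr * (x.get(k, 0.0) - c.get(k, 0.0))

    def predict_one(self, x: dict) -> int:
        i, _ = self._nearest(x)  # read-only: the model is not modified
        return 0 if i is None else i


model = LeaderClusterer(radius=2.0)
for x in [{"a": 0.0, "b": 0.0}, {"a": 10.0, "b": 10.0}, {"a": 1.0, "b": 0.0}]:
    model.learn_one(x)
```

Note that learn_one keeps only the cluster centers, not the observations themselves, and that predict_one leaves the model untouched, matching the contract described above.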

Theoretical Basis

The interface embodies the core pattern of online unsupervised learning:

INTERFACE Clusterer:
    METHOD learn_one(x: dict[FeatureName, Any]) -> None
        """Process one observation and update internal cluster model."""
        ABSTRACT

    METHOD predict_one(x: dict[FeatureName, Any]) -> int
        """Return the cluster index for one observation without modifying the model."""
        ABSTRACT

    PROPERTY _supervised = False
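
The pseudocode above could be rendered in Python roughly as follows. This is a standalone sketch built on the standard abc module, not River's actual source; the Incomplete subclass is a hypothetical example showing how abstract methods force subclasses to implement both parts of the contract.

```python
import abc
from typing import Any


class Clusterer(abc.ABC):
    """Standalone sketch of the contract; not River's actual base.Clusterer."""

    _supervised = False  # clustering does not use target values

    @abc.abstractmethod
    def learn_one(self, x: dict[str, Any]) -> None:
        """Process one observation and update the internal cluster model."""

    @abc.abstractmethod
    def predict_one(self, x: dict[str, Any]) -> int:
        """Return the cluster index for one observation without modifying the model."""


class Incomplete(Clusterer):
    """Hypothetical subclass that forgets to implement predict_one."""

    def learn_one(self, x: dict[str, Any]) -> None:
        pass


try:
    Incomplete()
    instantiated = True
except TypeError:
    # abc refuses to instantiate a class with unimplemented abstract methods
    instantiated = False
```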

The learn-then-predict loop:

model = SomeClusterer(...)

FOR each (x, _) in data_stream:
    model.learn_one(x)        // Update the model with the observation
    label = model.predict_one(x)  // Get the cluster assignment
    // Optionally: evaluate, log, or act on the label
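
A runnable version of this loop, under stated assumptions: TwoMeans1D is a hypothetical toy model standing in for SomeClusterer, the stream is a small hard-coded list of (x, y) pairs whose targets are ignored, and the "act on the label" step simply counts cluster sizes.

```python
from collections import Counter


class TwoMeans1D:
    """Hypothetical toy clusterer: two centers on feature 'v', updated online."""

    def __init__(self, c0: float, c1: float, lr: float = 0.1) -> None:
        self.centers = [c0, c1]
        self.lr = lr

    def predict_one(self, x: dict) -> int:
        distances = [abs(x["v"] - c) for c in self.centers]
        return distances.index(min(distances))

    def learn_one(self, x: dict) -> None:
        i = self.predict_one(x)  # find the closest center
        self.centers[i] += self.lr * (x["v"] - self.centers[i])  # nudge it toward x


# Each stream element is an (x, y) pair; y is unused in unsupervised learning.
data_stream = [
    ({"v": 0.2}, None),
    ({"v": 5.1}, None),
    ({"v": 0.4}, None),
    ({"v": 4.8}, None),
]

model = TwoMeans1D(c0=0.0, c1=5.0)
sizes = Counter()

for x, _ in data_stream:
    model.learn_one(x)            # update the model with the observation
    label = model.predict_one(x)  # get the cluster assignment
    sizes[label] += 1             # act on the label: track cluster sizes
```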

Key design decisions:

Dictionaries, not arrays: Observations are dict objects, allowing dynamic features (new keys can appear at any time). This is essential for streaming data where the feature space may evolve.
Integer labels: Cluster assignments are integers, making them compatible with standard metrics and straightforward to compare.
No separate batch fit step: Unlike batch APIs (e.g., scikit-learn's fit/predict), where fit is called once on a full dataset, learn_one is called once per observation for as long as the stream runs, updating the model incrementally each time.
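
To make the dictionary decision concrete, here is a small sketch of per-feature state that tolerates new keys appearing mid-stream. RunningCenter is a hypothetical helper, not a River class; it keeps a running mean for each feature it has seen so far.

```python
from collections import defaultdict


class RunningCenter:
    """Hypothetical helper: per-feature running mean over dict observations."""

    def __init__(self) -> None:
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def update(self, x: dict) -> None:
        for key, value in x.items():
            self.counts[key] += 1
            # Incremental mean: no past observations need to be stored
            self.means[key] += (value - self.means[key]) / self.counts[key]


center = RunningCenter()
center.update({"temp": 20.0})
center.update({"temp": 22.0, "humidity": 0.5})  # a new feature appears mid-stream
```

With a fixed-width array, the second observation would require resizing or re-encoding; with dictionaries, the new key is simply picked up when it first appears.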

Related Pages

Implementation:Online_ml_River_Clusterer_Learn_Predict
