Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Online ml River STREAMKMeans Clustering

From Leeroopedia


Knowledge Sources Domains Last Updated
River River Docs Streaming-data algorithms for high-quality clustering (O'Callaghan et al., 2002) Online Clustering, Chunk-Based Streaming, Two-Phase Clustering 2026-02-08 16:00 GMT

Overview

STREAMKMeans is a two-phase streaming k-means algorithm that buffers chunks of data points and periodically applies batch k-means to merge cluster summaries, providing a trade-off between pure online and batch clustering.

Description

STREAMKMeans bridges the gap between fully online (one-point-at-a-time) and fully batch (all-data-at-once) K-Means. Rather than updating centers after every single point, it collects points in a temporary buffer of fixed size (chunk_size). When the buffer is full, a fresh incremental K-Means instance is created and applied to all points in the chunk, producing n_clusters local centroids. These local centroids are then fed into a global K-Means instance that maintains the overall cluster structure.

This approach provides higher quality cluster centers than pure online K-Means (because each chunk is clustered together) while still operating in a streaming fashion (because only one chunk needs to be in memory at a time).

The algorithm is an adaptation of the STREAM framework by O'Callaghan et al. (2002), replacing the original k-medians with LSEARCH by River's incremental K-Means.

Usage

Use STREAMKMeans when:

  • You want higher clustering quality than pure online K-Means without requiring all data in memory.
  • Data arrives in a stream and you can afford to buffer a small number of observations before updating.
  • The number of clusters is known in advance.
  • You want a simple algorithm with easily tunable parameters (chunk size and number of clusters).

The chunk_size parameter controls the quality/latency trade-off: larger chunks give better local clustering but introduce more delay before the model updates.

Theoretical Basis

The STREAMKMeans algorithm operates as follows:

INITIALIZE:
    global_kmeans = KMeans(n_clusters, **kwargs)
    chunk_buffer = {}
    t = 0

FOR each new point p:
    t = t + 1
    Add p to chunk_buffer

    IF chunk_buffer is full (t mod chunk_size == 0):
        // Phase 1: Local clustering on the chunk
        local_kmeans = KMeans(n_clusters, **kwargs)
        FOR each point q in chunk_buffer:
            local_kmeans.learn_one(q)

        // Phase 2: Merge local centers into global model
        FOR each center c in local_kmeans.centers:
            global_kmeans.learn_one(c)

        Clear chunk_buffer

    // Global centers are always available for prediction
    centers = global_kmeans.centers

Key properties:

  • Memory: Only chunk_size points plus the global K-Means centers need to be stored.
  • Quality: Each chunk is clustered together, giving better local structure than processing one point at a time.
  • Hierarchical summarization: Local centroids are treated as "super-observations" that summarize their chunk, then the global model clusters these summaries.
  • Convergence: As more chunks are processed, the global centers improve because they incorporate information from all previously seen centroids.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment