Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Online ml River CluStream Clustering

From Leeroopedia
Revision as of 18:25, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Online_ml_River_CluStream_Clustering.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources Domains Last Updated
River River Docs A Framework for Clustering Evolving Data Streams (Aggarwal et al., 2003) Online Clustering, Temporal Micro-Clusters, Streaming Algorithms 2026-02-08 16:00 GMT

Overview

CluStream is an online clustering framework that maintains temporal micro-clusters for streaming data, combining an online micro-cluster maintenance phase with a periodic offline macro-clustering phase using incremental K-Means.

Description

CluStream was introduced by Aggarwal et al. (2003) as a general framework for clustering evolving data streams. It separates the clustering problem into two phases:

Online Phase: The algorithm maintains a fixed budget of max_micro_clusters micro-cluster summaries. Each micro-cluster is a temporal extension of a cluster feature vector (CF), tracking the number of observations, weighted mean, weighted variance (via Welford's algorithm), and a timestamp distribution. When a new point arrives, the algorithm finds the closest micro-cluster. If the point falls within that micro-cluster's radius (defined as micro_cluster_r_factor times the RMS deviation), it is absorbed. Otherwise, space must be freed: either an old micro-cluster (whose relevance stamp falls below the time window threshold) is deleted, or the two closest micro-clusters are merged.

Offline Phase: Every time_gap time steps, the algorithm extracts the centers of all current micro-clusters and applies an incremental K-Means algorithm on those centers to produce n_macro_clusters final clusters. This periodic reclustering ensures the macro-cluster solution stays current without being triggered on every single observation.

The temporal aspect is key: micro-clusters track when points arrived, not just where. The relevance stamp uses the mean and standard deviation of timestamps to determine if a micro-cluster is still active within the current time_window.

Usage

Use CluStream when:

  • You need a scalable online clustering algorithm with a fixed memory budget (bounded number of micro-clusters).
  • The data stream has a temporal component and you want clustering to respect recency.
  • You want a two-phase approach where the online phase summarizes data and the offline phase produces interpretable clusters.
  • You need a fixed number of output clusters (specified by n_macro_clusters).

CluStream provides more temporal awareness than basic online K-Means but assumes the number of macro-clusters is known in advance.

Theoretical Basis

Cluster Feature Vector (Temporal Extension):

Each micro-cluster maintains:

CF = {
    n:        number of absorbed points (weighted count)
    var_x:    incremental variance per feature dimension (Welford's algorithm)
    var_time: incremental variance of timestamps
}

Center and Radius:

center(mc) = {k: var_x[k].mean for k in features}
deviation(mc) = (SUM_k sqrt(var_x[k].variance)) / num_features
radius(mc) = deviation(mc) * micro_cluster_r_factor

Relevance Stamp (for temporal pruning):

mu_time = var_time.mean
sigma_time = sqrt(var_time.variance)

IF n < 2 * max_micro_clusters:
    relevance_stamp = mu_time
ELSE:
    relevance_stamp = mu_time + sigma_time * quantile(max_micro_clusters / (2 * n))

Where the quantile function is computed using the inverse error function approximation.

Online Update Algorithm:

FOR each new point p at timestamp t:
    closest_mc, closest_dist = find_closest_micro_cluster(p)

    IF closest_mc has only 1 point:
        radius = min distance to any other micro-cluster center
    ELSE:
        radius = closest_mc.deviation * micro_cluster_r_factor

    IF closest_dist < radius:
        closest_mc.insert(p, w, t)  // absorb into micro-cluster
    ELSE:
        // Free space for new micro-cluster:
        threshold = t - time_window
        IF any micro-cluster has relevance_stamp < threshold:
            REPLACE that micro-cluster with new one from p
        ELSE:
            MERGE two closest micro-clusters
            CREATE new micro-cluster from p

    // Periodic macro-clustering
    IF t mod time_gap == time_gap - 1:
        Extract micro-cluster centers
        Apply KMeans(n_macro_clusters) on centers
        Update macro-cluster centers

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment