Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Online ml River DBSTREAM Clustering

From Leeroopedia


Knowledge Sources Domains Last Updated
River River Docs Clustering Data Streams Based on Shared Density between Micro-Clusters (Hahsler and Bolanos, 2016) Online Clustering, Density-Based Clustering, Streaming Algorithms 2026-02-08 16:00 GMT

Overview

DBSTREAM is an online density-based clustering algorithm that maintains shared-density micro-clusters connected via a weighted graph for discovering clusters of arbitrary shape in evolving data streams.

Description

DBSTREAM (Density-Based Stream clustering) is a two-phase online clustering algorithm designed for evolving data streams. Unlike partition-based approaches like K-Means that assume a fixed number of spherical clusters, DBSTREAM discovers clusters of arbitrary shape by tracking density relationships between micro-clusters.

The algorithm operates in two phases:

Online Phase (Micro-cluster Maintenance): As each new data point arrives, the algorithm checks whether it falls within the clustering_threshold radius of any existing micro-cluster center. If so, the micro-cluster's weight is increased (with fading applied to account for recency) and its center is moved toward the new point using a Gaussian neighborhood function. If no nearby micro-cluster is found, a new one is created. The algorithm also maintains a shared density graph that tracks how often pairs of micro-clusters simultaneously capture the same incoming points, providing evidence that they belong to the same cluster.

Offline Phase (Macro-clustering): When a cluster prediction is requested, the shared density graph is used to build a connectivity graph among strong micro-clusters. A variant of DBSCAN is applied to this connectivity graph to find connected components, which become the final macro-clusters.

A periodic cleanup process runs every cleanup_interval time steps to remove weak micro-clusters (those whose faded weight falls below a threshold) and weak shared density entries, recovering memory and improving performance.

Usage

Use DBSTREAM when:

  • You need to discover clusters of arbitrary shape in a data stream (not just spherical clusters).
  • The number of clusters is unknown and may change over time.
  • You want density-based clustering that naturally handles noise and outliers.
  • You need a method that explicitly captures density between clusters via the shared density graph.

DBSTREAM is more computationally expensive than simple online K-Means but provides much richer cluster structure information.

Theoretical Basis

DBSTREAM is based on three core algorithms from Hahsler and Bolanos (2016):

Algorithm 1 -- Online Micro-cluster Update:

FOR each new point p:
    neighbors = {mc_i : distance(p, mc_i.center) < clustering_threshold}

    IF neighbors is empty:
        Create new micro-cluster at p with weight = 1
    ELSE:
        FOR each mc_i in neighbors:
            mc_i.weight = mc_i.weight * 2^(-fading_factor * (t - mc_i.last_update)) + 1
            mc_i.center = mc_i.center + G(p, mc_i.center) * (p - mc_i.center)
            mc_i.last_update = t

            FOR each pair (mc_i, mc_j) in neighbors where j > i:
                S[i][j] = S[i][j] * 2^(-fading_factor * (t - S_t[i][j])) + 1
                S_t[i][j] = t

        // Prevent collapsing: revert centers if they come too close
        FOR each pair (mc_i, mc_j) in neighbors:
            IF distance(mc_i.center, mc_j.center) < clustering_threshold:
                Revert mc_i.center and mc_j.center

    t = t + 1

Where G(p, c) is the Gaussian neighborhood function:

G(p, c) = exp(-distance(p, c)^2 / (2 * sigma^2))
where sigma = clustering_threshold / 3

Algorithm 2 -- Cleanup (every cleanup_interval steps):

weight_weak = 2^(-fading_factor * cleanup_interval)

FOR each micro-cluster mc_i:
    IF mc_i.weight * 2^(-fading_factor * (t - mc_i.last_update)) < weight_weak:
        DELETE mc_i

FOR each shared density entry S[i][j]:
    IF S[i][j] * 2^(-fading_factor * (t - S_t[i][j])) < intersection_factor * weight_weak:
        SET S[i][j] = 0

Algorithm 3 -- Reclustering (on predict request):

Build weighted adjacency matrix from shared density:
    FOR each pair (i, j) in S:
        value = S[i][j] / ((mc_i.weight + mc_j.weight) / 2)
        IF value > intersection_factor AND both mc_i, mc_j have weight > minimum_weight:
            Add edge (i, j) to adjacency

Apply DBSCAN-variant on adjacency graph to find connected components
Each connected component becomes a macro-cluster

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment