Principle:Online ml River DBSTREAM Clustering
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River River Docs Clustering Data Streams Based on Shared Density between Micro-Clusters (Hahsler and Bolanos, 2016) | Online Clustering, Density-Based Clustering, Streaming Algorithms | 2026-02-08 16:00 GMT |
Overview
DBSTREAM is an online density-based clustering algorithm that maintains shared-density micro-clusters connected via a weighted graph for discovering clusters of arbitrary shape in evolving data streams.
Description
DBSTREAM (Density-Based Stream clustering) is a two-phase online clustering algorithm designed for evolving data streams. Unlike partition-based approaches like K-Means that assume a fixed number of spherical clusters, DBSTREAM discovers clusters of arbitrary shape by tracking density relationships between micro-clusters.
The algorithm operates in two phases:
Online Phase (Micro-cluster Maintenance): As each new data point arrives, the algorithm checks whether it falls within the clustering_threshold radius of any existing micro-cluster center. If so, the micro-cluster's weight is increased (with fading applied to account for recency) and its center is moved toward the new point using a Gaussian neighborhood function. If no nearby micro-cluster is found, a new one is created. The algorithm also maintains a shared density graph that tracks how often pairs of micro-clusters simultaneously capture the same incoming points, providing evidence that they belong to the same cluster.
Offline Phase (Macro-clustering): When a cluster prediction is requested, the shared density graph is used to build a connectivity graph among strong micro-clusters. A variant of DBSCAN is applied to this connectivity graph to find connected components, which become the final macro-clusters.
A periodic cleanup process runs every cleanup_interval time steps to remove weak micro-clusters (those whose faded weight falls below a threshold) and weak shared density entries, recovering memory and improving performance.
Usage
Use DBSTREAM when:
- You need to discover clusters of arbitrary shape in a data stream (not just spherical clusters).
- The number of clusters is unknown and may change over time.
- You want density-based clustering that naturally handles noise and outliers.
- You need a method that explicitly captures density between clusters via the shared density graph.
DBSTREAM is more computationally expensive than simple online K-Means but provides much richer cluster structure information.
Theoretical Basis
DBSTREAM is based on three core algorithms from Hahsler and Bolanos (2016):
Algorithm 1 -- Online Micro-cluster Update:
FOR each new point p:
neighbors = {mc_i : distance(p, mc_i.center) < clustering_threshold}
IF neighbors is empty:
Create new micro-cluster at p with weight = 1
ELSE:
FOR each mc_i in neighbors:
mc_i.weight = mc_i.weight * 2^(-fading_factor * (t - mc_i.last_update)) + 1
mc_i.center = mc_i.center + G(p, mc_i.center) * (p - mc_i.center)
mc_i.last_update = t
FOR each pair (mc_i, mc_j) in neighbors where j > i:
S[i][j] = S[i][j] * 2^(-fading_factor * (t - S_t[i][j])) + 1
S_t[i][j] = t
// Prevent collapsing: revert centers if they come too close
FOR each pair (mc_i, mc_j) in neighbors:
IF distance(mc_i.center, mc_j.center) < clustering_threshold:
Revert mc_i.center and mc_j.center
t = t + 1
Where G(p, c) is the Gaussian neighborhood function:
G(p, c) = exp(-distance(p, c)^2 / (2 * sigma^2))
where sigma = clustering_threshold / 3
Algorithm 2 -- Cleanup (every cleanup_interval steps):
weight_weak = 2^(-fading_factor * cleanup_interval)
FOR each micro-cluster mc_i:
IF mc_i.weight * 2^(-fading_factor * (t - mc_i.last_update)) < weight_weak:
DELETE mc_i
FOR each shared density entry S[i][j]:
IF S[i][j] * 2^(-fading_factor * (t - S_t[i][j])) < intersection_factor * weight_weak:
SET S[i][j] = 0
Algorithm 3 -- Reclustering (on predict request):
Build weighted adjacency matrix from shared density:
FOR each pair (i, j) in S:
value = S[i][j] / ((mc_i.weight + mc_j.weight) / 2)
IF value > intersection_factor AND both mc_i, mc_j have weight > minimum_weight:
Add edge (i, j) to adjacency
Apply DBSCAN-variant on adjacency graph to find connected components
Each connected component becomes a macro-cluster