Principle:Online ml River DenStream Clustering
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River River Docs Density-Based Clustering over an Evolving Data Stream with Noise (Cao et al., 2006) | Online Clustering, Density-Based Clustering, Streaming Algorithms | 2026-02-08 16:00 GMT |
Overview
DenStream is an online density-based clustering algorithm that distinguishes potential, outlier, and core micro-clusters using exponential decay-weighted density, enabling robust clustering of evolving data streams with noise.
Description
DenStream addresses the challenge of discovering clusters of arbitrary shape in data streams while being robust to noise. It distinguishes between two types of micro-clusters:
- Potential micro-clusters (p-micro-clusters): Micro-clusters with sufficient density (weight above
mu * beta) that represent genuine cluster regions. - Outlier micro-clusters (o-micro-clusters): Micro-clusters with insufficient density that may represent noise or emerging clusters.
The algorithm operates in two phases:
Online Phase: For each new data point, DenStream attempts to merge it into the nearest potential micro-cluster. If the point would cause the micro-cluster's radius to exceed epsilon, it tries the nearest outlier micro-cluster instead. If that also fails, a new outlier micro-cluster is created. Periodically (every T_p timestamps), the algorithm prunes: potential micro-clusters whose weight has decayed below the threshold are deleted, and outlier micro-clusters unlikely to ever become potential ones are removed.
Offline Phase: When a clustering request arrives, a variant of DBSCAN is applied to the potential micro-cluster centers. Density-connected p-micro-clusters form the final clusters.
An initialization phase buffers the first n_samples_init points and applies an initial DBSCAN to create the first set of potential micro-clusters.
Usage
Use DenStream when:
- You need density-based clustering that discovers clusters of arbitrary shape in streams.
- The data may contain noise and outliers that should not corrupt cluster estimates.
- You want the algorithm to automatically adapt to new clusters appearing and old ones disappearing.
- You need explicit separation between established clusters and emerging/noisy regions.
DenStream is particularly effective when the stream has varying density regions and when distinguishing genuine clusters from transient noise is important.
Theoretical Basis
DenStream relies on micro-clusters with exponentially decaying weights.
Micro-cluster summary statistics:
Each micro-cluster maintains:
N: Number of points absorbedLS(linear sum):SUM(x_i)for all absorbed pointsSS(squared sum):SUM(x_i^2)for all absorbed pointscreation_timeandlast_edit_time
Fading function:
f(delta_t) = 2^(-decaying_factor * delta_t)
Weight computation:
weight(mc, t) = N * f(t - last_edit_time)
Center computation:
center(mc, t) = (f(t - last_edit_time) * LS) / weight(mc, t)
Radius computation:
radius(mc, t) = sqrt(|CF2|/weight - (|CF1|/weight)^2)
where CF1, CF2 are the faded linear and squared sums
Online merge procedure:
FOR each new point p:
IF p_micro_clusters is not empty:
c_p = nearest p-micro-cluster to p
Try inserting p into c_p
IF radius(c_p) <= epsilon after insert:
KEEP updated c_p (merge successful)
RETURN
IF o_micro_clusters is not empty:
c_o = nearest o-micro-cluster to p
Try inserting p into c_o
IF radius(c_o) <= epsilon after insert:
IF weight(c_o) > mu * beta:
PROMOTE c_o to p-micro-cluster
ELSE:
KEEP updated c_o
RETURN
CREATE new o-micro-cluster from p
Periodic pruning (every T_p steps):
T_p = ceil((1/decaying_factor) * ln((mu * beta) / (mu * beta - 1)))
FOR each p-micro-cluster c_p:
IF weight(c_p, t) < mu * beta:
DELETE c_p (it has become an outlier)
FOR each o-micro-cluster c_o:
xi = (2^(-decaying_factor*(t - c_o.creation_time + T_p)) - 1) / (2^(-decaying_factor*T_p) - 1)
IF weight(c_o, t) < xi:
DELETE c_o (unlikely to ever become a p-micro-cluster)