Principle:Online ml River ODAC Clustering
| Knowledge Sources | Machine Learning ODAC: Hierarchical Clustering of Time Series Data Streams |
|---|---|
| Domains | Online_Learning Clustering Time_Series |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Online Divisive-Agglomerative Clustering (ODAC) is an incremental hierarchical clustering algorithm designed for data streams. It builds and maintains a hierarchical cluster tree by combining top-down (divisive) splitting with bottom-up (agglomerative) merging, adapting the tree structure as the data distribution evolves.
Description
Traditional hierarchical clustering algorithms require the entire dataset in memory and produce a fixed dendrogram. ODAC adapts this paradigm to the streaming setting through several key mechanisms:
- Incremental statistics: Each node in the hierarchy maintains sufficient statistics (means, correlations) that are updated incrementally as new observations arrive.
- Statistical split decisions: Nodes are split when there is statistically significant evidence that the data at a node belongs to two or more distinct groups, using confidence bounds to control the decision.
- Agglomerative merging: Sibling nodes that become statistically indistinguishable over time are merged back together, allowing the hierarchy to adapt when previously distinct clusters converge.
- Correlation-based distance: ODAC commonly uses correlation-based dissimilarity as its distance metric, making it well-suited for time series clustering.
The combination of divisive and agglomerative operations gives ODAC the ability to both refine and coarsen the hierarchy as the data distribution changes, distinguishing it from purely top-down or bottom-up approaches.
Usage
Use ODAC clustering when:
- You need hierarchical clustering on streaming data.
- The number of clusters is unknown and may change over time.
- You are clustering time series data and want correlation-based grouping.
- You need the cluster hierarchy to adapt as the data distribution evolves.
Theoretical Basis
Correlation-based dissimilarity: For two time series or feature streams and :
d(X, Y) = (1 - corr(X, Y)) / 2
Where is the Pearson correlation coefficient, computed incrementally using running sums.
Split criterion: A node is split when the diameter (maximum pairwise dissimilarity) within the node exceeds a threshold determined by a confidence bound. Let be the maximum intra-cluster distance and be the second largest:
Split if: d_max - d_second > epsilon(n, delta)
Where is a confidence bound (e.g., Hoeffding bound) computed from the number of observations and confidence parameter .
Merge criterion: Two sibling nodes are merged when their inter-cluster dissimilarity falls below a threshold:
Merge if: d(C_left, C_right) < tau(n, delta)
Incremental correlation: The Pearson correlation is maintained using running sums:
corr(X, Y) = (n * sum_xy - sum_x * sum_y) /
sqrt((n * sum_x2 - sum_x^2) * (n * sum_y2 - sum_y^2))
All sums are updated in per observation.