Principle:Online ml River ODAC Clustering

Knowledge Sources	Machine Learning ODAC: Hierarchical Clustering of Time Series Data Streams
Domains	Online_Learning Clustering Time_Series
Last Updated	2026-02-08 18:00 GMT

Overview

Online Divisive-Agglomerative Clustering (ODAC) is an incremental hierarchical clustering algorithm designed for data streams. It builds and maintains a hierarchical cluster tree by combining top-down (divisive) splitting with bottom-up (agglomerative) merging, adapting the tree structure as the data distribution evolves.

Description

Traditional hierarchical clustering algorithms require the entire dataset in memory and produce a fixed dendrogram. ODAC adapts this paradigm to the streaming setting through several key mechanisms:

Incremental statistics: Each node in the hierarchy maintains sufficient statistics (means, correlations) that are updated incrementally as new observations arrive.
Statistical split decisions: Nodes are split when there is statistically significant evidence that the data at a node belongs to two or more distinct groups, using confidence bounds to control the decision.
Agglomerative merging: Sibling nodes that become statistically indistinguishable over time are merged back together, allowing the hierarchy to adapt when previously distinct clusters converge.
Correlation-based distance: ODAC commonly uses correlation-based dissimilarity as its distance metric, making it well-suited for time series clustering.

The combination of divisive and agglomerative operations gives ODAC the ability to both refine and coarsen the hierarchy as the data distribution changes, distinguishing it from purely top-down or bottom-up approaches.

Usage

Use ODAC clustering when:

You need hierarchical clustering on streaming data.
The number of clusters is unknown and may change over time.
You are clustering time series data and want correlation-based grouping.
You need the cluster hierarchy to adapt as the data distribution evolves.

Theoretical Basis

Correlation-based dissimilarity: For two time series or feature streams $X$ and $Y$ :

d(X, Y) = (1 - corr(X, Y)) / 2

Where $corr (X, Y)$ is the Pearson correlation coefficient, computed incrementally using running sums.

Split criterion: A node is split when the diameter (maximum pairwise dissimilarity) within the node exceeds a threshold determined by a confidence bound. Let $d_{\max}$ be the maximum intra-cluster distance and $d_{second}$ be the second largest:

Split if:  d_max - d_second > epsilon(n, delta)

Where $ϵ$ is a confidence bound (e.g., Hoeffding bound) computed from the number of observations $n$ and confidence parameter $δ$ .

Merge criterion: Two sibling nodes are merged when their inter-cluster dissimilarity falls below a threshold:

Merge if:  d(C_left, C_right) < tau(n, delta)

Incremental correlation: The Pearson correlation is maintained using running sums:

corr(X, Y) = (n * sum_xy - sum_x * sum_y) /
             sqrt((n * sum_x2 - sum_x^2) * (n * sum_y2 - sum_y^2))

All sums are updated in $O (1)$ per observation.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment