Principle: Online ML with River's Half-Space Trees for Anomaly Detection
| Knowledge Sources | River Docs; Tan et al., "Fast Anomaly Detection for Streaming Data" |
|---|---|
| Domains | Online Machine Learning, Anomaly Detection, Ensemble Methods |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Ensemble-based anomaly detection algorithm that uses randomly partitioned half-space trees to score observations based on mass estimation, designed specifically for streaming data.
Description
Half-Space Trees (HST) are an online variant of isolation forests, designed for fast anomaly detection on streaming data. The core idea is that anomalous points tend to be isolated in low-density regions of the feature space, and this isolation can be detected by measuring the "mass" (number of observations) that falls into randomly defined partitions.
The algorithm builds an ensemble of random binary trees. Each tree partitions the feature space using random axis-aligned splits. For each tree, every internal node defines a random split on a randomly chosen feature at a randomly chosen value within that feature's range. Each leaf and internal node maintains a mass counter that tracks how many observations have passed through it.
The algorithm uses a dual-window mechanism:
- Working window (`l_mass`): counts the observations seen in the current window.
- Reference window (`r_mass`): stores the counts accumulated over the previous window.
After window_size observations, the working window is "pivoted" to become the reference window, and the working window counters are reset to zero. Anomaly scoring uses the reference window masses.
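The pivot logic can be illustrated with a single pair of counters (a toy sketch, not River's implementation; `window_size=3` and the stream length are arbitrary choices):

```python
# Minimal sketch of the dual-window mechanism for one mass counter.
window_size = 3
l_mass = 0   # working window: counts observations in the current window
r_mass = 0   # reference window: counts from the previous, completed window
counter = 0

for _ in range(7):  # stream of 7 observations
    l_mass += 1
    counter += 1
    if counter == window_size:
        # pivot: working counts become the reference, working resets to zero
        r_mass, l_mass = l_mass, 0
        counter = 0

print(r_mass, l_mass)  # prints: 3 1  (two pivots done, one observation into the third window)
```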
The anomaly score for an observation is computed by traversing each tree from root to the node where the reference mass falls below a threshold. Observations that land in regions with low reference mass (shallow traversal) receive high anomaly scores, indicating they are in sparse regions. The final score is normalized to [0, 1] where high scores indicate anomalies.
Half-Space Trees work well when anomalies are spread out in the feature space. They do not perform well when anomalies are packed together in clusters within a single window.
Usage
Use Half-Space Trees when:
- You need online, one-pass anomaly detection on a data stream
- Anomalies are expected to be isolated points in sparse regions
- Features are bounded (or can be normalized to [0, 1] via MinMaxScaler)
- You need a lightweight ensemble method with tunable complexity (number of trees, tree height)
- You want anomaly scores normalized to [0, 1]
Theoretical Basis
Reference: Tan, S.C., Ting, K.M. and Liu, T.F., 2011. "Fast Anomaly Detection for Streaming Data." Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI).
Tree Construction:
Each tree is a complete binary tree of a specified height h. Each internal node stores:
- A randomly chosen feature dimension
- A randomly chosen split value within the feature's range
- Two mass counters: `r_mass` (reference) and `l_mass` (working)
```
BUILD_TREE(limits, height, rng):
    if height == 0:
        return Leaf(r_mass=0, l_mass=0)
    feature = rng.choice(features)
    split = rng.uniform(limits[feature].min, limits[feature].max)
    # each child covers one half-space of the chosen feature
    left  = BUILD_TREE(limits with limits[feature].max = split, height-1, rng)
    right = BUILD_TREE(limits with limits[feature].min = split, height-1, rng)
    return Node(feature, split, left, right, r_mass=0, l_mass=0)
```
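The recursive construction translates directly to Python. Below is a self-contained sketch with hypothetical `Node`/`Leaf` classes (River's internals differ; limits are expressed as `(min, max)` tuples for simplicity):

```python
import random
from dataclasses import dataclass

@dataclass
class Leaf:  # hypothetical node type for illustration
    r_mass: int = 0
    l_mass: int = 0

@dataclass
class Node:  # hypothetical node type for illustration
    feature: str
    split: float
    left: object
    right: object
    r_mass: int = 0
    l_mass: int = 0

def build_tree(limits, height, rng):
    """Build a complete binary half-space tree of the given height."""
    if height == 0:
        return Leaf()
    feature = rng.choice(sorted(limits))
    lo, hi = limits[feature]
    split = rng.uniform(lo, hi)
    # Each child works on one half-space of the chosen feature.
    left = build_tree({**limits, feature: (lo, split)}, height - 1, rng)
    right = build_tree({**limits, feature: (split, hi)}, height - 1, rng)
    return Node(feature, split, left, right)

rng = random.Random(42)
tree = build_tree({"x": (0.0, 1.0), "y": (0.0, 1.0)}, height=3, rng=rng)
```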
Learning (updating mass counters):
```
LEARN_ONE(x):
    for each tree t in ensemble:
        for each node on the path from root to leaf for x:
            node.l_mass += 1
    counter += 1
    if counter == window_size:
        PIVOT_WINDOWS()
        counter = 0

PIVOT_WINDOWS():
    for each tree t:
        for each node in t:
            node.r_mass = node.l_mass
            node.l_mass = 0
```
Scoring:
```
SCORE_ONE(x):
    if in first window:
        return 0  # no reference data yet
    score = 0
    for each tree t:
        for depth, node in walk(t, x):
            score += node.r_mass * 2^depth
            if node.r_mass < 0.1 * window_size:
                break  # stop at sparse region
    score = score / max_possible_score
    return 1 - score  # invert: high = anomalous
```
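Putting the pieces together, the build/learn/pivot/score cycle can be sketched as one small class. This is an illustrative toy under the assumptions above (dict-based nodes, `(min, max)` limit tuples), not River's `HalfSpaceTrees`:

```python
import random

class ToyHST:
    """Toy half-space-trees detector (illustration only, not River's code)."""

    def __init__(self, limits, n_trees=5, height=4, window_size=50, seed=0):
        rng = random.Random(seed)
        self.window_size = window_size
        self.counter = 0
        self.first_window = True  # no reference masses until the first pivot
        self.trees = [self._build(limits, height, rng) for _ in range(n_trees)]
        # Normalization constant: n_trees * window_size * (2^(h+1) - 1).
        self.max_score = n_trees * window_size * (2 ** (height + 1) - 1)

    def _build(self, limits, height, rng):
        node = {"r": 0, "l": 0}  # reference and working mass counters
        if height > 0:
            f = rng.choice(sorted(limits))
            lo, hi = limits[f]
            node["feature"], node["split"] = f, rng.uniform(lo, hi)
            node["left"] = self._build({**limits, f: (lo, node["split"])}, height - 1, rng)
            node["right"] = self._build({**limits, f: (node["split"], hi)}, height - 1, rng)
        return node

    def _walk(self, node, x):
        """Yield the nodes on x's root-to-leaf path."""
        while True:
            yield node
            if "feature" not in node:  # reached a leaf
                return
            node = node["left"] if x[node["feature"]] < node["split"] else node["right"]

    def learn_one(self, x):
        for tree in self.trees:
            for node in self._walk(tree, x):
                node["l"] += 1  # update working masses along the path
        self.counter += 1
        if self.counter == self.window_size:
            for tree in self.trees:
                self._pivot(tree)  # working window becomes the reference
            self.counter = 0
            self.first_window = False

    def _pivot(self, node):
        node["r"], node["l"] = node["l"], 0
        if "feature" in node:
            self._pivot(node["left"])
            self._pivot(node["right"])

    def score_one(self, x):
        if self.first_window:
            return 0.0  # no reference data yet
        score = 0
        for tree in self.trees:
            for depth, node in enumerate(self._walk(tree, x)):
                score += node["r"] * 2 ** depth
                if node["r"] < 0.1 * self.window_size:
                    break  # sparse region: stop descending
        return 1.0 - score / self.max_score  # high = anomalous

# Feed a dense stream around x = 0.5, then score a typical vs. an extreme point.
hst = ToyHST({"x": (0.0, 1.0)}, n_trees=10, height=4, window_size=50, seed=1)
data = random.Random(7)
for _ in range(100):
    hst.learn_one({"x": 0.4 + 0.2 * data.random()})
normal = hst.score_one({"x": 0.5})
outlier = hst.score_one({"x": 0.99})
```

The extreme point lands in regions whose reference masses fall below the size-limit threshold almost immediately, so its traversal accumulates little mass and it scores higher than the typical point.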
Key properties:
- Score range: [0, 1], where 1 = highly anomalous, 0 = normal
- Size limit threshold: `0.1 * window_size` (magic constant from the original paper)
- Maximum possible score: `n_trees * window_size * (2^(h+1) - 1)`
- Time complexity: O(n_trees * height) per learn/score call (one root-to-leaf path per tree); the pivot at each window boundary touches every node, costing O(n_trees * 2^height)
- Space complexity: O(n_trees * 2^height) nodes total
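As a quick sanity check on the space bound (example hyperparameter values only; a complete binary tree of height h has 2^(h+1) - 1 nodes):

```python
n_trees, height = 10, 8  # example hyperparameters

nodes_per_tree = 2 ** (height + 1) - 1  # complete binary tree of height 8
total_nodes = n_trees * nodes_per_tree  # O(n_trees * 2^height) space, allocated up front

print(nodes_per_tree, total_nodes)  # prints: 511 5110
```

The footprint is fixed at construction time and independent of stream length, which is what makes the method suitable for unbounded streams.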