Heuristic:Online ml River HST Feature Scaling Requirement
| Knowledge Sources | |
|---|---|
| Domains | Anomaly_Detection, Preprocessing, Online_Learning |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
By default, River's Half-Space Trees assume every feature lies in the [0, 1] range; prepend a `MinMaxScaler` or pass explicit `limits` to avoid silently incorrect anomaly scores.
Description
Half-Space Trees (HST) assume by default that all feature values lie within the [0, 1] range. The tree construction algorithm builds random axis-aligned splits within these bounds, and features outside this range will produce incorrect anomaly scores. This is explicitly documented in the HST docstring and represents a critical preprocessing requirement that is easy to overlook when composing anomaly detection pipelines.
Usage
Apply this heuristic whenever you use `anomaly.HalfSpaceTrees` in River. If your features are not naturally in the [0, 1] range, prepend a `preprocessing.MinMaxScaler` to the pipeline. Alternatively, if you know the exact feature ranges, pass them via the `limits` parameter to avoid the scaler overhead.
The Insight (Rule of Thumb)
- Action: Use `preprocessing.MinMaxScaler() | anomaly.HalfSpaceTrees()` as a pipeline, or pass an explicit `limits` argument.
- Value: Features must lie in the [0, 1] range unless `limits` declares otherwise. Default assumption: `limits = {feature: (0, 1) for feature in features}`.
- Trade-off: MinMaxScaler adds slight overhead per observation but prevents silently incorrect anomaly scores. The `limits` parameter avoids scaler overhead if ranges are known in advance.
- Score interpretation: HIGH scores indicate anomalies; LOW scores indicate normal observations. This is the opposite of detectors such as scikit-learn's `IsolationForest`, whose `score_samples` returns lower values for more abnormal points.
- Window size: `window_size` (default 250) sets how many observations make up the reference mass profile used for scoring. Smaller windows adapt faster but produce noisier scores.
Reasoning
HST works by building random half-space splits within the assumed feature range. If a feature value of 1000 is encountered when the tree was built assuming [0, 1], every split on that feature lies below the value and routes the observation to the same branch, so the feature stops contributing any discriminative power. The docstring explicitly warns:
"By default, this implementation assumes that each feature has values that are comprised between 0 and 1. If this isn't the case, then you can manually specify the limits via the `limits` argument. If you do not know the limits in advance, then you can use a `preprocessing.MinMaxScaler` as an initial preprocessing step."
The `size_limit` internal constant is `0.1 * window_size` (from the original paper), and the `padding=0.15` parameter prevents pathological tree shapes where splits are too narrow.
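The routing collapse described above can be reproduced with a toy, single-feature tree written in plain Python. This is a simplified illustration, not River's implementation: `build_tree` and `path` are hypothetical helpers, and the padded split rule is an assumption modeled on the `padding=0.15` value mentioned above.

```python
import random


def build_tree(lo, hi, height, rng, padding=0.15):
    """Recursively build a toy single-feature half-space tree over [lo, hi]."""
    if height == 0:
        return None  # leaf
    width = hi - lo
    # Keep the split away from the edges so neither side is too narrow.
    at = rng.uniform(lo + padding * width, hi - padding * width)
    return {
        "at": at,
        "left": build_tree(lo, at, height - 1, rng, padding),
        "right": build_tree(at, hi, height - 1, rng, padding),
    }


def path(tree, value):
    """Return the branch decisions ('L' or 'R') taken by a value."""
    steps = []
    node = tree
    while node is not None:
        side = "L" if value < node["at"] else "R"
        steps.append(side)
        node = node["left"] if side == "L" else node["right"]
    return "".join(steps)


rng = random.Random(42)
tree = build_tree(0.0, 1.0, height=8, rng=rng)  # built assuming [0, 1]

in_range = [path(tree, v) for v in (0.1, 0.5, 0.9)]
out_of_range = [path(tree, v) for v in (5.0, 1000.0, 1e9)]

print(in_range)      # paths differ: 0.1 starts with 'L', 0.9 starts with 'R'
print(out_of_range)  # every path is 'RRRRRRRR'
```

Because every split point is strictly below 1, any value at or above 1 takes the right branch at every level. The values 5, 1000, and 1e9 therefore land in the same leaf and cannot be told apart.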
Code Evidence
HST docstring warning from `river/anomaly/hst.py:100-103`:
"""Half-Space Trees (HST).
Half-space trees are an online variant of isolation forests. They work well
when anomalies are spread out. However, they do not work well if anomalies
are packed together in windows.
By default, this implementation assumes that each feature has values that
are comprised between 0 and 1. If this isn't the case, then you can
manually specify the limits via the `limits` argument. If you do not know
the limits in advance, then you can use a `preprocessing.MinMaxScaler` as
an initial preprocessing step.
"""
HST score interpretation from `river/anomaly/hst.py:110`:
# Note that high scores indicate anomalies, whereas low scores indicate
# normal observations.
HST parameters from `river/anomaly/hst.py:112-125`:
# n_trees: Number of trees to use (default 10)
# height: Height of each tree (default 8)
# window_size: Number of observations for mass calculation (default 250)
# limits: Range of each feature (default [0, 1] per feature)