Heuristic:Online ml River HST Feature Scaling Requirement

From Leeroopedia




Knowledge Sources
Domains: Anomaly_Detection, Preprocessing, Online_Learning
Last Updated: 2026-02-08 16:00 GMT

Overview

Half-Space Trees assume by default that every feature lies in the [0, 1] range; prepend a `preprocessing.MinMaxScaler` or pass explicit `limits` to avoid incorrect anomaly scores.

Description

Half-Space Trees (HST) assume by default that all feature values lie within the [0, 1] range. The tree construction algorithm builds random axis-aligned splits within these bounds, and features outside this range will produce incorrect anomaly scores. This is explicitly documented in the HST docstring and represents a critical preprocessing requirement that is easy to overlook when composing anomaly detection pipelines.
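
The effect of an out-of-range value can be illustrated without River. The sketch below is a deliberate simplification, not River's implementation: it draws independent thresholds uniformly inside the assumed [0, 1] range and records which side of each threshold a value falls on. The helper names `random_splits` and `directions` are made up for this example.

```python
import random


def random_splits(n_splits, low=0.0, high=1.0, seed=0):
    """Draw split thresholds uniformly inside the assumed feature range."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n_splits)]


def directions(value, thresholds):
    """Return 'L' if the value is below a threshold, 'R' otherwise, per split."""
    return ["L" if value < t else "R" for t in thresholds]


thresholds = random_splits(n_splits=8)

# Values inside [0, 1] take different paths through the splits.
in_range_a = directions(0.2, thresholds)
in_range_b = directions(0.8, thresholds)

# Values far outside [0, 1] always land on the same side of every split.
out_of_range = directions(1000.0, thresholds)
another_out_of_range = directions(5000.0, thresholds)

print("0.2   ->", "".join(in_range_a))
print("0.8   ->", "".join(in_range_b))
print("1000  ->", "".join(out_of_range))
print("5000  ->", "".join(another_out_of_range))
```

Any value above 1 ends up on the right of every threshold, so 1000 and 5000 become indistinguishable from each other, whereas in-range values follow different paths.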

Usage

Always apply this heuristic when using `anomaly.HalfSpaceTrees` in River. If your features do not naturally lie in the [0, 1] range, prepend a `preprocessing.MinMaxScaler` to the pipeline. Alternatively, if you know the feature ranges in advance, pass them via the `limits` parameter to avoid the scaler overhead.

The Insight (Rule of Thumb)

  • Action: Use `preprocessing.MinMaxScaler() | anomaly.HalfSpaceTrees()` as a pipeline, or provide an explicit `limits` parameter.
  • Value: Features must be in [0, 1] range. Default assumption: `limits = {feature: (0, 1) for feature in features}`.
  • Trade-off: MinMaxScaler adds slight overhead per observation but prevents silently incorrect anomaly scores. The `limits` parameter avoids scaler overhead if ranges are known in advance.
  • Score interpretation: High scores indicate anomalies; low scores indicate normal observations. Some other detectors use the opposite convention (for example, scikit-learn's `IsolationForest.score_samples`, where lower values are more abnormal).
  • Window size: Default `window_size=250` controls the reference mass for scoring. Smaller windows adapt faster but produce noisier scores.

Reasoning

HST works by building random half-space splits within the assumed feature range. If a value of 1000 arrives for a feature whose splits were drawn inside [0, 1], every split on that feature sends the observation to the same side, so the trees can no longer tell it apart from any other out-of-range value and their discriminative power collapses. The docstring explicitly warns:

"By default, this implementation assumes that each feature has values that are comprised between 0 and 1. If this isn't the case, then you can manually specify the limits via the `limits` argument. If you do not know the limits in advance, then you can use a `preprocessing.MinMaxScaler` as an initial preprocessing step."

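The internal `size_limit` value is `0.1 * window_size` (the factor used in the original paper), i.e. 25 observations at the default `window_size=250`. The `padding=0.15` setting keeps split points away from the edges of a feature's range, which prevents pathological tree shapes where splits are too narrow.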
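
The arithmetic behind these two constants can be sketched in plain Python. This is an illustration under the assumption that padding is applied as a fraction of the feature's range on both sides; the helper `padded_split` is hypothetical and is not River's own function.

```python
import random

window_size = 250
size_limit = 0.1 * window_size  # 25 observations at the default window size


def padded_split(low, high, padding=0.15, rng=random):
    """Draw a split point that stays `padding` of the range away from both edges."""
    margin = padding * (high - low)
    return rng.uniform(low + margin, high - margin)


rng = random.Random(42)
splits = [padded_split(0.0, 1.0, rng=rng) for _ in range(1000)]

print(f"size_limit = {size_limit}")
print(f"split points fall in [{min(splits):.3f}, {max(splits):.3f}]")
```

For a [0, 1] feature, splits drawn this way stay within [0.15, 0.85], so neither child of a node covers less than 15% of its parent's range.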

Code Evidence

HST docstring warning from `river/anomaly/hst.py:100-103`:

"""Half-Space Trees (HST).

Half-space trees are an online variant of isolation forests. They work well
when anomalies are spread out. However, they do not work well if anomalies
are packed together in windows.

By default, this implementation assumes that each feature has values that
are comprised between 0 and 1. If this isn't the case, then you can
manually specify the limits via the `limits` argument. If you do not know
the limits in advance, then you can use a `preprocessing.MinMaxScaler` as
an initial preprocessing step.
"""

HST score interpretation from `river/anomaly/hst.py:110`:

# Note that high scores indicate anomalies, whereas low scores indicate
# normal observations.

HST parameters from `river/anomaly/hst.py:112-125`:

# n_trees: Number of trees to use (default 10)
# height: Height of each tree (default 8)
# window_size: Number of observations for mass calculation (default 250)
# limits: Range of each feature (default [0, 1] per feature)
