Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Online ml River Tree Split Strategies

From Leeroopedia
Revision as of 17:29, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Online_ml_River_Tree_Split_Strategies.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Online_Learning, Decision_Trees
Last Updated 2026-02-08 18:00 GMT

Overview

Tree split strategies (also called attribute observers or splitters) are algorithms that incrementally maintain statistics about each candidate split attribute and evaluate the quality of potential splits without storing the raw data. They are a critical component of online decision trees, determining both the quality of the resulting tree and the computational cost of training.

Each splitter type is designed for a specific combination of feature type (numeric or nominal) and task (classification or regression), and offers different trade-offs between accuracy, memory, and speed.

Theoretical Basis

Split Criteria

The quality of a split is measured by a criterion function. Common choices include:

  • Information gain (entropy): Measures the reduction in entropy after the split.
  • Gini impurity: Measures the probability of misclassification under random labeling.
  • Variance reduction: For regression, measures the reduction in target variance.
  • Variance ratio: The ratio of between-group variance to total variance.
  • Gradient-based: For stochastic gradient trees, measures gradient statistics.

Numeric Attribute Splitters

Numeric features require discretization to evaluate threshold-based splits:

  • Exhaustive Binary Split Tree (EBST): Maintains a balanced binary search tree of all observed values and their class statistics. Evaluates every observed value as a potential threshold. Exact but memory-intensive.
  • Truncated EBST (TEBST): Limits the EBST to a fixed number of candidate thresholds, reducing memory at the cost of granularity.
  • Gaussian splitter: Models the class-conditional distribution of each numeric feature as a Gaussian, using mu and sigma per class. Split candidates are evaluated at the intersections of the Gaussian densities. Memory is O(classes) per attribute.
  • Histogram splitter: Maintains a fixed-size histogram of feature values per class. Split candidates are evaluated at bin boundaries. Controllable trade-off between accuracy and memory.
  • Quantile Observer (QO): Uses streaming quantile estimation to identify split candidates at quantile boundaries, adapting to the feature distribution.
  • Random splitter: Selects a random threshold within the observed range, providing a fast but noisy split evaluation.

Nominal Attribute Splitters

Nominal (categorical) features create one branch per category value:

  • Nominal splitter (classification): Maintains class counts per category value. Evaluates information gain or Gini for the resulting multi-way split.
  • Nominal splitter (regression): Maintains running mean and variance of the target per category value. Evaluates variance reduction.

SGT Quantizer

For stochastic gradient trees, a specialized quantizer discretizes numeric features into bins and tracks gradient statistics per bin, enabling gradient-based split evaluation.

Base Interface

All splitters conform to a common interface:

class Splitter:
    update(x_value, y, sample_weight)    # observe one feature value
    best_split(criterion)                 # return the best split point and its quality
    remove(x_value, y, sample_weight)     # optional: undo an observation

Trade-offs

Splitter Memory Accuracy Speed
EBST O(n) Exact O(n) per evaluation
Gaussian O(classes) Approximate O(1) per update
Histogram O(bins * classes) Approximate O(1) per update
Random O(1) Noisy O(1) per update

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment