Principle:Online ml River Tree Split Strategies
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Decision_Trees |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Tree split strategies (also called attribute observers or splitters) are algorithms that incrementally maintain statistics about each candidate split attribute and evaluate the quality of potential splits without storing the raw data. They are a critical component of online decision trees, determining both the quality of the resulting tree and the computational cost of training.
Each splitter type is designed for a specific combination of feature type (numeric or nominal) and task (classification or regression), and offers different trade-offs between accuracy, memory, and speed.
Theoretical Basis
Split Criteria
The quality of a split is measured by a criterion function. Common choices include:
- Information gain (entropy): Measures the reduction in entropy after the split.
- Gini impurity: Measures the probability of misclassification under random labeling.
- Variance reduction: For regression, measures the reduction in target variance.
- Variance ratio: The ratio of between-group variance to total variance.
- Gradient-based: For stochastic gradient trees, measures gradient statistics.
Numeric Attribute Splitters
Numeric features require discretization to evaluate threshold-based splits:
- Exhaustive Binary Split Tree (EBST): Maintains a balanced binary search tree of all observed values and their class statistics. Evaluates every observed value as a potential threshold. Exact but memory-intensive.
- Truncated EBST (TEBST): Limits the EBST to a fixed number of candidate thresholds, reducing memory at the cost of granularity.
- Gaussian splitter: Models the class-conditional distribution of each numeric feature as a Gaussian, using mu and sigma per class. Split candidates are evaluated at the intersections of the Gaussian densities. Memory is O(classes) per attribute.
- Histogram splitter: Maintains a fixed-size histogram of feature values per class. Split candidates are evaluated at bin boundaries. Controllable trade-off between accuracy and memory.
- Quantile Observer (QO): Uses streaming quantile estimation to identify split candidates at quantile boundaries, adapting to the feature distribution.
- Random splitter: Selects a random threshold within the observed range, providing a fast but noisy split evaluation.
Nominal Attribute Splitters
Nominal (categorical) features create one branch per category value:
- Nominal splitter (classification): Maintains class counts per category value. Evaluates information gain or Gini for the resulting multi-way split.
- Nominal splitter (regression): Maintains running mean and variance of the target per category value. Evaluates variance reduction.
SGT Quantizer
For stochastic gradient trees, a specialized quantizer discretizes numeric features into bins and tracks gradient statistics per bin, enabling gradient-based split evaluation.
Base Interface
All splitters conform to a common interface:
class Splitter:
update(x_value, y, sample_weight) # observe one feature value
best_split(criterion) # return the best split point and its quality
remove(x_value, y, sample_weight) # optional: undo an observation
Trade-offs
| Splitter | Memory | Accuracy | Speed |
|---|---|---|---|
| EBST | O(n) | Exact | O(n) per evaluation |
| Gaussian | O(classes) | Approximate | O(1) per update |
| Histogram | O(bins * classes) | Approximate | O(1) per update |
| Random | O(1) | Noisy | O(1) per update |
Related Pages
- Implementation:Online_ml_River_Tree_Splitter_Base
- Implementation:Online_ml_River_Tree_Splitter_EBST
- Implementation:Online_ml_River_Tree_Splitter_Exhaustive
- Implementation:Online_ml_River_Tree_Splitter_Gaussian
- Implementation:Online_ml_River_Tree_Splitter_Histogram
- Implementation:Online_ml_River_Tree_Splitter_NominalClassif
- Implementation:Online_ml_River_Tree_Splitter_NominalReg
- Implementation:Online_ml_River_Tree_Splitter_QO
- Implementation:Online_ml_River_Tree_Splitter_Random
- Implementation:Online_ml_River_Tree_Splitter_SGTQuantizer
- Implementation:Online_ml_River_Tree_Splitter_TEBST