
Implementation:Online ml River Tree HoeffdingTreeClassifier

From Leeroopedia


Knowledge Sources: River · River Docs · Mining High-Speed Data Streams
Domains: Online_Learning · Classification · Decision_Trees
Last Updated: 2026-02-08 16:00 GMT

Overview

A concrete tool for incremental decision tree classification that uses the Hoeffding bound for statistically grounded split decisions, supporting multiple split criteria and leaf prediction strategies.

Description

The tree.HoeffdingTreeClassifier class implements the Hoeffding Tree (also known as Very Fast Decision Tree or VFDT) algorithm for online classification. The tree grows incrementally: each leaf node collects sufficient statistics as observations arrive, and when enough evidence accumulates (as determined by the Hoeffding bound), the leaf is replaced by a split node.
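
The split test can be sketched directly from the Hoeffding bound: with probability 1 - delta, the true mean of a random variable with range R lies within epsilon = sqrt(R² ln(1/delta) / (2n)) of the mean observed over n samples. A leaf splits when the merit gap between its two best attributes exceeds epsilon, or when epsilon falls below the tie threshold tau. A minimal sketch of this rule (the helper names are illustrative, not River's internals):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the true mean lies within epsilon of the
    empirical mean over n observations, with probability 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_merit, second_merit, value_range, delta, n, tau=0.05):
    # Split if the merit gap is statistically significant, or force a
    # split when the bound is smaller than the tie threshold tau.
    eps = hoeffding_bound(value_range, delta, n)
    return (best_merit - second_merit > eps) or (eps < tau)

# For info gain with 2 classes, the merit range is log2(2) = 1.
eps_200 = hoeffding_bound(1.0, 1e-7, 200)    # after one grace period
eps_2000 = hoeffding_bound(1.0, 1e-7, 2000)  # more evidence, tighter bound
```

Note how the bound shrinks as n grows: the same merit gap that fails the test after 200 observations may pass it after 2000, which is why grace_period trades split latency against computation.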

The implementation provides extensive configurability:

Split criteria:

  • "info_gain" (default): Information gain based on entropy.
  • "gini": Gini impurity index.
  • "hellinger": Hellinger distance between class distributions.
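
The three criteria can be illustrated on raw class counts. The functions below are simplified stand-ins for River's internal criterion classes, not its actual code:

```python
import math

def entropy(counts: dict) -> float:
    """Shannon entropy (bits) of a class-count dict."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

def info_gain(parent: dict, children: list) -> float:
    """Entropy reduction from splitting parent counts into child counts."""
    n = sum(parent.values())
    weighted = sum(sum(ch.values()) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

def gini(counts: dict) -> float:
    """Gini impurity of a class-count dict."""
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def hellinger(p: dict, q: dict) -> float:
    """Hellinger distance between two discrete distributions."""
    keys = set(p) | set(q)
    return math.sqrt(
        sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
            for k in keys)
    ) / math.sqrt(2)

entropy({"a": 5, "b": 5})  # 1.0 bit: a maximally impure two-class leaf
gini({"a": 5, "b": 5})     # 0.5
```

The Hellinger criterion compares per-branch class distributions rather than impurity, which is why it is often preferred under class imbalance.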

Leaf prediction strategies:

  • "nba" (default): Naive Bayes Adaptive - dynamically chooses between Naive Bayes and majority class based on which performs better at each leaf.
  • "nb": Naive Bayes - uses a Naive Bayes model at each leaf for smoothed predictions.
  • "mc": Majority Class - predicts the most frequent class observed at the leaf.
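
The "nba" strategy can be pictured as a per-leaf race between the two simpler predictors: each incoming example is scored by both before the leaf's statistics are updated, and prediction uses whichever has been more accurate at that leaf so far. A simplified illustration (not River's implementation):

```python
class AdaptiveLeafSketch:
    """Tracks how often majority-class vs. naive-Bayes predictions
    would have been correct at this leaf; predicts with the winner."""

    def __init__(self):
        self.mc_correct = 0
        self.nb_correct = 0

    def learn_one(self, mc_pred, nb_pred, y_true):
        # Score both candidate predictors on the example before it
        # is absorbed into the leaf's sufficient statistics.
        self.mc_correct += mc_pred == y_true
        self.nb_correct += nb_pred == y_true

    def use_naive_bayes(self) -> bool:
        # Fall back to majority class on ties: NB must earn its keep.
        return self.nb_correct > self.mc_correct
```

This is why "nba" is a safe default: it degrades to "mc" at leaves where Naive Bayes has not yet accumulated enough evidence to outperform it.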

Memory management:

  • The tree monitors its own memory consumption at configurable intervals.
  • When the memory limit (max_size MiB) is exceeded, it can deactivate leaves or remove poor attributes.
  • The max_depth parameter limits tree depth to prevent excessive growth.
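
These memory controls combine naturally. A configuration sketch (the values are illustrative, not recommended defaults):

```python
from river import tree

# A memory-conscious configuration: cap tree size, check memory more
# often than the default, and prune losing attributes at the leaves.
model = tree.HoeffdingTreeClassifier(
    max_size=50.0,                    # cap the tree at 50 MiB
    memory_estimate_period=100_000,   # check memory every 100k observations
    remove_poor_attrs=True,           # drop clearly-losing attributes
    max_depth=20,                     # hard depth cap as a second guard
)
```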

The algorithm handles both numeric features (via GaussianSplitter by default) and nominal (categorical) features. Missing split features are handled by routing to the most traversed subtree branch. Previously unseen categorical values in multi-way splits trigger creation of new branches.

Usage

Import this class when you need to:

  • Train a decision tree incrementally from a data stream.
  • Have an interpretable, non-linear classifier for streaming data.
  • Handle mixed feature types (numeric and categorical) in an online setting.
  • Use an algorithm with theoretical convergence guarantees.

Code Reference

Source Location

File Lines
river/tree/hoeffding_tree_classifier.py L13-L417

Signature

class HoeffdingTreeClassifier(HoeffdingTree, base.Classifier):
    def __init__(
        self,
        grace_period: int = 200,
        max_depth: int | None = None,
        split_criterion: str = "info_gain",
        delta: float = 1e-7,
        tau: float = 0.05,
        leaf_prediction: str = "nba",
        nb_threshold: int = 0,
        nominal_attributes: list | None = None,
        splitter: Splitter | None = None,
        binary_split: bool = False,
        min_branch_fraction: float = 0.01,
        max_share_to_split: float = 0.99,
        max_size: float = 100.0,
        memory_estimate_period: int = 1000000,
        stop_mem_management: bool = False,
        remove_poor_attrs: bool = False,
        merit_preprune: bool = True,
    )

    def learn_one(self, x: dict, y, *, w=1.0)
    def predict_proba_one(self, x: dict) -> dict

Import

from river import tree

model = tree.HoeffdingTreeClassifier()

I/O Contract

Inputs

Parameter Type Default Description
grace_period int 200 Number of observations between split attempts at each leaf.
max_depth int | None None Maximum tree depth. None allows unlimited growth.
split_criterion str "info_gain" Split criterion: "info_gain", "gini", or "hellinger".
delta float 1e-7 Significance level for the Hoeffding bound. Smaller values require more evidence before splitting.
tau float 0.05 Tie-breaking threshold. If the Hoeffding bound drops below this value, a split is forced.
leaf_prediction str "nba" Leaf prediction strategy: "nba", "nb", or "mc".
nb_threshold int 0 Minimum observations at a leaf before enabling Naive Bayes predictions.
nominal_attributes list | None None List of feature names to treat as categorical. If None, all features are treated as numeric.
splitter Splitter | None None Attribute observer for computing split candidates; a GaussianSplitter() is used when None. Must have is_target_class = True.
binary_split bool False If True, only binary splits are allowed.
min_branch_fraction float 0.01 Minimum fraction of data required per branch to validate a split.
max_share_to_split float 0.99 Maximum proportion of data in the majority class to allow splitting.
max_size float 100.0 Maximum tree size in mebibytes (MiB).
memory_estimate_period int 1000000 Interval (in observations) between memory consumption checks.
stop_mem_management bool False If True, stop growing when memory limit is reached.
remove_poor_attrs bool False If True, disable poor attributes to reduce memory usage.
merit_preprune bool True If True, enable merit-based pre-pruning.
x (to learn_one) dict (required) Feature dictionary.
y (to learn_one) any hashable (required) Class label.
w (to learn_one) float 1.0 Sample weight.

Outputs

Method Return Type Description
predict_proba_one(x) dict[ClfTarget, float] Dictionary mapping each observed class to its predicted probability.
predict_one(x) ClfTarget The class with the highest predicted probability (inherited from base.Classifier).

Usage Examples

Basic Hoeffding tree classification:

from river import datasets, evaluate, metrics, tree

gen = datasets.synth.Agrawal(classification_function=0, seed=42)
dataset = iter(gen.take(1000))

model = tree.HoeffdingTreeClassifier(
    grace_period=100,
    delta=1e-5,
    nominal_attributes=['elevel', 'car', 'zipcode'],
)

metric = metrics.Accuracy()
evaluate.progressive_val_score(dataset, model, metric)
# Accuracy: 84.58%

With custom split criterion and leaf prediction:

from river import tree

model = tree.HoeffdingTreeClassifier(
    split_criterion='gini',
    leaf_prediction='nb',
    grace_period=100,
    delta=1e-5,
)

In a pipeline with feature scaling:

from river import datasets, evaluate, metrics, preprocessing, tree

model = preprocessing.StandardScaler() | tree.HoeffdingTreeClassifier()
metric = metrics.Accuracy()

evaluate.progressive_val_score(datasets.Phishing(), model, metric)

Manual predict-then-learn loop:

from river import datasets, metrics, tree

model = tree.HoeffdingTreeClassifier()
metric = metrics.Accuracy()

for x, y in datasets.Phishing():
    y_pred = model.predict_one(x)
    if y_pred is not None:
        metric.update(y, y_pred)
    model.learn_one(x, y)

print(metric)

Examining probability distributions:

from river import datasets, tree

model = tree.HoeffdingTreeClassifier()

for i, (x, y) in enumerate(datasets.Phishing()):
    if i > 100:
        proba = model.predict_proba_one(x)
        print(proba)
        # e.g. {False: 0.35, True: 0.65} (exact values depend on the data seen)
        break
    model.learn_one(x, y)
