Implementation: Online ML / River / tree.HoeffdingTreeClassifier
| Knowledge Sources | River, River Docs, "Mining High-Speed Data Streams" |
|---|---|
| Domains | Online_Learning, Classification, Decision_Trees |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Concrete tool for incremental decision tree classification using the Hoeffding bound for statistically grounded split decisions, supporting multiple split criteria and leaf prediction strategies.
Description
The tree.HoeffdingTreeClassifier class implements the Hoeffding Tree (also known as Very Fast Decision Tree or VFDT) algorithm for online classification. The tree grows incrementally: each leaf node collects sufficient statistics as observations arrive, and when enough evidence accumulates (as determined by the Hoeffding bound), the leaf is replaced by a split node.
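The split test can be pictured as follows. This is an illustrative sketch of the Hoeffding-bound logic, not River's internal code; the function names are hypothetical, and for information gain over c classes the range R is log2(c):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """With probability 1 - delta, the true mean of a random variable
    with the given range lies within this distance of the mean
    observed over n samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_merit, second_merit, value_range, delta, n, tau=0.05):
    """Split when the merit gap between the two best split candidates
    exceeds the bound, or when the bound falls below the tie
    threshold tau (mirroring the delta and tau parameters)."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_merit - second_merit > eps) or (eps < tau)
```

Because the bound shrinks as n grows, a leaf that keeps receiving observations will eventually split on the genuinely best attribute with high probability.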
The implementation provides extensive configurability:
Split criteria:
- "info_gain" (default): Information gain based on entropy.
- "gini": Gini impurity index.
- "hellinger": Hellinger distance between class distributions.
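As a rough sketch of what the first two criteria measure at a leaf (illustrative only, not River's splitter code), both reduce a dictionary of class counts to an impurity score, and a split's merit is the impurity reduction it achieves:

```python
import math

def entropy(counts: dict) -> float:
    """Shannon entropy (bits) of a class-count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def gini(counts: dict) -> float:
    """Gini impurity of a class-count distribution."""
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())
```

Both are maximal for a uniform class distribution and zero for a pure leaf; "hellinger" instead compares the per-branch class distributions directly, which can help under class imbalance.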
Leaf prediction strategies:
- "nba" (default): Naive Bayes Adaptive. Dynamically chooses between Naive Bayes and majority class based on which performs better at each leaf.
- "nb": Naive Bayes. Uses a Naive Bayes model at each leaf for smoothed predictions.
- "mc": Majority Class. Predicts the most frequent class observed at the leaf.
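The adaptive strategy can be pictured as a running contest between the two predictors at each leaf. The class below is a toy sketch of that idea, not River's leaf implementation:

```python
class AdaptiveLeaf:
    """Toy sketch of the 'nba' idea: score the majority-class and
    Naive Bayes predictions against each incoming label, and answer
    queries with whichever predictor is currently ahead."""

    def __init__(self):
        self.mc_correct = 0
        self.nb_correct = 0

    def observe(self, mc_pred, nb_pred, y_true):
        # Both candidate predictors are evaluated on every sample.
        self.mc_correct += mc_pred == y_true
        self.nb_correct += nb_pred == y_true

    def strategy(self) -> str:
        # Prefer Naive Bayes only once it has proven more accurate.
        return "nb" if self.nb_correct > self.mc_correct else "mc"
```

This is why "nba" is a safe default: it falls back to the simple majority-class rule whenever Naive Bayes is not actually helping at that leaf.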
Memory management:
- The tree monitors its own memory consumption at configurable intervals.
- When the memory limit (max_size MiB) is exceeded, it can deactivate leaves or remove poor attributes.
- The max_depth parameter limits tree depth to prevent excessive growth.
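The bookkeeping behind the first two points can be sketched as a periodic check. Everything here is hypothetical naming, not River internals; it only illustrates how memory_estimate_period and max_size interact:

```python
def maybe_check_memory(n_seen, estimate_period, estimate_size_mib,
                       max_size_mib, shrink):
    """Every `estimate_period` observations, compare the estimated
    tree size against the budget and trigger shrinking (e.g.
    deactivating the least promising leaves) when it is exceeded.
    All names are illustrative."""
    if n_seen % estimate_period == 0 and estimate_size_mib() > max_size_mib:
        shrink()
        return True
    return False
```

A larger memory_estimate_period makes the check cheaper but lets the tree overshoot its budget for longer between checks.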
The algorithm handles both numeric features (via GaussianSplitter by default) and nominal (categorical) features. Missing split features are handled by routing to the most traversed subtree branch. Previously unseen categorical values in multi-way splits trigger creation of new branches.
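The two routing rules in that paragraph can be pictured with a toy multi-way split node (illustrative only; River's actual node classes differ):

```python
class NominalSplitNode:
    """Toy multi-way split on one nominal feature: an unseen category
    spawns a fresh branch, and a sample missing the split feature
    falls back to the most traversed branch."""

    def __init__(self, feature):
        self.feature = feature
        self.children = {}   # category -> child subtree (stubbed here)
        self.traffic = {}    # category -> number of routed samples

    def route(self, x: dict):
        if self.feature not in x:
            # Missing split feature: follow the most traversed branch.
            value = max(self.traffic, key=self.traffic.get)
        else:
            value = x[self.feature]
            if value not in self.children:
                # Previously unseen category: create a new branch.
                self.children[value] = object()
        self.traffic[value] = self.traffic.get(value, 0) + 1
        return value
```
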
Usage
Import this class when you need to:
- Train a decision tree incrementally from a data stream.
- Have an interpretable, non-linear classifier for streaming data.
- Handle mixed feature types (numeric and categorical) in an online setting.
- Use an algorithm with theoretical convergence guarantees.
Code Reference
Source Location
| File | Lines |
|---|---|
| river/tree/hoeffding_tree_classifier.py | L13-L417 |
Signature
class HoeffdingTreeClassifier(HoeffdingTree, base.Classifier):
def __init__(
self,
grace_period: int = 200,
max_depth: int | None = None,
split_criterion: str = "info_gain",
delta: float = 1e-7,
tau: float = 0.05,
leaf_prediction: str = "nba",
nb_threshold: int = 0,
nominal_attributes: list | None = None,
splitter: Splitter | None = None,
binary_split: bool = False,
min_branch_fraction: float = 0.01,
max_share_to_split: float = 0.99,
max_size: float = 100.0,
memory_estimate_period: int = 1000000,
stop_mem_management: bool = False,
remove_poor_attrs: bool = False,
merit_preprune: bool = True,
)
def learn_one(self, x: dict, y, *, w=1.0)
def predict_proba_one(self, x: dict) -> dict
Import
from river import tree
model = tree.HoeffdingTreeClassifier()
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| grace_period | int | 200 | Number of observations between split attempts at each leaf. |
| max_depth | int or None | None | Maximum tree depth. None allows unlimited growth. |
| split_criterion | str | "info_gain" | Split criterion: "info_gain", "gini", or "hellinger". |
| delta | float | 1e-7 | Significance level for the Hoeffding bound. Smaller values require more evidence before splitting. |
| tau | float | 0.05 | Tie-breaking threshold. If the Hoeffding bound drops below this value, a split is forced. |
| leaf_prediction | str | "nba" | Leaf prediction strategy: "nba", "nb", or "mc". |
| nb_threshold | int | 0 | Minimum observations at a leaf before enabling Naive Bayes predictions. |
| nominal_attributes | list or None | None | List of feature names to treat as categorical. If None, all features are treated as numeric. |
| splitter | Splitter or None | None (GaussianSplitter is used) | Attribute observer for computing split candidates. Must have is_target_class = True. |
| binary_split | bool | False | If True, only binary splits are allowed. |
| min_branch_fraction | float | 0.01 | Minimum fraction of data required per branch to validate a split. |
| max_share_to_split | float | 0.99 | Maximum proportion of data in the majority class to allow splitting. |
| max_size | float | 100.0 | Maximum tree size in mebibytes (MiB). |
| memory_estimate_period | int | 1000000 | Interval (in observations) between memory consumption checks. |
| stop_mem_management | bool | False | If True, stop growing when the memory limit is reached. |
| remove_poor_attrs | bool | False | If True, disable poor attributes to reduce memory usage. |
| merit_preprune | bool | True | If True, enable merit-based pre-pruning. |
| x (to learn_one) | dict | (required) | Feature dictionary. |
| y (to learn_one) | any hashable | (required) | Class label. |
| w (to learn_one) | float | 1.0 | Sample weight. |
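The sample weight w scales how strongly an observation counts toward the leaf statistics. A toy sketch of that semantics (illustrative, not River's leaf code):

```python
from collections import defaultdict

class WeightedCounts:
    """Toy sketch of weighted leaf statistics: each observation adds
    its weight w (rather than 1) to its class count, so a sample with
    w=3.0 influences the class distribution like three unit samples."""

    def __init__(self):
        self.counts = defaultdict(float)

    def learn_one(self, y, w: float = 1.0):
        self.counts[y] += w

    def predict_proba(self) -> dict:
        total = sum(self.counts.values())
        return {y: c / total for y, c in self.counts.items()}
```
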
Outputs
| Method | Return Type | Description |
|---|---|---|
| predict_proba_one(x) | dict[ClfTarget, float] | Dictionary mapping each observed class to its predicted probability. |
| predict_one(x) | ClfTarget | The class with the highest predicted probability (inherited from base.Classifier). |
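The relation between the two methods can be sketched as a simple argmax over the probability dictionary (an illustration of the inherited behavior, not River's source):

```python
def predict_one_from_proba(proba: dict):
    """Pick the class with the highest predicted probability, or
    return None when the model has not yet observed any class
    (empty probability dictionary)."""
    if not proba:
        return None
    return max(proba, key=proba.get)
```

This is why predict_one can return None early in a stream, as the manual predict-then-learn loop below guards against.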
Usage Examples
Basic Hoeffding tree classification:
from river import datasets, evaluate, metrics, tree
gen = datasets.synth.Agrawal(classification_function=0, seed=42)
dataset = iter(gen.take(1000))
model = tree.HoeffdingTreeClassifier(
grace_period=100,
delta=1e-5,
nominal_attributes=['elevel', 'car', 'zipcode'],
)
metric = metrics.Accuracy()
evaluate.progressive_val_score(dataset, model, metric)
# Accuracy: 84.58%
With custom split criterion and leaf prediction:
from river import tree
model = tree.HoeffdingTreeClassifier(
split_criterion='gini',
leaf_prediction='nb',
grace_period=100,
delta=1e-5,
)
In a pipeline with feature scaling:
from river import datasets, evaluate, metrics, preprocessing, tree
model = preprocessing.StandardScaler() | tree.HoeffdingTreeClassifier()
metric = metrics.Accuracy()
evaluate.progressive_val_score(datasets.Phishing(), model, metric)
Manual predict-then-learn loop:
from river import datasets, metrics, tree
model = tree.HoeffdingTreeClassifier()
metric = metrics.Accuracy()
for x, y in datasets.Phishing():
y_pred = model.predict_one(x)
if y_pred is not None:
metric.update(y, y_pred)
model.learn_one(x, y)
print(metric)
Examining probability distributions:
from river import datasets, tree
model = tree.HoeffdingTreeClassifier()
for i, (x, y) in enumerate(datasets.Phishing()):
if i > 100:
proba = model.predict_proba_one(x)
print(proba)
# {False: 0.35, True: 0.65}
break
model.learn_one(x, y)