Principle:Online ml River Time Series Evaluation

Knowledge Sources	Domains	Last Updated
River, River Docs	Online Machine Learning, Time Series Forecasting, Model Evaluation	2026-02-08 16:00 GMT

Overview

Evaluation protocol for streaming time series forecasters that computes per-horizon-step metrics using a walk-forward validation approach.

Description

Evaluating online forecasting models requires a fundamentally different approach from batch evaluation. Instead of a single train/test split, walk-forward validation (also known as progressive validation) evaluates the model at every time step: the model produces a multi-step forecast, the true future values are observed, the per-step metrics are updated, and then the model learns from the current observation.

River's time_series.evaluate function implements this protocol. At each time step t:

1. The model produces a forecast of h future values: [\hat{y}_{t+1}, ..., \hat{y}_{t+h}]
2. The true future values [y_{t+1}, ..., y_{t+h}] are available (via a look-ahead buffer)
3. A HorizonMetric updates separate metric instances for each horizon step
4. The model learns from the current observation via learn_one(y_t, x_t)

This approach provides a realistic assessment of how a forecaster would perform in production, where it must continuously make predictions on unseen data while adapting to new observations.

The evaluation supports:

Per-horizon-step metrics: Separate performance measurement at each forecast distance (+1, +2, ..., +h)
Aggregated metrics: Optional aggregation function (e.g., mean) to collapse per-step metrics into a single scalar
Grace period: Initial warmup period during which the metric is not updated, allowing models to accumulate enough history for meaningful predictions

Usage

Use time series evaluation when:

You need to measure how well a forecaster performs at different prediction horizons
You want a fair comparison between forecasters on streaming data
You need to understand whether a model's accuracy degrades as the forecast horizon increases
You want to assess model performance after a warmup period

Theoretical Basis

Walk-Forward Validation Protocol

The evaluation follows this pseudocode:

Input: dataset, model, metric, horizon h, grace_period g

1. Buffer the first h observations in a look-ahead window
2. For each step t during the grace period (g steps):
   a. model.learn_one(y_t, x_t)    # Warmup only, no metric update
3. For each subsequent step t:
   a. y_pred = model.forecast(h)     # Produce h-step forecast
   b. horizon_metric.update(y_true=[y_{t+1},...,y_{t+h}], y_pred=y_pred)
   c. model.learn_one(y_t, x_t)     # Update model AFTER evaluation

Return: horizon_metric
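
The pseudocode above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not River's implementation: LastValueForecaster and walk_forward are hypothetical names, the series is held in an in-memory list rather than a stream, there are no exogenous features, and MAE is hard-coded as the metric.

```python
class LastValueForecaster:
    """Hypothetical naive model: repeats the last learned value for every step."""

    def __init__(self):
        self.last = 0.0

    def learn_one(self, y, x=None):
        self.last = y

    def forecast(self, horizon, xs=None):
        return [self.last] * horizon


def walk_forward(ys, model, horizon, grace_period=None):
    """Return the MAE at each horizon step (+1 .. +h) for a list of values."""
    # Assumption taken from the text: the grace period defaults to the horizon.
    grace_period = horizon if grace_period is None else grace_period
    abs_err = [0.0] * horizon  # running sum of absolute errors per step
    n_evals = 0

    # Stop h steps before the end so a full set of future values always exists.
    for t in range(len(ys) - horizon):
        if t >= grace_period:
            y_pred = model.forecast(horizon)       # forecast before learning y_t
            y_true = ys[t + 1 : t + 1 + horizon]   # y_{t+1} .. y_{t+h}
            for i in range(horizon):
                abs_err[i] += abs(y_true[i] - y_pred[i])
            n_evals += 1
        model.learn_one(ys[t])                     # update after evaluation

    return [e / n_evals for e in abs_err] if n_evals else []


series = [float(i) for i in range(10)]  # toy upward trend, illustrative only
scores = walk_forward(series, LastValueForecaster(), horizon=2)
```

On a steadily rising series the naive model falls further behind at each extra step, so scores[1] exceeds scores[0].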

Key design decisions:

Evaluate before learn: The model is evaluated on data it has not yet seen, then updated. This prevents information leakage.
Grace period: Defaults to the horizon value if not specified. This gives models such as SNARIMAX, which need enough past observations to fill their lag buffers, time to warm up before they are scored.
Look-ahead buffer: A sliding window of size h maintains the future true values needed for evaluation at each step.
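
For streaming input, where the series cannot be indexed, the look-ahead buffer might be sketched with a bounded deque. The generator name iter_with_lookahead is hypothetical, and this is only one plausible way to build such a buffer, shown for a univariate stream without features.

```python
from collections import deque
from itertools import islice


def iter_with_lookahead(ys, horizon):
    """Yield (y_now, future) pairs; future holds the next `horizon` true values."""
    stream = iter(ys)
    # Pre-fill the window with the first h observations.
    window = deque(islice(stream, horizon), maxlen=horizon)
    for y_new in stream:
        y_now = window.popleft()  # observation at time t
        window.append(y_new)      # window now holds y_{t+1} .. y_{t+h}
        yield y_now, list(window)


pairs = list(iter_with_lookahead([10, 20, 30, 40, 50], horizon=2))
```

One consequence of this design is that the final h observations are only ever used as targets: they never appear as y_now, so the model does not learn from them.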

Per-Horizon-Step Metrics

The evaluation uses HorizonMetric to maintain separate metric instances for each step of the forecast horizon. If the base metric is MAE:

+1  MAE: value_1     (1-step-ahead accuracy)
+2  MAE: value_2     (2-step-ahead accuracy)
...
+h  MAE: value_h     (h-step-ahead accuracy)

Error metrics typically worsen as the horizon step increases, because forecast errors compound over longer distances.
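
To make the per-step bookkeeping concrete, here is a small sketch of a container that keeps one running MAE per horizon step. HorizonMAE is a hypothetical stand-in written for this page; it is not River's HorizonMetric, and the numbers fed to it are made up for illustration.

```python
class HorizonMAE:
    """Hypothetical per-step MAE tracker (one running average per horizon step)."""

    def __init__(self):
        self.sums = []    # sums[i]: total absolute error at step +(i + 1)
        self.counts = []  # counts[i]: number of updates seen at that step

    def update(self, y_true, y_pred):
        for i, (yt, yp) in enumerate(zip(y_true, y_pred)):
            if i == len(self.sums):  # first time this step is seen
                self.sums.append(0.0)
                self.counts.append(0)
            self.sums[i] += abs(yt - yp)
            self.counts[i] += 1

    def get(self):
        return [s / c for s, c in zip(self.sums, self.counts)]

    def __repr__(self):
        return "\n".join(f"+{i + 1} MAE: {v:.3f}" for i, v in enumerate(self.get()))


metric = HorizonMAE()
metric.update(y_true=[3.0, 4.0, 5.0], y_pred=[2.5, 3.0, 3.0])
metric.update(y_true=[4.0, 5.0, 6.0], y_pred=[3.5, 4.0, 3.0])
per_step = metric.get()
```

Each call to update scores one forecast against its matching future values, so step +1 and step +3 never share an accumulator.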

Aggregated Metrics

When an agg_func (e.g., statistics.mean) is provided, the evaluation produces a single scalar summarizing performance across all horizon steps. This is useful for hyperparameter tuning or model selection.
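
As a rough illustration of how an aggregation function collapses per-step values into one score for model selection, consider the sketch below. The helper aggregate, the model names, and all numbers are hypothetical; how River applies agg_func internally is not shown here.

```python
import statistics


def aggregate(per_step_values, agg_func=statistics.mean):
    """Collapse per-horizon-step metric values into a single scalar."""
    return agg_func(per_step_values)


# Made-up per-step MAE values for two candidate models (illustrative only).
candidates = {
    "model_a": [0.5, 1.0, 2.5],
    "model_b": [0.8, 1.0, 1.2],
}

mean_scores = {name: aggregate(vals) for name, vals in candidates.items()}
worst_case = {name: aggregate(vals, max) for name, vals in candidates.items()}

# Lower MAE is better, so pick the candidate with the smallest aggregate.
best_by_mean = min(mean_scores, key=mean_scores.get)
```

The choice of aggregation matters: a mean rewards good average accuracy across the horizon, while max penalizes the single worst step.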

Related Pages

Implementation:Online_ml_River_Time_Series_Evaluate_Func
