Principle:Online ml River Time Series Evaluation
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River, River Docs | Online Machine Learning, Time Series Forecasting, Model Evaluation | 2026-02-08 16:00 GMT |
Overview
Evaluation protocol for streaming time series forecasters that computes per-horizon-step metrics using a walk-forward validation approach.
Description
Evaluating online forecasting models requires a fundamentally different approach from batch evaluation. Instead of a single train/test split, walk-forward validation (also known as progressive validation) evaluates the model at every time step: the model produces a multi-step forecast, the true future values are observed, the per-step metrics are updated, and then the model learns from the current observation.
River's time_series.evaluate function implements this protocol. At each time step t:
- The model produces a forecast of h future values: [\hat{y}_{t+1}, ..., \hat{y}_{t+h}]
- The true future values [y_{t+1}, ..., y_{t+h}] are available (via a look-ahead buffer)
- A HorizonMetric updates separate metric instances for each horizon step
- The model learns from the current observation via learn_one(y_t, x_t)
This approach provides a realistic assessment of how a forecaster would perform in production, where it must continuously make predictions on unseen data while adapting to new observations.
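The look-ahead buffer mentioned in the steps above can be sketched in plain Python. This is a minimal illustration, not River's own implementation: the function name `iter_with_horizon` and its signature are assumptions made for this example, and features (x_t) are omitted for brevity.

```python
from collections import deque
from itertools import islice


def iter_with_horizon(stream, horizon):
    """Yield (y_now, y_future) pairs from an iterable of observations.

    y_future holds the `horizon` values that follow y_now in the stream.
    (Hypothetical helper for illustration only.)
    """
    it = iter(stream)
    # Pre-fill the window with the current value plus `horizon` future values.
    window = deque(islice(it, horizon + 1), maxlen=horizon + 1)
    if len(window) < horizon + 1:
        return  # stream too short to form a single look-ahead window

    while True:
        y_now, *y_future = window
        yield y_now, y_future
        try:
            window.append(next(it))  # slide forward; the oldest value drops out
        except StopIteration:
            return


pairs = list(iter_with_horizon([1, 2, 3, 4, 5], horizon=2))
for y_now, y_future in pairs:
    print(y_now, y_future)
```

Because the window needs h future values, the last h observations of a finite stream never appear as the "current" observation; they are only used as ground truth.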
The evaluation supports:
- Per-horizon-step metrics: Separate performance measurement at each forecast distance (+1, +2, ..., +h)
- Aggregated metrics: Optional aggregation function (e.g., mean) to collapse per-step metrics into a single scalar
- Grace period: Initial warmup period during which the metric is not updated, allowing models to accumulate enough history for meaningful predictions
Usage
Use time series evaluation when:
- You need to measure how well a forecaster performs at different prediction horizons
- You want a fair comparison between forecasters on streaming data
- You need to understand whether a model's accuracy degrades as the forecast horizon increases
- You want to assess model performance after a warmup period
Theoretical Basis
Walk-Forward Validation Protocol
The evaluation follows this pseudocode:
Input: dataset, model, metric, horizon h, grace_period g
1. Buffer the first h observations in a look-ahead window
2. For each step t during the grace period (g steps):
   a. model.learn_one(y_t, x_t)    # Warmup only, no metric update
3. For each subsequent step t:
   a. y_pred = model.forecast(h)    # Produce h-step forecast
   b. horizon_metric.update(y_true=[y_{t+1},...,y_{t+h}], y_pred)
   c. model.learn_one(y_t, x_t)    # Update model AFTER evaluation
Return: horizon_metric
Key design decisions:
- Evaluate before learn: The model is evaluated on data it has not yet seen, then updated. This prevents information leakage.
- Grace period: Defaults to the horizon value if not specified. This gives models such as SNARIMAX, which must fill their lag buffers before they can produce meaningful forecasts, time to warm up.
- Look-ahead buffer: A sliding window of size h maintains the future true values needed for evaluation at each step.
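The protocol and design decisions above can be sketched as a short, dependency-free loop. This is an illustration under stated assumptions, not River's code: `LastValueForecaster` and `walk_forward` are hypothetical names, the series is an in-memory list (so slicing stands in for the look-ahead buffer), features are ignored, and MAE is hard-coded as the metric.

```python
class LastValueForecaster:
    """Hypothetical baseline: repeats the most recently learned value."""

    def __init__(self):
        self.last = 0.0

    def learn_one(self, y, x=None):
        self.last = y

    def forecast(self, horizon):
        return [self.last] * horizon


def walk_forward(series, model, horizon, grace_period=None):
    """Return the per-horizon-step MAE as a list of length `horizon`."""
    # As described above, the grace period defaults to the horizon.
    grace_period = horizon if grace_period is None else grace_period
    abs_err_sums = [0.0] * horizon
    n_evals = 0

    # Stop early enough that h true future values always exist.
    for t in range(len(series) - horizon):
        y_now = series[t]
        y_future = series[t + 1 : t + 1 + horizon]

        if t >= grace_period:
            # Evaluate first: the model has not yet learned y_now.
            y_pred = model.forecast(horizon)
            for k, (y_true, y_hat) in enumerate(zip(y_future, y_pred)):
                abs_err_sums[k] += abs(y_true - y_hat)
            n_evals += 1

        # Learn only after the metric update (or during warmup).
        model.learn_one(y_now)

    return [s / n_evals for s in abs_err_sums] if n_evals else []


mae_per_step = walk_forward([1, 2, 3, 4, 5, 6], LastValueForecaster(), horizon=2)
print(mae_per_step)
```

On this linear toy series the error grows with the step index, since a constant forecast falls further behind the trend at each additional step.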
Per-Horizon-Step Metrics
The evaluation uses HorizonMetric to maintain separate metric instances for each step of the forecast horizon. If the base metric is MAE:
+1 MAE: value_1 (1-step-ahead accuracy)
+2 MAE: value_2 (2-step-ahead accuracy)
...
+h MAE: value_h (h-step-ahead accuracy)
Error metrics such as MAE typically grow with horizon distance, because forecast uncertainty compounds over successive steps.
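To make the idea of one metric instance per horizon step concrete, here is a small stand-in written in plain Python. It is a sketch only: the class name `PerHorizonMAE` and its methods are assumptions for this example and do not reproduce River's HorizonMetric.

```python
class PerHorizonMAE:
    """Hypothetical tracker keeping one running MAE per forecast step."""

    def __init__(self):
        self.sums = []    # sums[k]: total absolute error at step +(k+1)
        self.counts = []  # counts[k]: number of updates seen at that step

    def update(self, y_true, y_pred):
        for k, (yt, yp) in enumerate(zip(y_true, y_pred)):
            if k == len(self.sums):  # first time this step is seen
                self.sums.append(0.0)
                self.counts.append(0)
            self.sums[k] += abs(yt - yp)
            self.counts[k] += 1
        return self

    def get(self):
        return [s / c for s, c in zip(self.sums, self.counts)]

    def __repr__(self):
        return "\n".join(
            f"+{k} MAE: {v:.3f}" for k, v in enumerate(self.get(), start=1)
        )


metric = PerHorizonMAE()
metric.update(y_true=[10, 12, 14], y_pred=[9, 10, 11])   # errors 1, 2, 3
metric.update(y_true=[12, 14, 16], y_pred=[12, 13, 13])  # errors 0, 1, 3
print(metric)
```

Each step's error is averaged independently, so the +1 value is never diluted by the harder +h forecasts.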
Aggregated Metrics
When an agg_func (e.g., statistics.mean) is provided, the evaluation produces a single scalar summarizing performance across all horizon steps. This is useful for hyperparameter tuning or model selection.
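As a rough illustration of aggregation, the snippet below collapses a list of per-step values into one scalar and uses it to pick between two candidates. The helper `aggregate_horizon`, the candidate names, and all numbers are made up for this example, and it assumes the aggregation function is applied directly to the list of per-step metric values.

```python
import statistics


def aggregate_horizon(per_step_values, agg_func=statistics.mean):
    """Collapse per-horizon-step metric values into a single scalar."""
    return agg_func(per_step_values)


# Illustrative per-step MAE values (+1, +2, +3) for two hypothetical models.
candidates = {
    "model_a": [0.5, 1.5, 3.0],
    "model_b": [1.0, 1.2, 1.4],
}

mean_scores = {name: aggregate_horizon(v) for name, v in candidates.items()}
worst_case = {name: aggregate_horizon(v, max) for name, v in candidates.items()}

# Lower MAE is better, so select the candidate with the smallest mean.
best = min(mean_scores, key=mean_scores.get)
print(mean_scores, worst_case, best)
```

The choice of aggregation changes what is rewarded: a mean favors good average accuracy across the horizon, while `max` penalizes the worst step. Here model_a is better at +1 but loses on both aggregates.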