
Principle:Online ML River Feature Standardization

From Leeroopedia


Knowledge Sources: River, River Docs
Domains: Online_Learning, Feature_Engineering, Statistics
Last Updated: 2026-02-08 16:00 GMT

Overview

Feature standardization is a statistical technique that transforms features to have zero mean and unit variance using incrementally maintained running statistics.

Description

Many machine learning algorithms, particularly gradient-based methods like logistic regression and neural networks, are sensitive to the scale of input features. Features with large magnitudes can dominate the gradient, leading to slow convergence or numerical instability. Feature standardization addresses this by transforming each feature so that it has a mean of zero and a standard deviation of one.

In the batch setting, standardization is straightforward: compute the mean and standard deviation over the entire dataset, then subtract the mean and divide by the standard deviation. In the online (streaming) setting, however, the full dataset is never available at once. Instead, running statistics must be maintained and updated incrementally as each new observation arrives.

River's StandardScaler uses Welford's online algorithm to maintain running estimates of the mean and variance for each feature. This algorithm is numerically stable and computes exact results (not approximations) given the data seen so far. The transformation at any point in time uses the current running statistics, which means early predictions may be less accurate but converge to batch-equivalent quality as more data is observed.

The scaler also supports mini-batch updates, where a group of observations is used to update the running statistics simultaneously. This uses a batch-compatible update formula that correctly merges the statistics of the existing state with the statistics of the incoming batch.

Usage

Use feature standardization when:

  • You are using gradient-based models (logistic regression, linear regression, neural networks) that are sensitive to feature scale.
  • Features have different units or ranges (e.g., age in years vs. income in dollars).
  • You want to improve the convergence speed of stochastic gradient descent.
  • You are building a pipeline where the scaler precedes a classifier or regressor.

Theoretical Basis

Welford's online algorithm computes the running mean and variance in a single pass with O(1) memory per feature. For each feature, after observing the n-th value x_n of that feature:

Mean update:

mean_new = mean_old + (x_n - mean_old) / n

Variance update (using the online formula):

var_new = var_old + ((x_n - mean_old) * (x_n - mean_new) - var_old) / n

Note that var here is the population variance (not the sample variance). The standard deviation is then σ = √var.
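These per-observation updates can be sketched in a few lines of plain Python. This is an illustrative re-implementation of the formulas above for a single feature, not River's actual source code:

```python
# Welford-style running mean and population variance for one feature.
# Illustrative sketch of the update formulas above, not River's code.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var = 0.0  # population variance of the values seen so far

    def update(self, x):
        self.n += 1
        old_mean = self.mean
        # Mean update: mean_new = mean_old + (x_n - mean_old) / n
        self.mean += (x - old_mean) / self.n
        # Variance update: var_new = var_old + ((x_n - mean_old) * (x_n - mean_new) - var_old) / n
        self.var += ((x - old_mean) * (x - self.mean) - self.var) / self.n

stats = RunningStats()
for x in [2.0, 4.0, 6.0, 8.0]:
    stats.update(x)
# After the stream: mean = 5.0, population variance = 5.0
```

The result matches a batch computation over the same four values, consistent with the exactness of Welford's algorithm.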

Standardization transform:

z = (x - mean) / std

If the standard deviation is zero (feature has no variance), the transformed value is set to 0.0 to avoid division by zero.
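A minimal sketch of the transform, including the zero-variance guard described above (illustrative, not River's source):

```python
def standardize(x, mean, var):
    """Standardize one value; a zero-variance feature maps to 0.0."""
    std = var ** 0.5
    if std == 0.0:
        return 0.0  # constant feature: avoid division by zero
    return (x - mean) / std

standardize(6.0, 5.0, 4.0)  # -> 0.5
standardize(7.0, 7.0, 0.0)  # -> 0.0 (no variance observed yet)
```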

Mini-batch update: When a batch of m new observations arrives with batch mean mean_new and batch variance var_new, the combined statistics are:

a = n_old / (n_old + m)
b = m / (n_old + m)
mean_combined = a * mean_old + b * mean_new
var_combined  = a * var_old + b * var_new + a * b * (mean_old - mean_new)^2

This formula correctly accounts for the variance introduced by the difference in means between the existing and new data.
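As a sketch (illustrative, not River's implementation), the merge formula can be checked against a single-pass computation over the concatenated data:

```python
def merge_stats(n_old, mean_old, var_old, batch):
    """Merge existing running stats with a new mini-batch's stats."""
    m = len(batch)
    mean_new = sum(batch) / m
    var_new = sum((x - mean_new) ** 2 for x in batch) / m  # population variance
    a = n_old / (n_old + m)
    b = m / (n_old + m)
    mean_c = a * mean_old + b * mean_new
    var_c = a * var_old + b * var_new + a * b * (mean_old - mean_new) ** 2
    return n_old + m, mean_c, var_c

# Stats of [1, 2, 3] merged with batch [4, 5, 6] equal the stats of
# [1, 2, 3, 4, 5, 6] computed in one pass: mean 3.5, variance 17.5/6.
n, mean, var = merge_stats(3, 2.0, 2.0 / 3.0, [4.0, 5.0, 6.0])
```

Without the cross term a·b·(mean_old − mean_new)², merging would ignore the spread between the two group means and underestimate the combined variance.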
