Principle:Online ml River Online Optimizers

Knowledge Sources	Convex Optimization Deep Learning
Domains	Online_Learning, Optimization
Last Updated	2026-02-08 18:00 GMT

Overview

Online optimizers are algorithms that update model parameters incrementally using gradient information computed from one sample (or a small batch) at a time. They form the computational backbone of online learning systems, enabling models to adapt continuously without storing or revisiting past data.

The field has evolved from vanilla stochastic gradient descent to a rich ecosystem of adaptive methods that maintain per-parameter learning rates, incorporate momentum, and exploit second-order curvature information.

Theoretical Basis

Stochastic Gradient Descent (SGD)

The simplest online optimizer updates parameters w using the gradient of the loss on the current sample:

w_{t+1} = w_t - eta * g_t

where g_t = nabla L(w_t; x_t, y_t) and eta is the learning rate.
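
As a minimal sketch of the update above, the following applies SGD to a one-feature linear model with squared loss. The helper names (squared_loss_grad, sgd_step) and the toy stream are assumptions made for illustration; they are not taken from any library.

```python
def squared_loss_grad(w, x, y):
    """Gradient of 0.5 * (w . x - y)**2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def sgd_step(w, g, eta):
    """One update w_{t+1} = w_t - eta * g_t."""
    return [wi - eta * gi for wi, gi in zip(w, g)]


# Hypothetical stream generated by y = 2 * x, repeated to mimic a long stream.
stream = [([1.0], 2.0), ([2.0], 4.0), ([0.5], 1.0)] * 50

w = [0.0]
for x, y in stream:  # each sample is used once and then discarded
    w = sgd_step(w, squared_loss_grad(w, x, y), eta=0.1)

print(w)  # the single weight should end up close to 2
```

The loop never stores past samples: the only state carried between steps is the weight vector itself.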

Momentum Methods

Momentum accumulates an exponentially decaying sum of past gradients, controlled by a decay factor beta with 0 <= beta < 1, to smooth updates and accelerate convergence:

v_{t+1} = beta * v_t + g_t
w_{t+1} = w_t - eta * v_{t+1}

Nesterov momentum evaluates the gradient at the lookahead position w_t - eta * beta * v_t rather than at w_t. On smooth convex problems with exact gradients this yields a provably faster convergence rate than plain gradient descent.
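
The two momentum variants differ only in where the gradient is evaluated, which the sketch below tries to make explicit. The function momentum_step, the toy quadratic objective, and the chosen eta and beta are illustrative assumptions.

```python
def momentum_step(w, v, grad_fn, eta, beta, nesterov=False):
    """One momentum update; grad_fn maps a parameter list to a gradient list."""
    # Nesterov evaluates the gradient at the lookahead point w - eta * beta * v.
    point = [wi - eta * beta * vi for wi, vi in zip(w, v)] if nesterov else w
    g = grad_fn(point)
    v = [beta * vi + gi for vi, gi in zip(v, g)]  # v_{t+1} = beta * v_t + g_t
    w = [wi - eta * vi for wi, vi in zip(w, v)]   # w_{t+1} = w_t - eta * v_{t+1}
    return w, v


def quad_grad(w):
    """Gradient of the toy objective 0.5 * (w0**2 + 10 * w1**2)."""
    return [w[0], 10.0 * w[1]]


def run(nesterov, steps=500):
    w, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        w, v = momentum_step(w, v, quad_grad, eta=0.05, beta=0.9, nesterov=nesterov)
    return w


w_classic = run(nesterov=False)
w_nesterov = run(nesterov=True)
```

Both runs approach the minimizer at the origin; the quadratic uses exact gradients, so it illustrates the mechanics rather than behavior on a noisy stream.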

Adaptive Learning Rate Methods

AdaGrad: Scales each parameter's learning rate by the inverse square root of its accumulated squared gradients. Effective for sparse features, but the effective learning rate decreases monotonically and can become very small on long streams.
RMSProp: Uses an exponential moving average of squared gradients, preventing the aggressive decay of AdaGrad.
Adam: Combines momentum (first moment) with adaptive learning rates (second moment), with bias correction for the initial steps.
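AdaMax: Replaces the L2-norm-based second-moment estimate in Adam with an exponentially weighted L-infinity norm (a decayed running maximum of gradient magnitudes), which can give more stable updates.
AMSGrad: Fixes a convergence issue in Adam by normalizing with the running maximum of past second-moment estimates, so the effective learning rate never increases.
AdaDelta: Eliminates the need for a global learning rate by using the ratio of the root mean square of recent parameter updates to the root mean square of recent gradients as the step size.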
AdaBound: Clips adaptive learning rates to a dynamically narrowing range, transitioning from Adam-like to SGD-like behavior.
Nadam: Incorporates Nesterov momentum into the Adam framework.
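
To make the differences between three of these rules concrete, the sketch below records the step each would take on a single scalar parameter that receives a constant gradient. The function names and default hyperparameters are assumptions chosen for illustration, and the formulas follow the commonly published forms of AdaGrad, RMSProp, and Adam rather than any particular library's implementation.

```python
import math


def adagrad_steps(grads, lr=0.1, eps=1e-8):
    acc, steps = 0.0, []
    for g in grads:
        acc += g * g  # sum of squared gradients, never shrinks
        steps.append(lr * g / (math.sqrt(acc) + eps))
    return steps


def rmsprop_steps(grads, lr=0.1, rho=0.9, eps=1e-8):
    avg, steps = 0.0, []
    for g in grads:
        avg = rho * avg + (1 - rho) * g * g  # exponential moving average
        steps.append(lr * g / (math.sqrt(avg) + eps))
    return steps


def adam_steps(grads, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v, steps = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g      # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g  # second moment
        m_hat = m / (1 - beta1 ** t)         # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps


grads = [1.0] * 1000
ada = adagrad_steps(grads)
rms = rmsprop_steps(grads)
adam = adam_steps(grads)
```

With a constant gradient, the AdaGrad step shrinks roughly like lr / sqrt(t), the RMSProp step settles near lr, and the bias-corrected Adam step stays near lr from the first iteration.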

Second-Order Methods

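Newton's method uses the inverse Hessian to achieve faster convergence, updating as w_{t+1} = w_t - H_t^{-1} * g_t. Because storing the full Hessian takes O(d^2) memory for d parameters and inverting it is more expensive still, online approximations maintain a diagonal or low-rank estimate of the Hessian.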
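
With a diagonal Hessian, the Newton update reduces to dividing each gradient component by the matching curvature. The sketch below assumes the diagonal is known exactly for a toy quadratic; a real online method would have to estimate it from the stream. All names are illustrative.

```python
def newton_step_diag(w, g, h_diag):
    """w_{t+1} = w_t - H^{-1} g_t when H is diagonal with entries h_diag."""
    return [wi - gi / hi for wi, gi, hi in zip(w, g, h_diag)]


def gd_step(w, g, eta):
    return [wi - eta * gi for wi, gi in zip(w, g)]


# Toy objective: 0.5 * sum_i h_i * (w_i - c_i)**2, with very different curvatures.
h = [1.0, 100.0]
c = [3.0, -2.0]


def grad(w):
    return [hi * (wi - ci) for hi, wi, ci in zip(h, w, c)]


w0 = [0.0, 0.0]
w_newton = newton_step_diag(w0, grad(w0), h)  # lands on the minimizer c
w_gd = gd_step(w0, grad(w0), eta=0.01)        # one eta cannot suit both axes
```

The single gradient step with eta = 0.01 happens to solve the high-curvature coordinate but barely moves the low-curvature one, which is the imbalance that curvature information corrects.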

FTRL-Proximal

Follow The Regularized Leader is designed for online learning with sparsity-inducing regularization (L1). It produces sparse weight vectors, making it popular for large-scale online advertising and recommendation.
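
The sparsity comes from a closed-form weight rule that returns exactly zero whenever the accumulated statistic for a coordinate stays within the L1 threshold. The following is a minimal sketch of the per-coordinate FTRL-Proximal update for logistic regression as it is commonly published; the class name TinyFTRL, its method names, the hyperparameter values, and the toy stream are assumptions for illustration and do not reproduce any library's API.

```python
import math


class TinyFTRL:
    def __init__(self, alpha=0.1, beta=1.0, l1=2.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature accumulated (adjusted) gradients
        self.n = {}  # per-feature sum of squared gradients

    def weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 threshold: the weight is exactly zero
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n.get(i, 0.0))) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict_proba(self, x):
        score = sum(self.weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, x, y):
        p = self.predict_proba(x)
        for i, v in x.items():
            g = (p - y) * v  # gradient of the log loss for feature i
            n_old = self.n.get(i, 0.0)
            sigma = (math.sqrt(n_old + g * g) - math.sqrt(n_old)) / self.alpha
            # weight(i) is evaluated with the old z and n before they change.
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self.weight(i)
            self.n[i] = n_old + g * g


# Hypothetical stream: "signal" determines the label, "noise" is uncorrelated.
cycle = [
    ({"signal": 1.0, "noise": 1.0}, 1),
    ({"signal": 1.0, "noise": -1.0}, 1),
    ({"signal": -1.0, "noise": 1.0}, 0),
    ({"signal": -1.0, "noise": -1.0}, 0),
]
model = TinyFTRL()
for x, y in cycle * 100:
    model.update(x, y)

weights = {name: model.weight(name) for name in ("signal", "noise")}
```

On this stream the gradients for the uninformative feature largely cancel, so its accumulated statistic never crosses the L1 threshold and its weight remains zero, while the informative feature receives a positive weight.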

Weight Averaging

The Averager maintains a running average of all past parameter iterates, which can reduce variance and improve generalization compared to the final iterate.
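
A running average of iterates can be kept in constant memory with an incremental mean. The sketch below is a generic illustration of that idea, not the Averager's actual implementation; the class RunningAverage and the alternating toy targets are assumptions.

```python
class RunningAverage:
    """Incremental mean of parameter vectors: mean_t = mean_{t-1} + (w_t - mean_{t-1}) / t."""

    def __init__(self):
        self.t = 0
        self.mean = None

    def update(self, w):
        self.t += 1
        if self.mean is None:
            self.mean = list(w)
        else:
            self.mean = [m + (wi - m) / self.t for m, wi in zip(self.mean, w)]
        return self.mean


# SGD on 0.5 * (w - y)**2 with targets alternating around 2, so iterates oscillate.
targets = [1.0, 3.0] * 500
w = [2.0]
avg = RunningAverage()
for y in targets:
    g = [w[0] - y]
    w = [w[0] - 0.5 * g[0]]  # plain SGD step with eta = 0.5
    avg.update(w)

final_iterate, averaged = w[0], avg.mean[0]
```

Here the final iterate keeps bouncing about a third of a unit away from 2, while the averaged value sits much closer to it, which is the variance reduction described above.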
