Principle:Online ml River Online Optimizers
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Optimization |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Online optimizers are algorithms that update model parameters incrementally using gradient information computed from one sample (or a small batch) at a time. They form the computational backbone of online learning systems, enabling models to adapt continuously without storing or revisiting past data.
The field has evolved from vanilla stochastic gradient descent to a rich ecosystem of adaptive methods that maintain per-parameter learning rates, incorporate momentum, and exploit second-order curvature information.
Theoretical Basis
Stochastic Gradient Descent (SGD)
The simplest online optimizer updates parameters w using the gradient of the loss on the current sample:
w_{t+1} = w_t - eta * g_t
where g_t = nabla L(w_t; x_t, y_t) and eta is the learning rate.
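The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not River's implementation: the helper names `sgd_step` and `squared_loss_gradient`, the squared loss, and the noise-free toy stream are all assumptions made for the example.

```python
def sgd_step(w, g, eta):
    """One SGD update, w_{t+1} = w_t - eta * g_t, applied element-wise."""
    return [w_i - eta * g_i for w_i, g_i in zip(w, g)]


def squared_loss_gradient(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w (assumed loss)."""
    error = sum(w_i * x_i for w_i, x_i in zip(w, x)) - y
    return [error * x_i for x_i in x]


# Hypothetical noise-free stream generated by y = 2 * x1 - 1 * x2.
stream = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)] * 200

w = [0.0, 0.0]
for x, y in stream:
    # Each sample is seen once, used for one update, then discarded.
    g = squared_loss_gradient(w, x, y)
    w = sgd_step(w, g, eta=0.1)

print(w)  # close to the generating weights [2, -1]
```

Each sample contributes exactly one gradient and is never stored, which is what makes the procedure online.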
Momentum Methods
Momentum accumulates an exponentially decaying sum of past gradients (the velocity v, with decay factor beta in [0, 1)) to smooth updates and accelerate convergence:
v_{t+1} = beta * v_t + g_t
w_{t+1} = w_t - eta * v_{t+1}
Nesterov momentum evaluates the gradient at the lookahead position w_t - eta * beta * v_t instead of at w_t, which improves the convergence rate on smooth convex problems in the deterministic (full-gradient) setting.
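The two variants differ only in where the gradient is evaluated, as the sketch below tries to show. The function names, the toy quadratic objective, and the hyperparameter values are assumptions chosen for illustration and do not come from River.

```python
def momentum_step(w, v, grad_fn, eta, beta):
    """Classical momentum: v <- beta * v + g(w), then w <- w - eta * v."""
    g = grad_fn(w)
    v = [beta * v_i + g_i for v_i, g_i in zip(v, g)]
    w = [w_i - eta * v_i for w_i, v_i in zip(w, v)]
    return w, v


def nesterov_step(w, v, grad_fn, eta, beta):
    """Nesterov momentum: same update, gradient taken at the lookahead point."""
    lookahead = [w_i - eta * beta * v_i for w_i, v_i in zip(w, v)]
    g = grad_fn(lookahead)
    v = [beta * v_i + g_i for v_i, g_i in zip(v, g)]
    w = [w_i - eta * v_i for w_i, v_i in zip(w, v)]
    return w, v


def quadratic_grad(w):
    """Gradient of the toy objective 0.5 * (w0^2 + 10 * w1^2)."""
    return [w[0], 10.0 * w[1]]


def run(step_fn, n_steps=500, eta=0.05, beta=0.9):
    """Apply step_fn repeatedly from a fixed start and return the final weights."""
    w, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(n_steps):
        w, v = step_fn(w, v, quadratic_grad, eta, beta)
    return w


w_momentum = run(momentum_step)
w_nesterov = run(nesterov_step)
```

On this deterministic toy problem both runs approach the minimizer at the origin; with noisy per-sample gradients the behavior depends much more on the choice of eta and beta.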
Adaptive Learning Rate Methods
- AdaGrad: Scales the learning rate by the inverse square root of accumulated squared gradients. Effective for sparse features, but the effective learning rate decreases monotonically and can become too small to make further progress.
- RMSProp: Uses an exponential moving average of squared gradients, preventing the aggressive decay of AdaGrad.
- Adam: Combines momentum (first moment) with adaptive learning rates (second moment), with bias correction for the initial steps.
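- AdaMax: Replaces the L2 norm of gradients in Adam's second-moment estimate with the L-infinity norm (a decayed running maximum of absolute gradients), which can give more stable updates.
- AMSGrad: Fixes a convergence issue in Adam by using the running maximum of past second-moment estimates, so the effective learning rate never increases.
- AdaDelta: Eliminates the need for a global learning rate by using the ratio of the running RMS of past parameter updates to the running RMS of gradients.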
- AdaBound: Clips adaptive learning rates to a dynamically narrowing range, transitioning from Adam-like to SGD-like behavior.
- Nadam: Incorporates Nesterov momentum into the Adam framework.
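To make the differences between three of the methods above concrete, here is a minimal per-coordinate sketch of AdaGrad, RMSProp, and Adam. The function names, the `state` dictionary, and the default hyperparameters are assumptions for illustration; River's own classes may differ in details such as where epsilon is added.

```python
import math


def adagrad_step(w, g, state, eta=0.1, eps=1e-8):
    """AdaGrad: divide by the root of the running SUM of squared gradients."""
    state["G"] = state.get("G", 0.0) + g * g
    return w - eta * g / (math.sqrt(state["G"]) + eps)


def rmsprop_step(w, g, state, eta=0.01, rho=0.9, eps=1e-8):
    """RMSProp: same idea, but with an exponential moving AVERAGE instead of a sum."""
    state["G"] = rho * state.get("G", 0.0) + (1 - rho) * g * g
    return w - eta * g / (math.sqrt(state["G"]) + eps)


def adam_step(w, g, state, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: moving averages of g (first moment) and g^2 (second moment)."""
    t = state["t"] = state.get("t", 0) + 1
    m = state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    v = state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g * g
    # Bias correction: both averages start at zero and are too small early on.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (math.sqrt(v_hat) + eps)


# With a constant gradient, AdaGrad's steps shrink from one update to the next.
ada_state = {}
w0 = 1.0
w1 = adagrad_step(w0, 4.0, ada_state)
w2 = adagrad_step(w1, 4.0, ada_state)
print(w0 - w1, w1 - w2)  # the second step is smaller than the first
```

Thanks to bias correction, the very first Adam step has magnitude close to eta regardless of the gradient scale, which is one reason its default learning rate transfers reasonably well across problems.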
Second-Order Methods
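Newton's method rescales the gradient by the inverse Hessian, updating as w_{t+1} = w_t - H_t^{-1} * g_t, which can converge in far fewer steps than first-order methods near a minimum. Because storing the full Hessian takes O(d^2) memory for d parameters and inverting it is more expensive still, online approximations maintain a diagonal or low-rank estimate of the Hessian.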
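As a rough illustration of the diagonal case, the sketch below applies a Newton-style step in which each coordinate's gradient is divided by its own curvature estimate. The function name, the damping term `eps`, and the toy separable quadratic are assumptions; this is not a description of how River's Newton optimizer estimates the Hessian.

```python
def diagonal_newton_step(w, g, h_diag, eps=1e-8):
    """Newton-style step with a diagonal Hessian estimate (assumed positive).

    eps is a small damping term that guards against division by zero.
    """
    return [w_i - g_i / (h_i + eps) for w_i, g_i, h_i in zip(w, g, h_diag)]


# Toy separable quadratic: f(w) = sum_i 0.5 * h_i * (w_i - c_i)^2.
h = [2.0, 8.0]   # exact per-coordinate curvature
c = [1.0, -3.0]  # minimizer

w = [0.0, 0.0]
g = [h_i * (w_i - c_i) for h_i, w_i, c_i in zip(h, w, c)]
w = diagonal_newton_step(w, g, h)
print(w)  # lands (up to the eps damping) on the minimizer in one step
```

On a quadratic with an exactly diagonal Hessian a single step reaches the minimizer; on real losses the diagonal is only an approximation and the step is usually combined with a learning rate.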
FTRL-Proximal
FTRL-Proximal (Follow The Regularized Leader with a proximal term) is designed for online learning with sparsity-inducing L1 regularization. It produces sparse weight vectors, which makes it popular for large-scale online advertising and recommendation.
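The sparsity comes from a per-coordinate threshold: a weight stays exactly zero until its accumulated gradient signal exceeds the L1 strength. The sketch below follows the commonly published per-coordinate form of the algorithm; the class name and the parameter names `alpha`, `beta`, `l1`, `l2` are assumptions for this example and are not taken from River's API.

```python
import math


class FTRLProximalCoordinate:
    """State and update for a single weight under FTRL-Proximal (sketch)."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = 0.0  # accumulated (adjusted) gradients
        self.n = 0.0  # accumulated squared gradients

    def weight(self):
        # L1 thresholding: small accumulated signal gives an exact zero.
        if abs(self.z) <= self.l1:
            return 0.0
        sign = 1.0 if self.z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n)) / self.alpha + self.l2
        return -(self.z - sign * self.l1) / denom

    def update(self, g):
        w = self.weight()
        # sigma encodes the change in the per-coordinate learning rate.
        sigma = (math.sqrt(self.n + g * g) - math.sqrt(self.n)) / self.alpha
        self.z += g - sigma * w
        self.n += g * g


weak = FTRLProximalCoordinate()
weak.update(0.5)      # |z| = 0.5 stays below l1 = 1.0
strong = FTRLProximalCoordinate()
strong.update(3.0)    # |z| = 3.0 exceeds l1, so the weight becomes non-zero

print(weak.weight(), strong.weight())
```

Because the weight is recomputed from `z` and `n` on demand, coordinates that never accumulate enough signal cost nothing to store as explicit weights.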
Weight Averaging
The Averager maintains a running average of all past parameter iterates, which can reduce variance and improve generalization compared to the final iterate.
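A running average of iterates can be maintained in constant memory with an incremental mean, as in the sketch below. The class name `IterateAverager` is hypothetical and the sketch only tracks the average; it does not show how River's Averager wraps an underlying optimizer.

```python
class IterateAverager:
    """Keeps the mean of all parameter vectors seen so far (sketch)."""

    def __init__(self, dim):
        self.avg = [0.0] * dim
        self.t = 0

    def update(self, w):
        # Incremental mean: avg_t = avg_{t-1} + (w_t - avg_{t-1}) / t
        self.t += 1
        self.avg = [a + (w_i - a) / self.t for a, w_i in zip(self.avg, w)]
        return self.avg


averager = IterateAverager(dim=1)
for iterate in ([1.0], [2.0], [3.0], [6.0]):
    averaged = averager.update(iterate)

print(averaged)  # mean of 1, 2, 3, 6
```

The averaged vector is what would be used for prediction, while the underlying optimizer keeps updating the raw iterate.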
Related Pages
- Implementation:Online_ml_River_Optim_AMSGrad
- Implementation:Online_ml_River_Optim_AdaBound
- Implementation:Online_ml_River_Optim_AdaDelta
- Implementation:Online_ml_River_Optim_AdaGrad
- Implementation:Online_ml_River_Optim_AdaMax
- Implementation:Online_ml_River_Optim_Adam
- Implementation:Online_ml_River_Optim_Averager
- Implementation:Online_ml_River_Optim_FTRLProximal
- Implementation:Online_ml_River_Optim_Momentum
- Implementation:Online_ml_River_Optim_Nadam
- Implementation:Online_ml_River_Optim_NesterovMomentum
- Implementation:Online_ml_River_Optim_Newton
- Implementation:Online_ml_River_Optim_RMSProp
- Implementation:Online_ml_River_Optim_SGD