Principle:Online ml River Optimizer Configuration
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Optimization |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Optimizer configuration encompasses the supporting components that govern how online optimizers behave: weight initialization, loss functions, and learning rate schedulers. These components are orthogonal to the choice of optimizer itself but profoundly affect convergence speed, stability, and final model quality.
Proper configuration is especially important in the online setting, where there is no opportunity to restart training or revisit earlier data if the initial configuration is poor.
Theoretical Basis
Weight Initialization
The initial values of model parameters determine the starting point in the loss landscape. Poor initialization can cause:
- Symmetry problems: Identical initial weights lead to identical gradient updates across neurons.
- Vanishing/exploding signals: Very small or very large initial weights cause activations and gradients to shrink or blow up through layers.
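The symmetry problem can be checked on a toy network. The sketch below assumes a hypothetical hidden layer of two tanh units feeding a linear output with squared loss; the function name and the numbers are illustrative and do not come from River.

```python
import math


def hidden_gradients(W, v, x, y):
    """Gradient of (y_hat - y)^2 with respect to each hidden unit's input weights.

    W: one weight vector per hidden unit, v: output weights, x: input, y: target.
    """
    h = [math.tanh(sum(w_i * x_i for w_i, x_i in zip(w, x))) for w in W]
    y_hat = sum(v_j * h_j for v_j, h_j in zip(v, h))
    err = 2.0 * (y_hat - y)
    # Chain rule: dL/dw_j = err * v_j * (1 - h_j^2) * x
    return [[err * v_j * (1.0 - h_j ** 2) * x_i for x_i in x]
            for v_j, h_j in zip(v, h)]


x, y = [1.0, -2.0], 0.5

# Identical initial weights: both units receive the same gradient.
same = hidden_gradients([[0.3, 0.3], [0.3, 0.3]], [0.1, 0.1], x, y)

# Distinct initial weights: the gradients differ, so the units can specialize.
diff = hidden_gradients([[0.3, 0.3], [-0.2, 0.7]], [0.1, 0.1], x, y)
```

With identical starting weights the two gradient vectors in `same` are equal, so the units stay copies of each other after every update; in `diff` they are not.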
Common strategies include:
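- Zero initialization: Simple and adequate for linear models, which have no hidden units and therefore no symmetry to break; in multi-layer networks it leaves all hidden units identical.
- Uniform/Normal random: Draws each weight from a uniform or normal distribution; variants such as Xavier and He initialization scale the spread to the layer dimensions.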
- Constant initialization: Sets all parameters to a fixed non-zero value.
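The strategies above can be sketched in a few lines of pure Python. The function names and signatures here are illustrative assumptions, not River's API; the Xavier variant assumes the usual uniform limit sqrt(6 / (fan_in + fan_out)).

```python
import math
import random


def zeros(n):
    """Zero initialization: every weight starts at 0.0."""
    return [0.0] * n


def constant(n, value):
    """Constant initialization: every weight starts at the same fixed value."""
    return [value] * n


def normal(n, mu=0.0, sigma=1.0, seed=None):
    """Normal random initialization with mean mu and standard deviation sigma."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def xavier_uniform(fan_in, fan_out, seed=None):
    """Xavier (Glorot) uniform: the range shrinks as the layer gets wider."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

Passing a `seed` makes the random initializers reproducible, which helps when comparing configurations on the same stream.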
Loss Functions
The loss function L(y_hat, y) quantifies prediction error and defines the gradient signal used by the optimizer. Standard choices include:
- Squared loss: (y_hat - y)^2 -- standard for regression; sensitive to outliers.
- Absolute loss: |y_hat - y| -- more robust to outliers but non-differentiable where y_hat = y, so a subgradient is used there.
- Cross-entropy loss: -sum_k y_k * log(y_hat_k) -- standard for classification; measures divergence between predicted and true class distributions.
- Hinge loss: max(0, 1 - y * y_hat), with labels y in {-1, +1} -- used in support vector machines; encourages a margin.
- Log loss: log(1 + exp(-y * y_hat)), with labels y in {-1, +1} -- the logistic loss; a smooth surrogate with a shape similar to hinge loss.
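The formulas above translate directly into code. The following is a minimal sketch, not River's implementation: each scalar loss returns a `(loss, gradient)` pair, where the gradient is taken with respect to `y_hat`, and the function names are chosen for illustration only.

```python
import math


def squared(y_hat, y):
    return (y_hat - y) ** 2, 2.0 * (y_hat - y)


def absolute(y_hat, y):
    r = y_hat - y
    # Subgradient: 0 is a valid choice at the kink r == 0.
    return abs(r), (r > 0) - (r < 0)


def hinge(y_hat, y):
    # y is expected to be -1 or +1.
    margin = y * y_hat
    return max(0.0, 1.0 - margin), (-y if margin < 1.0 else 0.0)


def logistic(y_hat, y):
    # y is expected to be -1 or +1.
    m = y * y_hat
    # log(1 + exp(-m)), rearranged so exp never receives a large positive argument.
    loss = max(-m, 0.0) + math.log1p(math.exp(-abs(m)))
    # d/dy_hat = -y / (1 + exp(m)), with the same overflow precaution.
    if m >= 0:
        e = math.exp(-m)
        grad = -y * e / (1.0 + e)
    else:
        grad = -y / (1.0 + math.exp(m))
    return loss, grad


def cross_entropy(p_hat, y, eps=1e-12):
    # p_hat: predicted class probabilities; y: one-hot (or soft) targets.
    # Probabilities are clipped at eps to avoid log(0).
    return -sum(y_k * math.log(max(p_k, eps)) for p_k, y_k in zip(p_hat, y))
```

The hinge gradient is exactly zero once the margin reaches 1, whereas the logistic gradient only approaches zero, which is one practical difference between the two surrogates.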
Learning Rate Scheduling
A fixed learning rate is rarely optimal. Schedulers adjust the learning rate eta_t over time:
- Constant: eta_t = eta_0 -- simplest, but a single value tends to be either too small for fast early progress or too large to settle near the optimum later.
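- Inverse scaling: eta_t = eta_0 / t^p -- decays as a power law; satisfies the Robbins-Monro conditions when 1/2 < p <= 1.
- Exponential decay: eta_t = eta_0 * gamma^t, with 0 < gamma < 1 -- geometric reduction at each step; the step sizes sum to a finite total, so the first Robbins-Monro condition does not hold.
- Optimal (Bottou): eta_t = eta_0 / (1 + eta_0 * lambda * t) -- decays like 1/t, where lambda is the regularization strength; theoretically grounded for strongly convex losses.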
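The Robbins-Monro conditions for convergence require that sum_t eta_t = infinity, so the steps remain large enough to reach the optimum from any starting point, and sum_t eta_t^2 < infinity, so the steps shrink fast enough to average out gradient noise.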
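As a rough illustration, the four schedules can be written as small closures over the step index t (starting at 1), and the two Robbins-Monro sums can be inspected numerically. The names below, including `partial_sums`, are hypothetical helpers for this sketch and are not part of River.

```python
def constant(eta0):
    return lambda t: eta0


def inverse_scaling(eta0, p=0.5):
    # t is assumed to start at 1 to avoid division by zero.
    return lambda t: eta0 / t ** p


def exponential_decay(eta0, gamma):
    return lambda t: eta0 * gamma ** t


def optimal(eta0, lam):
    return lambda t: eta0 / (1.0 + eta0 * lam * t)


def partial_sums(schedule, n):
    """Return (sum of eta_t, sum of eta_t^2) over t = 1..n."""
    rates = [schedule(t) for t in range(1, n + 1)]
    return sum(rates), sum(r * r for r in rates)


# Compare how the two sums behave as more steps are included.
for n in (100, 10_000):
    print(n,
          partial_sums(inverse_scaling(1.0, p=1.0), n),
          partial_sums(exponential_decay(1.0, 0.5), n))
```

For inverse scaling with p = 1 the first sum keeps growing with n while the second levels off, which matches both conditions; for exponential decay both sums level off, so the first condition fails.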