Principle:Ggml org Ggml Optimizer Configuration
Overview
Optimizer Configuration addresses the setup and parameterization of gradient-based optimization algorithms used during neural network training. In GGML, this principally involves the AdamW optimizer (Adam with decoupled weight decay), though SGD (Stochastic Gradient Descent) is also supported.
The key configurable elements are:
- Learning rate (alpha) -- controls the step size of parameter updates
- Momentum parameters (beta1 / beta2) -- govern the exponential moving averages of the first and second moments of the gradient
- Epsilon (eps) -- a small constant for numerical stability
- Weight decay (wd) -- a regularization term that penalizes large weights independently of the gradient (decoupled from the adaptive learning rate)
Theory
AdamW Optimizer
AdamW is a variant of the Adam optimizer that decouples weight decay from the gradient-based update. At each timestep t, given gradient g_t:
First moment estimate (mean of gradients):
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
Second moment estimate (mean of squared gradients):
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
Bias-corrected estimates:
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)
Parameter update with decoupled weight decay:
theta = theta - lr * (m_hat / (sqrt(v_hat) + eps) + wd * theta)
Where:
- beta1 (default 0.9) -- controls the decay rate of the first moment
- beta2 (default 0.999) -- controls the decay rate of the second moment
- eps (default 1e-8) -- prevents division by zero
- lr (alpha, default 0.001) -- the learning rate
- wd (default 0.0) -- weight decay coefficient
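The update rule above can be sketched in plain Python. This is an illustrative transcription of the formulas, not GGML's implementation or API: the function name `adamw_step` and the list-of-floats representation are assumptions made for the example, and the keyword defaults simply mirror the values listed above.

```python
import math

def adamw_step(theta, grad, m, v, t,
               lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.0):
    """Apply one AdamW update to a list of scalar parameters.

    t is the 1-based step count used for bias correction.
    Returns the new (theta, m, v) lists; the inputs are not modified.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_prev, v_prev in zip(theta, grad, m, v):
        # Exponential moving averages of the gradient and squared gradient.
        m_t = beta1 * m_prev + (1.0 - beta1) * g
        v_t = beta2 * v_prev + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialisation of m and v.
        m_hat = m_t / (1.0 - beta1 ** t)
        v_hat = v_t / (1.0 - beta2 ** t)
        # Decoupled weight decay: wd * p is added outside the adaptive ratio.
        p_new = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * p)
        new_theta.append(p_new)
        new_m.append(m_t)
        new_v.append(v_t)
    return new_theta, new_m, new_v
```

On the first step (t = 1) the bias-corrected ratio m_hat / sqrt(v_hat) has magnitude close to 1 for any nonzero gradient, so each parameter moves by roughly lr regardless of the gradient's scale.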
SGD
Stochastic Gradient Descent is also supported as a simpler alternative optimizer. It keeps no moment estimates: each parameter is moved against its gradient by a step equal to the gradient scaled by the learning rate.
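For comparison with AdamW, a plain SGD step can be sketched as follows. The function name `sgd_step` is hypothetical and the sketch only covers the basic rule theta = theta - lr * g_t; it is not GGML's implementation.

```python
def sgd_step(theta, grad, lr):
    """Return parameters after one plain SGD step: theta - lr * grad."""
    return [p - lr * g for p, g in zip(theta, grad)]
```

Unlike AdamW, the step size here scales directly with the gradient magnitude, so the learning rate usually needs to be tuned to the scale of the gradients.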
Loss Types
The choice of loss function depends on the task:
- Cross-entropy loss (GGML_OPT_LOSS_TYPE_CROSS_ENTROPY) -- used for classification tasks where the model outputs a probability distribution over discrete classes
- Mean squared error loss (GGML_OPT_LOSS_TYPE_MEAN_SQUARED_ERROR) -- used for regression tasks where the model predicts continuous values
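To make the two losses concrete, here is a minimal Python sketch of what each one computes for a single example. The function names are hypothetical, and the cross-entropy version assumes the model emits raw logits and the label is a single class index; GGML's own tensor layout and label encoding may differ.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]

def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(predictions)
```

With two equal logits the cross-entropy is log(2), the loss of a uniform guess over two classes, which is a convenient sanity check when wiring up a classifier.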
Gradient Accumulation
Gradient accumulation allows gradients to be accumulated over multiple mini-batches before a single optimizer step is performed. Because only one mini-batch is processed at a time, this effectively increases the batch size without requiring proportionally more memory, which is particularly useful when training on hardware with limited memory.
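The idea can be sketched in a few lines of Python. The helper name `accumulate_gradients` is hypothetical, and averaging (rather than summing) the per-mini-batch gradients is an assumption chosen so that the result matches the gradient of one large batch of equally sized mini-batches; it is not taken from GGML's code.

```python
def accumulate_gradients(mini_batch_grads):
    """Average per-parameter gradients from several mini-batches.

    mini_batch_grads is a list of gradient lists, one per mini-batch,
    all of the same length. The result feeds one optimizer step.
    """
    count = len(mini_batch_grads)
    total = [0.0] * len(mini_batch_grads[0])
    for grads in mini_batch_grads:
        for i, g in enumerate(grads):
            total[i] += g
    return [s / count for s in total]
```

The optimizer then runs once on the averaged gradient, so four mini-batches of 8 examples produce the same number of parameter updates as one batch of 32.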