Principle:Ggml org Ggml Optimizer Configuration

Overview

Optimizer Configuration covers how gradient-based optimization algorithms are set up and parameterized for neural network training. In GGML, the primary optimizer is AdamW (Adam with decoupled weight decay); SGD (Stochastic Gradient Descent) is also supported.

The key configurable elements are:

Learning rate (alpha) -- controls the step size of parameter updates
Momentum parameters (beta1 / beta2) -- govern the exponential moving averages of the first and second moments of the gradient
Epsilon (eps) -- a small constant for numerical stability
Weight decay (wd) -- a regularization term that shrinks the weights directly rather than through the gradient, so it is not rescaled by the adaptive moment estimates

Theory

AdamW Optimizer

AdamW is a variant of the Adam optimizer that decouples weight decay from the gradient-based update. At each timestep t, given gradient g_t:

First moment estimate (mean of gradients):

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t

Second moment estimate (mean of squared gradients):

v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2

Bias-corrected estimates:

m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)

Parameter update with decoupled weight decay:

theta = theta - lr * (m_hat / (sqrt(v_hat) + eps) + wd * theta)

Where:

beta1 (default 0.9) -- controls the decay rate of the first moment
beta2 (default 0.999) -- controls the decay rate of the second moment
eps (default 1e-8) -- prevents division by zero
lr (alpha, default 0.001) -- the learning rate
wd (default 0.0) -- weight decay coefficient
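
The update equations above can be traced in a few lines of plain Python. This is a minimal sketch of the math only, not GGML's implementation: the function name adamw_step and the list-based parameter layout are assumptions made for illustration, and the defaults are the ones listed above.

```python
import math

def adamw_step(theta, grad, m, v, t,
               lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.0):
    """Apply one AdamW update. t is the 1-based step count.

    theta, grad, m, v are equal-length lists of floats.
    Returns the updated (theta, m, v) as new lists.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_prev, v_prev in zip(theta, grad, m, v):
        # Exponential moving averages of the gradient and squared gradient.
        m_t = beta1 * m_prev + (1.0 - beta1) * g
        v_t = beta2 * v_prev + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero-initialized averages.
        m_hat = m_t / (1.0 - beta1 ** t)
        v_hat = v_t / (1.0 - beta2 ** t)
        # Weight decay acts on the parameter itself, outside the adaptive term.
        p_new = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * p)
        new_theta.append(p_new)
        new_m.append(m_t)
        new_v.append(v_t)
    return new_theta, new_m, new_v

# Usage: first step (t = 1) on a single parameter with zero-initialized moments.
theta, m, v = adamw_step([1.0], [1.0], [0.0], [0.0], t=1)
```

On the first step the bias correction cancels the (1 - beta) factors, so the adaptive term has magnitude close to 1 and the parameter moves by roughly lr, regardless of the gradient's scale.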

SGD

Stochastic Gradient Descent is also supported as a simpler alternative. It keeps no moment estimates and updates each parameter by subtracting the gradient scaled by the learning rate: theta = theta - lr * g_t.
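
For comparison with AdamW, a plain SGD step can be sketched as follows. The function name sgd_step is hypothetical, and the learning rate is a required argument because no SGD default is stated here.

```python
def sgd_step(theta, grad, lr):
    """Return parameters after one plain SGD update: theta - lr * grad."""
    return [p - lr * g for p, g in zip(theta, grad)]

# Usage: two parameters, one update with an illustrative learning rate.
updated = sgd_step([1.0, -2.0], [0.5, -1.0], lr=0.1)
```

Unlike AdamW, the step size here scales directly with the gradient magnitude, so the learning rate usually needs more careful tuning.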

Loss Types

The choice of loss function depends on the task:

Cross-entropy loss (GGML_OPT_LOSS_TYPE_CROSS_ENTROPY) -- used for classification tasks where the model outputs a probability distribution over discrete classes
Mean squared error loss (GGML_OPT_LOSS_TYPE_MEAN_SQUARED_ERROR) -- used for regression tasks where the model predicts continuous values
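
The two losses can be written out for a single example as below. This is an illustrative sketch of the standard definitions, not GGML's code: the function names are hypothetical, and how GGML reduces the loss over a batch is not covered here.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    # Subtract the maximum logit before exponentiating to avoid overflow.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]

def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(predictions)

# Usage: a 3-class classification example and a 2-value regression example.
ce = cross_entropy([2.0, 0.5, -1.0], target_index=0)
mse = mean_squared_error([1.0, 2.0], [0.0, 4.0])
```

Cross-entropy depends only on the probability assigned to the correct class, while mean squared error penalizes every output by its squared distance from the target.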

Gradient Accumulation

Gradient accumulation sums gradients over multiple mini-batches before a single optimizer step is performed. This increases the effective batch size without holding a larger batch in memory at once, which is particularly useful when training on hardware with limited memory.
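
The equivalence behind gradient accumulation can be checked on a toy model. The sketch below assumes a one-parameter linear model w * x with a mean squared error loss. All names are hypothetical, and the result matches the full-batch gradient only when every mini-batch has the same size.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Average the gradients of consecutive mini-batches.

    The parameter w is not updated between mini-batches; the optimizer
    step would use the returned value once, after the last mini-batch.
    """
    total = 0.0
    n_chunks = 0
    for start in range(0, len(xs), micro_batch_size):
        stop = start + micro_batch_size
        total += mse_gradient(w, xs[start:stop], ys[start:stop])
        n_chunks += 1
    return total / n_chunks

# Usage: four samples processed as two mini-batches of two.
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
full = mse_gradient(0.5, xs, ys)
accumulated = accumulated_gradient(0.5, xs, ys, micro_batch_size=2)
```

Only one mini-batch is processed at a time, so peak memory follows the mini-batch size while the optimizer sees a gradient for the full batch.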

Source

GGML
