
Principle:Ggml org Ggml Optimizer Configuration

From Leeroopedia
Revision as of 17:44, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Ggml_org_Ggml_Optimizer_Configuration.md)



Overview

Optimizer Configuration addresses the setup and parameterization of gradient-based optimization algorithms used during neural network training. In GGML, this principally involves the AdamW optimizer (Adam with decoupled weight decay), though SGD (Stochastic Gradient Descent) is also supported.

The key configurable elements are:

  • Learning rate (alpha) -- controls the step size of parameter updates
  • Moment decay rates (beta1 / beta2) -- govern the exponential moving averages of the first moment (mean) and second moment (uncentered variance) of the gradient
  • Epsilon (eps) -- a small constant added to the denominator of the update for numerical stability
  • Weight decay (wd) -- a regularization term that shrinks weights toward zero at every step; it is applied to the parameters directly rather than through the gradient, so it is not rescaled by the adaptive moment estimates

Theory

AdamW Optimizer

AdamW is a variant of the Adam optimizer that decouples weight decay from the gradient-based update. Both moment estimates start at zero (m_0 = v_0 = 0). At each timestep t = 1, 2, ..., given gradient g_t:

First moment estimate (mean of gradients):

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t

Second moment estimate (mean of squared gradients):

v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2

Bias-corrected estimates:

m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)

Parameter update with decoupled weight decay:

theta = theta - lr * (m_hat / (sqrt(v_hat) + eps) + wd * theta)

Where:

  • beta1 (default 0.9) -- controls the decay rate of the first moment
  • beta2 (default 0.999) -- controls the decay rate of the second moment
  • eps (default 1e-8) -- prevents division by zero
  • lr (alpha, default 0.001) -- the learning rate
  • wd (default 0.0) -- weight decay coefficient
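
The update rule above can be sketched in plain Python. This is an illustrative reference for the equations only, not GGML's C implementation; the function name `adamw_step` and the list-based optimizer state are assumptions made for this example.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.0):
    """Apply one AdamW update in place; t is the 1-based timestep.

    Hypothetical helper for illustration: theta, grad, m and v are
    equal-length lists of floats, with m and v initialised to zero.
    """
    bc1 = 1.0 - beta1 ** t  # bias-correction denominator for the first moment
    bc2 = 1.0 - beta2 ** t  # bias-correction denominator for the second moment
    for i, g in enumerate(grad):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        m_hat = m[i] / bc1
        v_hat = v[i] / bc2
        # Weight decay acts on theta directly, outside the adaptive ratio.
        theta[i] -= lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta[i])
    return theta


theta = [1.0, -2.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
adamw_step(theta, [0.5, -0.25], m, v, t=1)
print(theta)
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) is close to the sign of the gradient, so with wd = 0 each parameter moves by roughly lr regardless of the gradient's magnitude.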

SGD

Stochastic Gradient Descent is also supported as a simpler alternative. It keeps no moment estimates: each step subtracts the gradient, scaled by the learning rate, from the parameters (theta = theta - lr * g_t).
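
For comparison with AdamW, a plain SGD step can be sketched as below. The helper name `sgd_step` is hypothetical, and the sketch covers only the basic gradient step described here, without any extra terms a particular implementation might add.

```python
def sgd_step(theta, grad, lr):
    """Plain SGD, applied in place: theta <- theta - lr * grad."""
    for i, g in enumerate(grad):
        theta[i] -= lr * g
    return theta


theta = sgd_step([1.0, -2.0], [0.5, 0.5], lr=0.1)
print(theta)
```

Unlike AdamW, the step size here scales linearly with the gradient magnitude, and no per-parameter state has to be stored between steps.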

Loss Types

The choice of loss function depends on the task:

  • Cross-entropy loss (GGML_OPT_LOSS_TYPE_CROSS_ENTROPY) -- used for classification tasks where the model outputs a probability distribution over discrete classes
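  • Mean squared error loss (GGML_OPT_LOSS_TYPE_MEAN_SQUARED_ERROR) -- used for regression tasks where the model predicts continuous values. This is distinct from GGML_OPT_LOSS_TYPE_MEAN, which reduces the model outputs to their mean rather than comparing them against targets.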
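
To make the two task-specific losses concrete, the sketch below computes each for a single example in plain Python. The function names are hypothetical and the code illustrates the textbook definitions, not how GGML evaluates these losses internally.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    peak = max(logits)  # subtract the max before exponentiating, for stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def mean_squared_error(pred, target):
    """Average of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


print(cross_entropy([2.0, 0.5, -1.0], label=0))
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```

Cross-entropy takes unnormalized class scores and an integer class index, while mean squared error compares two vectors of continuous values, which is why the choice follows from the task.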

Gradient Accumulation

Gradient accumulation sums or averages the gradients of several mini-batches before a single optimizer step is performed. This increases the effective batch size without a proportional increase in memory, because only one mini-batch's activations need to be held at a time. It is particularly useful when training on hardware with limited memory.
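
The equivalence behind gradient accumulation can be checked with a small sketch. It assumes a mean-reduced loss and equally sized micro-batches, and uses a one-parameter linear model; all names are hypothetical and chosen for this example.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch):
    """Average the gradients of consecutive, equally sized micro-batches."""
    total, n_chunks = 0.0, 0
    for start in range(0, len(xs), micro_batch):
        total += mse_grad(w, xs[start:start + micro_batch],
                          ys[start:start + micro_batch])
        n_chunks += 1
    return total / n_chunks


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
full = mse_grad(w, xs, ys)                          # one batch of 4
accum = accumulated_grad(w, xs, ys, micro_batch=2)  # two micro-batches of 2
w -= 0.01 * accum  # a single optimizer step after all micro-batches
print(full, accum)
```

Under these assumptions the accumulated gradient matches the full-batch gradient, so the optimizer step is the same as if the larger batch had been processed at once.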

Related

Source

GGML
