Principle:Ggml_org_Ggml_Model_Training
Summary
Training neural networks by iterating over data epochs with gradient-based optimization. The training loop orchestrates the core cycle of forward pass, loss computation, backward pass, and parameter update, repeating this cycle across multiple epochs over the full dataset until the model converges.
Theory
Neural network training is an iterative optimization process that adjusts model parameters to minimize a loss function. The process is structured around several fundamental concepts that work together to produce a trained model.
Training Loop
The training loop is the central control structure. Each iteration of the inner loop processes one batch of data through four sequential stages:
- Forward pass -- Input data flows through the network layers, producing predictions.
- Loss computation -- The predictions are compared against ground-truth labels using a loss function (e.g., cross-entropy for classification), yielding a scalar loss value.
- Backward pass -- Gradients of the loss with respect to every trainable parameter are computed via back-propagation.
- Parameter update -- An optimizer (e.g., AdamW) uses the computed gradients to adjust the model weights in the direction that reduces the loss.
```
for epoch in 1..nepoch:
    shuffle(dataset)
    for batch in dataset.batches():
        predictions = forward(batch.inputs)
        loss = compute_loss(predictions, batch.labels)
        gradients = backward(loss)
        update_weights(parameters, gradients)
```
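The loop above can be sketched as a minimal runnable example. The one-parameter linear model, toy dataset, and hyperparameter values below are purely illustrative (they are not ggml's API); the structure of the loop is what matters:

```python
import random

# Toy dataset for a 1-parameter linear model: learn w so that w * x ≈ 3 * x.
random.seed(0)
dataset = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]

w = 0.0          # single trainable parameter
lr = 0.05        # learning-rate hyperparameter
nepoch = 200
batch_size = 4

for epoch in range(nepoch):
    random.shuffle(dataset)                       # reshuffle each epoch
    for i in range(0, len(dataset), batch_size):  # mini-batch iteration
        batch = dataset[i:i + batch_size]
        xs = [x for x, _ in batch]
        ys = [y for _, y in batch]
        # Forward pass: predictions for the batch.
        preds = [w * x for x in xs]
        # Loss: mean squared error against ground-truth labels.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(batch)
        # Backward pass: d(loss)/dw for this linear model, averaged over the batch.
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(batch)
        # Parameter update: plain SGD step (ggml would use an optimizer such as AdamW).
        w -= lr * grad
```

After training, `w` converges toward 3.0, the slope that generated the data.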
Epochs
An epoch is one complete pass through the entire training dataset. Training typically runs for multiple epochs because a single pass is rarely sufficient for the optimizer to find a good minimum. The number of epochs (nepoch) is a hyperparameter that controls how long training runs.
Batching
The full dataset is divided into batches (also called mini-batches). Rather than computing the gradient over the entire dataset (full-batch gradient descent) or a single sample (pure stochastic gradient descent), mini-batch processing provides a practical compromise:
- Computational efficiency -- Batch operations exploit hardware parallelism (SIMD, GPU warps).
- Gradient stability -- Averaging gradients over a batch reduces noise compared to single-sample updates.
- Memory feasibility -- Batches fit in device memory when the full dataset does not.
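A minimal sketch of the batching step itself; the helper name `batches` is hypothetical, but the slicing pattern is the standard way to partition a dataset into fixed-size mini-batches:

```python
def batches(dataset, batch_size):
    """Yield consecutive mini-batches; the final batch may be smaller."""
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]

data = list(range(10))
sizes = [len(b) for b in batches(data, 4)]  # [4, 4, 2]
```

Note the trailing partial batch: implementations must either process it as-is, pad it, or drop it, and each choice slightly changes the effective gradient averaging.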
Gradient Accumulation
When device memory is too limited for a large batch, gradient accumulation provides a workaround. Gradients are accumulated over multiple smaller (physical) batches before performing a single parameter update, effectively simulating a larger logical batch size without increasing peak memory usage:
```
logical_batch_size  = nbatch_logical
physical_batch_size = ndata_shard
accumulation_steps  = logical_batch_size / physical_batch_size

for step in 1..accumulation_steps:
    batch = next_shard()
    loss = forward_and_loss(batch)
    gradients = backward(loss)
    accumulate(gradients)
update_weights(parameters, accumulated_gradients)
```
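The equivalence can be verified numerically: averaging the per-shard gradients reproduces the gradient of the full logical batch exactly (for equal-sized shards and a mean-reduced loss). The linear model and data below are illustrative, not ggml code:

```python
def grad_mse_linear(w, batch):
    """Mean gradient d/dw of (w*x - y)^2 over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical logical batch of 4 samples, processed as 2 physical shards of 2.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w = 0.0
shard_size = 2
steps = len(data) // shard_size

accumulated = 0.0
for s in range(steps):
    shard = data[s * shard_size:(s + 1) * shard_size]
    accumulated += grad_mse_linear(w, shard)  # accumulate only, no update yet
accumulated /= steps  # average shard gradients to match the logical-batch mean

reference = grad_mse_linear(w, data)  # gradient over the full logical batch
w -= 0.05 * accumulated               # single parameter update
```

Peak memory is bounded by the shard size, since only one physical batch of activations is alive at a time; the running gradient sum is the only extra state.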
Validation Split
A portion of the dataset is held out from training and used exclusively for validation. After each epoch (or periodically during training), the model is evaluated on this validation set using only forward passes (no gradient computation). Validation performance monitors generalization -- the model's ability to perform well on data it has not been trained on -- and helps detect overfitting.
| Split | Purpose | Gradient Computation |
|---|---|---|
| Training | Update model parameters | Yes (forward + backward) |
| Validation | Monitor generalization | No (forward only) |
The fraction of data reserved for validation is controlled by the val_split parameter (e.g., 0.05 holds out 5% of the data).
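One common way to realize such a split is to shuffle once with a fixed seed and hold out the tail fraction; the helper below is a sketch under that assumption (ggml's actual partitioning strategy may differ):

```python
import random

def split_dataset(dataset, val_split=0.05, seed=0):
    """Shuffle a copy once, then hold out the last val_split fraction for validation."""
    data = dataset[:]
    random.Random(seed).shuffle(data)       # fixed seed keeps the split reproducible
    n_val = max(1, int(len(data) * val_split))
    return data[:-n_val], data[-n_val:]

data = list(range(100))
train, val = split_dataset(data, val_split=0.05)  # 95 training samples, 5 validation
```

The fixed seed matters: if the split were recomputed with a different ordering each run, validation samples would leak into training and the validation loss would no longer measure generalization.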
Core Concepts
- Training loop -- The outer structure that repeats the forward-backward-update cycle across epochs and batches until the model converges.
- Epoch -- One complete pass through the training dataset; the fundamental unit of training progress.
- Mini-batch processing -- Dividing the dataset into fixed-size batches to balance gradient accuracy, memory usage, and hardware utilization.
- Gradient accumulation -- Accumulating gradients over multiple physical batches to simulate a larger logical batch size without exceeding device memory.
- Validation split -- Reserving a fraction of the data for evaluation-only passes to monitor generalization and detect overfitting.
- Shuffling -- Randomly reordering the dataset (or its shards) before each epoch to prevent the optimizer from exploiting data ordering patterns.
Key Operations
| Operation | Description |
|---|---|
| Shuffle dataset | Randomly permute shards before each epoch to improve generalization and reduce ordering bias. |
| Forward pass | Propagate input tensors through the computation graph to produce output predictions. |
| Compute loss | Evaluate the loss function (e.g., cross-entropy) comparing predictions against ground-truth labels. |
| Backward pass | Compute gradients of the loss with respect to all trainable parameters via back-propagation. |
| Update weights | Apply the optimizer (e.g., AdamW) to adjust parameters using the accumulated gradients. |
| Validate | Run forward-only evaluation on held-out data to measure loss and accuracy without updating weights. |
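The Validate operation can be sketched as a forward-only pass over the held-out split; the toy linear model below stands in for a real network, and no gradient state is touched:

```python
def forward(w, x):
    return w * x  # toy one-parameter linear model, standing in for a network

def validate(w, val_set):
    """Forward-only evaluation: mean squared error, no gradients, no weight updates."""
    return sum((forward(w, x) - y) ** 2 for x, y in val_set) / len(val_set)

val_set = [(1.0, 3.0), (2.0, 6.0)]
loss_untrained = validate(0.0, val_set)  # 22.5
loss_trained = validate(3.0, val_set)    # 0.0
```

Because no backward pass runs, validation is cheaper than training per sample and leaves the parameters untouched, which is what makes it a clean probe of generalization.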
Problem Solved
Without a structured training loop, the process of iteratively optimizing model parameters would require manually orchestrating data shuffling, batching, gradient computation, accumulation, parameter updates, and validation evaluation. The Model Training principle encapsulates this entire workflow into a coherent, repeatable process that:
- Iterates over the dataset for a configurable number of epochs, shuffling data each time.
- Handles mini-batch iteration with support for gradient accumulation to decouple logical batch size from physical memory constraints.
- Automatically partitions data into training and validation splits based on a single parameter.
- Provides progress monitoring through loss and accuracy metrics on both training and validation sets.