Principle:Ggml_org_Ggml_Model_Training
Summary
Training neural networks by iterating over data epochs with gradient-based optimization. The training loop orchestrates the core cycle of forward pass, loss computation, backward pass, and parameter update, repeating this cycle across multiple epochs over the full dataset until the model converges.
Theory
Neural network training is an iterative optimization process that adjusts model parameters to minimize a loss function. The process is structured around several fundamental concepts that work together to produce a trained model.
Training Loop
The training loop is the central control structure. Each iteration of the inner loop processes one batch of data through four sequential stages:
- Forward pass -- Input data flows through the network layers, producing predictions.
- Loss computation -- The predictions are compared against ground-truth labels using a loss function (e.g., cross-entropy for classification), yielding a scalar loss value.
- Backward pass -- Gradients of the loss with respect to every trainable parameter are computed via back-propagation.
- Parameter update -- An optimizer (e.g., AdamW) uses the computed gradients to adjust the model weights in the direction that reduces the loss.
```
for epoch in 1..nepoch:
    shuffle(dataset)
    for batch in dataset.batches():
        predictions = forward(batch.inputs)
        loss = compute_loss(predictions, batch.labels)
        gradients = backward(loss)
        update_weights(parameters, gradients)
```
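The loop above can be sketched as a minimal runnable example. The one-parameter linear model, toy dataset, and hyperparameter values below are purely illustrative (they are not ggml's API); the structure of the loop is what matters:

```python
import random

# Toy dataset for a 1-parameter linear model: learn w so that w * x ≈ 3 * x.
random.seed(0)
dataset = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]

w = 0.0          # single trainable parameter
lr = 0.05        # learning-rate hyperparameter
nepoch = 200
batch_size = 4

for epoch in range(nepoch):
    random.shuffle(dataset)                       # reshuffle each epoch
    for i in range(0, len(dataset), batch_size):  # mini-batch iteration
        batch = dataset[i:i + batch_size]
        xs = [x for x, _ in batch]
        ys = [y for _, y in batch]
        # Forward pass: predictions for the batch.
        preds = [w * x for x in xs]
        # Loss: mean squared error against ground-truth labels.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(batch)
        # Backward pass: d(loss)/dw for this linear model, averaged over the batch.
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(batch)
        # Parameter update: plain SGD step (ggml would use an optimizer such as AdamW).
        w -= lr * grad
```

After training, `w` converges toward 3.0, the slope that generated the data.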
Epochs
An epoch is one complete pass through the entire training dataset. Training typically runs for multiple epochs because a single pass is rarely sufficient for the optimizer to find a good minimum. The number of epochs (nepoch) is a hyperparameter that controls how long training runs.
Batching
The full dataset is divided into batches (also called mini-batches). Rather than computing the gradient over the entire dataset (full-batch gradient descent) or a single sample (pure stochastic gradient descent), mini-batch processing provides a practical compromise:
- Computational efficiency -- Batch operations exploit hardware parallelism (SIMD, GPU warps).
- Gradient stability -- Averaging gradients over a batch reduces noise compared to single-sample updates.
- Memory feasibility -- Batches fit in device memory when the full dataset does not.
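A minimal sketch of the batching step itself; the helper name `batches` is hypothetical, but the slicing pattern is the standard way to partition a dataset into fixed-size mini-batches:

```python
def batches(dataset, batch_size):
    """Yield consecutive mini-batches; the final batch may be smaller."""
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]

data = list(range(10))
sizes = [len(b) for b in batches(data, 4)]  # [4, 4, 2]
```

Note the trailing partial batch: implementations must either process it as-is, pad it, or drop it, and each choice slightly changes the effective gradient averaging.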
Gradient Accumulation
When device memory is too limited for a large batch, gradient accumulation provides a workaround. Gradients are accumulated over multiple smaller (physical) batches before performing a single parameter update, effectively simulating a larger logical batch size without increasing peak memory usage:
```
logical_batch_size  = nbatch_logical
physical_batch_size = ndata_shard
accumulation_steps  = logical_batch_size / physical_batch_size

for step in 1..accumulation_steps:
    batch = next_shard()
    loss = forward_and_loss(batch)
    gradients = backward(loss)
    accumulate(gradients)
update_weights(parameters, accumulated_gradients)
```
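The equivalence can be verified numerically: averaging the per-shard gradients reproduces the gradient of the full logical batch exactly (for equal-sized shards and a mean-reduced loss). The linear model and data below are illustrative, not ggml code:

```python
def grad_mse_linear(w, batch):
    """Mean gradient d/dw of (w*x - y)^2 over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical logical batch of 4 samples, processed as 2 physical shards of 2.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w = 0.0
shard_size = 2
steps = len(data) // shard_size

accumulated = 0.0
for s in range(steps):
    shard = data[s * shard_size:(s + 1) * shard_size]
    accumulated += grad_mse_linear(w, shard)  # accumulate only, no update yet
accumulated /= steps  # average shard gradients to match the logical-batch mean

reference = grad_mse_linear(w, data)  # gradient over the full logical batch
w -= 0.05 * accumulated               # single parameter update
```

Peak memory is bounded by the shard size, since only one physical batch of activations is alive at a time; the running gradient sum is the only extra state.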
Validation Split
A portion of the dataset is held out from training and used exclusively for validation. After each epoch (or periodically during training), the model is evaluated on this validation set using only forward passes (no gradient computation). Validation performance monitors generalization -- the model's ability to perform well on data it has not been trained on -- and helps detect overfitting.
| Split | Purpose | Gradient Computation |
|---|---|---|
| Training | Update model parameters | Yes (forward + backward) |
| Validation | Monitor generalization | No (forward only) |
The fraction of data reserved for validation is controlled by the val_split parameter (e.g., 0.05 holds out 5% of the data).
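One common way to realize such a split is to shuffle once with a fixed seed and hold out the tail fraction; the helper below is a sketch under that assumption (ggml's actual partitioning strategy may differ):

```python
import random

def split_dataset(dataset, val_split=0.05, seed=0):
    """Shuffle a copy once, then hold out the last val_split fraction for validation."""
    data = dataset[:]
    random.Random(seed).shuffle(data)       # fixed seed keeps the split reproducible
    n_val = max(1, int(len(data) * val_split))
    return data[:-n_val], data[-n_val:]

data = list(range(100))
train, val = split_dataset(data, val_split=0.05)  # 95 training samples, 5 validation
```

The fixed seed matters: if the split were recomputed with a different ordering each run, validation samples would leak into training and the validation loss would no longer measure generalization.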
Core Concepts
- Training loop -- The outer structure that repeats the forward-backward-update cycle across epochs and batches until the model converges.
- Epoch -- One complete pass through the training dataset; the fundamental unit of training progress.
- Mini-batch processing -- Dividing the dataset into fixed-size batches to balance gradient accuracy, memory usage, and hardware utilization.
- Gradient accumulation -- Accumulating gradients over multiple physical batches to simulate a larger logical batch size without exceeding device memory.
- Validation split -- Reserving a fraction of the data for evaluation-only passes to monitor generalization and detect overfitting.
- Shuffling -- Randomly reordering the dataset (or its shards) before each epoch to prevent the optimizer from exploiting data ordering patterns.
Key Operations
| Operation | Description |
|---|---|
| Shuffle dataset | Randomly permute shards before each epoch to improve generalization and reduce ordering bias. |
| Forward pass | Propagate input tensors through the computation graph to produce output predictions. |
| Compute loss | Evaluate the loss function (e.g., cross-entropy) comparing predictions against ground-truth labels. |
| Backward pass | Compute gradients of the loss with respect to all trainable parameters via back-propagation. |
| Update weights | Apply the optimizer (e.g., AdamW) to adjust parameters using the accumulated gradients. |
| Validate | Run forward-only evaluation on held-out data to measure loss and accuracy without updating weights. |
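The Validate operation can be sketched as a forward-only pass over the held-out split; the toy linear model below stands in for a real network, and no gradient state is touched:

```python
def forward(w, x):
    return w * x  # toy one-parameter linear model, standing in for a network

def validate(w, val_set):
    """Forward-only evaluation: mean squared error, no gradients, no weight updates."""
    return sum((forward(w, x) - y) ** 2 for x, y in val_set) / len(val_set)

val_set = [(1.0, 3.0), (2.0, 6.0)]
loss_untrained = validate(0.0, val_set)  # 22.5
loss_trained = validate(3.0, val_set)    # 0.0
```

Because no backward pass runs, validation is cheaper than training per sample and leaves the parameters untouched, which is what makes it a clean probe of generalization.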
Problem Solved
Without a structured training loop, the process of iteratively optimizing model parameters would require manually orchestrating data shuffling, batching, gradient computation, accumulation, parameter updates, and validation evaluation. The Model Training principle encapsulates this entire workflow into a coherent, repeatable process that:
- Iterates over the dataset for a configurable number of epochs, shuffling data each time.
- Handles mini-batch iteration with support for gradient accumulation to decouple logical batch size from physical memory constraints.
- Automatically partitions data into training and validation splits based on a single parameter.
- Provides progress monitoring through loss and accuracy metrics on both training and validation sets.