

Heuristic: ggml-org/ggml Gradient Accumulation Batch Sizing

From Leeroopedia



Knowledge Sources
Domains Training, Optimization
Last Updated 2026-02-10 07:40 GMT

Overview

Set the logical batch size to a multiple of the physical batch size to achieve gradient accumulation, trading additional forward/backward passes for a larger effective batch size within a fixed memory budget (the MNIST example uses a logical batch of 1000 with a physical batch of 500).

Description

GGML's training framework distinguishes between logical batch size (how many samples contribute to one gradient update) and physical batch size (how many samples are processed in parallel on hardware). When the logical batch exceeds the physical batch, gradients are accumulated across multiple forward/backward passes before the optimizer step. This enables training with large effective batch sizes even on memory-constrained hardware.

Additionally, the MNIST example demonstrates that for small models, data shuffling overhead can dominate training time, especially with CUDA. Using shard-based shuffling (shuffling groups of 10 samples instead of individual samples) reduces this overhead while maintaining sufficient randomization.

Usage

Use this heuristic when training models with GGML's optimizer framework and you need a larger effective batch size than your hardware memory allows. It is also relevant when you observe that data shuffling overhead is significant compared to compute time (common with small models on GPUs).

The Insight (Rule of Thumb)

  • Action: Set `NBATCH_LOGICAL` to a multiple of `NBATCH_PHYSICAL`. Ensure both divide evenly into the training set size.
  • Value: MNIST example uses logical=1000, physical=500 (2x accumulation). Shard size of 10 for shuffling.
  • Trade-off: A larger physical batch uses more memory but utilizes compute better. A larger logical batch yields smoother gradient estimates but fewer optimizer updates per epoch, so convergence per sample seen can be slower.
  • Constraint: `NBATCH_LOGICAL % NBATCH_PHYSICAL == 0` and `NTRAIN % NBATCH_LOGICAL == 0` are enforced by static assertions.

Reasoning

The separation of logical and physical batch sizes is a standard deep learning technique. In GGML's implementation, the static assertions ensure correct division:

  • Gradient quality: A logical batch of 1000 provides stable gradient estimates for MNIST's 60000-sample training set (60 updates per epoch).
  • Memory control: A physical batch of 500 processes half the logical batch at a time, halving peak memory usage.
  • Shuffling overhead: For small models like MNIST, the time to shuffle 60000 individual indices is non-negligible compared to the forward/backward passes. Shuffling shards of 10 cuts the number of shuffled indices by 10x (60000 to 6000) while keeping sufficient randomness.
  • Validation split: 5% validation (3000 samples) is conservative but adequate for monitoring overfitting on MNIST.

Code Evidence

Batch size definitions and constraints from `examples/mnist/mnist-common.h:18-26`:

// Gradient accumulation can be achieved by setting the logical batch size
// to a multiple of the physical one.
// The logical batch size determines how many datapoints are used for a
// gradient update.
// The physical batch size determines how many datapoints are processed in
// parallel, larger values utilize compute better but need more memory.
#define MNIST_NBATCH_LOGICAL  1000
#define MNIST_NBATCH_PHYSICAL  500

static_assert(MNIST_NBATCH_LOGICAL % MNIST_NBATCH_PHYSICAL == 0,
    "MNIST_NBATCH_LOGICAL % MNIST_NBATCH_PHYSICAL != 0");
static_assert(MNIST_NTRAIN % MNIST_NBATCH_LOGICAL == 0,
    "MNIST_NTRAIN % MNIST_NBATCH_LOGICAL != 0");

Shard-based shuffling from `examples/mnist/mnist-train.cpp:20-23`:

// The MNIST model is so small that the overhead from data shuffling is
// non-negligible, especially with CUDA.
// With a shard size of 10 this overhead is greatly reduced at the cost of
// less shuffling (does not seem to have a significant impact).
ggml_opt_dataset_t dataset = ggml_opt_dataset_init(
    GGML_TYPE_F32, GGML_TYPE_F32, MNIST_NINPUT, MNIST_NCLASSES,
    MNIST_NTRAIN, /*ndata_shard =*/ 10);
