Principle:Ggml_org_Ggml_Training_Data_Loading

Summary

Loading and organizing training data into structured datasets for neural network training. The dataset abstraction provides a structured interface between raw data files and the training loop, handling normalization, label encoding, and efficient batching so that model code can consume data through a uniform API regardless of the underlying data format.

Theory

Training a neural network requires converting raw stored data into a form the model can consume. This process involves several well-established stages:

Dataset Abstraction

A dataset encapsulates two parallel collections: data points (input features) and labels (ground-truth targets). Each data point is a fixed-length vector of floating-point values, and each label is a vector encoding the correct output class or regression target. The abstraction decouples storage format from training logic -- the same training loop works whether the underlying source is MNIST, CIFAR, or any other dataset that conforms to the interface.
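The two parallel collections can be pictured with a small container. This is a minimal sketch, not GGML's API: the `Dataset` class and its field names are hypothetical and only illustrate the invariants described above (equal row counts, fixed-length rows).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dataset:
    """Hypothetical container holding two parallel collections."""
    data: List[List[float]]    # ndata rows, each with the same number of features
    labels: List[List[float]]  # ndata rows, each with the same label length

    def __post_init__(self) -> None:
        # The collections are parallel: row i of data belongs to row i of labels.
        if len(self.data) != len(self.labels):
            raise ValueError("data and labels must have the same number of rows")
        # Every data point and every label has a fixed length.
        if len({len(row) for row in self.data}) > 1:
            raise ValueError("all data points must have the same length")
        if len({len(row) for row in self.labels}) > 1:
            raise ValueError("all labels must have the same length")

    @property
    def ndata(self) -> int:
        return len(self.data)


# Two samples with 4 features and 3 classes each.
toy = Dataset(
    data=[[0.0, 0.5, 1.0, 0.25], [1.0, 1.0, 0.0, 0.0]],
    labels=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
)
```

A training loop written against such a container only needs `ndata` and the two row collections, which is what makes it independent of the source file format.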

Data Normalization

Raw input values are often stored as small integers (e.g., 8-bit pixel intensities in the range [0, 255]). Before being fed to a network, they are normalized to the [0, 1] range by dividing by the maximum representable value, which is 255 for 8-bit data:

normalized = raw_value / 255.0

Normalization helps keep gradient magnitudes well-behaved during back-propagation and prevents features with large numeric ranges from dominating the loss.
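As a minimal sketch of the formula above, assuming 8-bit input, the helper below maps a byte buffer to floats in the unit interval. The name `normalize_bytes` is illustrative and is not taken from GGML.

```python
def normalize_bytes(raw: bytes) -> list:
    """Map each byte (0..255) to a float in [0.0, 1.0]."""
    return [b / 255.0 for b in raw]


# Three example pixel intensities: black, dark grey, white.
pixels = bytes([0, 51, 255])
normalized = normalize_bytes(pixels)
```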

Label Encoding (One-Hot)

Classification labels are encoded as one-hot vectors -- a vector of length equal to the number of classes, with a 1.0 at the index corresponding to the true class and 0.0 elsewhere. For a 10-class problem such as MNIST digit recognition:

Digit	One-Hot Vector
0	`[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]`
3	`[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]`
9	`[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]`

One-hot encoding converts a discrete class index into a target vector with the same length as the network's output, so it can be compared directly against the predicted class probabilities in a cross-entropy loss.
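The encoding and its relationship to cross-entropy can be sketched as follows. Both helper names are illustrative. With a one-hot target, the cross-entropy sum reduces to the negative log of the probability assigned to the true class.

```python
import math


def one_hot(class_index: int, num_classes: int) -> list:
    """Return a vector with 1.0 at class_index and 0.0 elsewhere."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index out of range")
    vec = [0.0] * num_classes
    vec[class_index] = 1.0
    return vec


def cross_entropy(target: list, probs: list) -> float:
    """Cross-entropy between a target vector and predicted probabilities."""
    # Terms with a zero target contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


digit_three = one_hot(3, 10)
# Only the probability of the true class (index 1 here) affects the loss.
loss = cross_entropy(one_hot(1, 3), [0.2, 0.5, 0.3])
```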

Batching and Shuffling

Datasets are divided into shards of a fixed size. A shard is the unit at which the dataset is shuffled, and a batch is assembled from one or more shards; in the simplest configuration one shard equals one batch. During each training epoch the shard order is shuffled so that the model does not see data in the same order every time. Shuffling decorrelates consecutive gradient updates, which matters when the stored order is itself correlated (for example, sorted by class). The batch size determines how many samples contribute to a single gradient update (mini-batch stochastic gradient descent).
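Shard-level shuffling can be sketched as a permutation of shard indices followed by batch assembly. This is an illustration under stated assumptions, not GGML's implementation: the function names and the `shards_per_batch` parameter are hypothetical. Setting `shards_per_batch` to 1 makes a shard and a batch the same thing.

```python
import random


def shuffled_shard_order(nshards: int, rng: random.Random) -> list:
    """Return a random permutation of the shard indices 0..nshards-1."""
    order = list(range(nshards))
    rng.shuffle(order)
    return order


def batch_indices(order: list, shard_size: int, shards_per_batch: int):
    """Yield one list of sample indices per batch, following the shard order."""
    # A trailing group with fewer than shards_per_batch shards is dropped.
    for start in range(0, len(order) - shards_per_batch + 1, shards_per_batch):
        batch = []
        for shard in order[start:start + shards_per_batch]:
            first = shard * shard_size
            # Samples inside a shard keep their stored order.
            batch.extend(range(first, first + shard_size))
        yield batch


rng = random.Random(0)                    # fixed seed for reproducibility
order = shuffled_shard_order(4, rng)      # 4 shards of 2 samples = 8 samples
batches = list(batch_indices(order, shard_size=2, shards_per_batch=2))
```

Shuffling whole shards rather than individual samples keeps each copy into the batch contiguous, at the cost of samples within a shard always appearing together.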

Core Concepts

Raw data ingestion -- read binary-formatted data files (e.g., IDX format for MNIST images and labels) and extract individual samples into contiguous memory.
Normalization to [0, 1] -- rescale integer-valued features to floating-point values in the unit interval, ensuring stable gradient flow.
One-hot label encoding -- convert scalar class indices to fixed-length binary vectors that serve as targets for a softmax output layer.
Shard organization -- partition the full dataset into fixed-size shards that can be shuffled as units and combined into batches for the optimizer.
Tensor-backed storage -- store both data and labels as GGML tensors ([ne_datapoint, ndata] and [ne_label, ndata]), enabling direct use with GGML computation graphs and backend memory management.
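The `[ne_datapoint, ndata]` shape lists the fastest-varying dimension first, so in a contiguous buffer each sample occupies one consecutive run of `ne_datapoint` elements. The sketch below mimics that layout with a flat Python list. `pack` and `flat_index` are illustrative helpers, not GGML functions.

```python
def pack(rows: list) -> list:
    """Concatenate per-sample rows into one flat, contiguous buffer."""
    flat = []
    for row in rows:
        flat.extend(row)
    return flat


def flat_index(i_element: int, i_sample: int, ne_per_sample: int) -> int:
    """Offset of element i_element of sample i_sample in the flat buffer."""
    return i_sample * ne_per_sample + i_element


# Three samples with two features each: shape [ne_datapoint=2, ndata=3].
rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
buffer = pack(rows)
second_feature_of_last_sample = buffer[flat_index(1, 2, ne_per_sample=2)]
```

With this layout, a shard of consecutive samples is a single contiguous slice of the buffer, so copying it into a batch is one bulk copy.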

Key Operations

Operation	Description
Read raw data	Parse binary file headers and bulk-read sample bytes into a flat buffer (e.g., 28x28 pixel images from IDX files).
Normalize	Divide each byte value by 255.0 to produce `GGML_TYPE_F32` values in [0, 1].
Encode labels	Read class indices and expand each into a one-hot vector of length `ne_label`.
Build dataset	Allocate a `ggml_opt_dataset_t` holding a data tensor and a labels tensor, sized to the full training or test set.
Shuffle shards	Randomly permute shard order before each epoch to improve generalization.
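The "Read raw data" step can be sketched for the IDX format used by MNIST. An IDX file starts with a 4-byte magic number (two zero bytes, a type code where `0x08` means unsigned byte, and the number of dimensions), followed by one big-endian 32-bit size per dimension and then the raw payload. The `parse_idx` function is an illustrative helper that handles only the unsigned-byte case and operates on an in-memory buffer.

```python
import struct


def parse_idx(buf: bytes):
    """Return (dims, payload) for an IDX buffer holding unsigned bytes."""
    zero, dtype, ndim = struct.unpack_from(">HBB", buf, 0)
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # One big-endian uint32 per dimension follows the magic number.
    dims = struct.unpack_from(">" + "I" * ndim, buf, 4)
    offset = 4 + 4 * ndim
    count = 1
    for d in dims:
        count *= d
    payload = buf[offset:offset + count]
    if len(payload) != count:
        raise ValueError("truncated IDX payload")
    return list(dims), payload


# Synthetic buffer: two 2x2 "images" with pixel values 0..7.
header = bytes([0, 0, 0x08, 3]) + struct.pack(">III", 2, 2, 2)
dims, payload = parse_idx(header + bytes(range(8)))
```

For the MNIST image files the three dimensions are the sample count, rows, and columns; the label files have a single dimension.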

Problem Solved

Without a dataset abstraction, every training program would need its own ad-hoc parsing, normalization, and batching logic scattered throughout the training loop. The Training Data Loading principle centralizes these responsibilities into a single, reusable layer that:

Reads raw binary data once into efficiently laid-out tensors.
Normalizes and encodes data at load time, avoiding per-iteration overhead.
Provides a shard-based iterator that the optimizer consumes without knowledge of the underlying file format.
