Principle:Ggml_org_Ggml_Training_Data_Loading
Summary
Loading and organizing training data into structured datasets for neural network training. The dataset abstraction provides a structured interface between raw data files and the training loop, handling normalization, label encoding, and efficient batching so that model code can consume data through a uniform API regardless of the underlying data format.
Theory
Training a neural network requires converting raw stored data into a form the model can consume. This process involves several well-established stages:
Dataset Abstraction
A dataset encapsulates two parallel collections: data points (input features) and labels (ground-truth targets). Each data point is a fixed-length vector of floating-point values, and each label is a vector encoding the correct output class or regression target. The abstraction decouples storage format from training logic -- the same training loop works whether the underlying source is MNIST, CIFAR, or any other dataset that conforms to the interface.
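The two parallel collections can be pictured with a minimal sketch. The `Dataset` class below is a hypothetical illustration in plain Python, not the GGML type; it only shows the invariants described above (equal sample counts, fixed-length vectors).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Dataset:
    """Hypothetical container: two parallel collections of equal length."""
    data: List[List[float]]    # one fixed-length feature vector per sample
    labels: List[List[float]]  # one fixed-length target vector per sample

    def __post_init__(self) -> None:
        if len(self.data) != len(self.labels):
            raise ValueError("data and labels must hold the same number of samples")
        # Every data point, and every label, must share one fixed length.
        if len({len(x) for x in self.data}) > 1:
            raise ValueError("all data points must have the same length")
        if len({len(y) for y in self.labels}) > 1:
            raise ValueError("all labels must have the same length")

    def __len__(self) -> int:
        return len(self.data)

    def sample(self, i: int) -> Tuple[List[float], List[float]]:
        """Return the (data point, label) pair at index i."""
        return self.data[i], self.labels[i]


# A toy dataset: 2 samples, 4 features each, 3 classes.
toy = Dataset(
    data=[[0.0, 0.5, 1.0, 0.25], [1.0, 1.0, 0.0, 0.0]],
    labels=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
)
```

Training code written against `sample` and `len` never needs to know which file format the values came from.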
Data Normalization
Raw input values are typically stored as integers (e.g., pixel intensities in the range [0, 255]). Before feeding them to a network, values are normalized to the [0, 1] range by dividing by the maximum representable value:
normalized = raw_value / 255.0
Normalization helps keep gradient magnitudes well-behaved during back-propagation and prevents features with large numeric ranges from dominating the loss.
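As a small illustration of the formula, the sketch below rescales a buffer of unsigned bytes. The function name `normalize_bytes` is an assumption made for this example.

```python
def normalize_bytes(raw: bytes) -> list:
    """Map each unsigned byte (0..255) to a float in [0.0, 1.0]."""
    return [b / 255.0 for b in raw]


# Three sample pixel intensities: black, a dark gray, and white.
pixels = bytes([0, 51, 255])
normalized = normalize_bytes(pixels)
```

Doing this once at load time means the training loop only ever sees floating-point values in the unit interval.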
Label Encoding (One-Hot)
Classification labels are encoded as one-hot vectors -- a vector of length equal to the number of classes, with a 1.0 at the index corresponding to the true class and 0.0 elsewhere. For a 10-class problem such as MNIST digit recognition:
| Digit | One-Hot Vector |
|---|---|
| 0 | [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| 3 | [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] |
| 9 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] |
One-hot encoding converts a discrete class index into a continuous target vector suitable for cross-entropy loss computation.
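The encoding itself is a few lines of code. The helper `one_hot` below is a hypothetical name used only for this sketch.

```python
def one_hot(class_index: int, num_classes: int) -> list:
    """Return a vector with 1.0 at class_index and 0.0 everywhere else."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index must lie in [0, num_classes)")
    vec = [0.0] * num_classes
    vec[class_index] = 1.0
    return vec


# The digit 3 in a 10-class problem.
digit_three = one_hot(3, 10)
```

Each encoded vector sums to 1.0, so it can be read as a probability distribution that puts all of its mass on the true class.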
Batching and Shuffling
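Datasets are divided into shards of a fixed size. A shard is the unit at which the dataset is shuffled: before each training epoch the shard order is randomly permuted, so the model does not see data in the same order every time. Shuffling decorrelates consecutive samples, which reduces the bias that ordered data (for example, files sorted by class) would otherwise introduce into successive gradient updates. A batch, the set of samples drawn together for a single gradient update in mini-batch stochastic gradient descent, is assembled from one or more whole shards, so the batch size must be a multiple of the shard size. In the simplest configuration the two are equal.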
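The following sketch shows shard-level shuffling on sample indices alone. It assumes that a batch may span one or more whole shards (with a batch size equal to the shard size, shard and batch coincide). The function name and the parameters `ndata_shard` and `nbatch` are illustrative assumptions, not a transcription of the GGML implementation.

```python
import random


def shuffled_batches(ndata: int, ndata_shard: int, nbatch: int, rng: random.Random) -> list:
    """Return one list of sample indices per batch for a single epoch.

    Samples inside a shard stay contiguous; only the shard order is permuted.
    """
    if ndata % ndata_shard != 0:
        raise ValueError("ndata must be a multiple of ndata_shard")
    if nbatch % ndata_shard != 0:
        raise ValueError("nbatch must be a multiple of ndata_shard")

    nshards = ndata // ndata_shard
    order = list(range(nshards))
    rng.shuffle(order)  # new permutation of shard indices for this epoch

    shards_per_batch = nbatch // ndata_shard
    batches = []
    # Trailing shards that do not fill a whole batch are skipped.
    for start in range(0, nshards - shards_per_batch + 1, shards_per_batch):
        batch = []
        for shard in order[start:start + shards_per_batch]:
            batch.extend(range(shard * ndata_shard, (shard + 1) * ndata_shard))
        batches.append(batch)
    return batches


# 8 samples, shards of 2 samples, batches of 4 samples (2 shards per batch).
batches = shuffled_batches(ndata=8, ndata_shard=2, nbatch=4, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the permutation reproducible, which is convenient when comparing training runs.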
Core Concepts
- Raw data ingestion -- read binary-formatted data files (e.g., IDX format for MNIST images and labels) and extract individual samples into contiguous memory.
- Normalization to [0, 1] -- rescale integer-valued features to floating-point values in the unit interval, ensuring stable gradient flow.
- One-hot label encoding -- convert scalar class indices to fixed-length binary vectors that serve as targets for a softmax output layer.
- Shard organization -- partition the full dataset into fixed-size batches (shards) that can be independently shuffled and fed to the optimizer.
- Tensor-backed storage -- store both data and labels as GGML tensors ([ne_datapoint, ndata] and [ne_label, ndata]), enabling direct use with GGML computation graphs and backend memory management.
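The raw data ingestion step can be sketched for the IDX layout used by the MNIST files: a 4-byte magic number (two zero bytes, a data-type code, and the number of dimensions), followed by one big-endian 32-bit size per dimension, followed by the payload. The parser below handles only the unsigned-byte type code 0x08 and runs on a synthetic in-memory buffer. `parse_idx_u8` is a hypothetical helper, not part of GGML.

```python
import struct


def parse_idx_u8(blob: bytes):
    """Parse an IDX buffer of unsigned bytes; return (dims, payload)."""
    if len(blob) < 4:
        raise ValueError("buffer too short for an IDX header")
    # Magic number: 2 zero bytes, 1 data-type byte, 1 dimension-count byte.
    zero, dtype, ndim = struct.unpack(">HBB", blob[:4])
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")

    header_len = 4 + 4 * ndim
    dims = struct.unpack(">" + "I" * ndim, blob[4:header_len])

    expected = 1
    for d in dims:
        expected *= d
    payload = blob[header_len:]
    if len(payload) != expected:
        raise ValueError("payload size does not match the header dimensions")
    return list(dims), payload


# Synthetic stand-in for an image file: 2 images of 2x2 pixels each.
blob = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
dims, payload = parse_idx_u8(blob)
```

Checking the payload length against the header dimensions catches truncated downloads before any values reach the dataset.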
Key Operations
| Operation | Description |
|---|---|
| Read raw data | Parse binary file headers and bulk-read sample bytes into a flat buffer (e.g., 28x28 pixel images from IDX files). |
| Normalize | Divide each byte value by 255.0 to produce GGML_TYPE_F32 values in [0, 1]. |
| Encode labels | Read class indices and expand each into a one-hot vector of length ne_label. |
| Build dataset | Allocate a ggml_opt_dataset_t holding a data tensor and a labels tensor, sized to the full training or test set. |
| Shuffle shards | Randomly permute shard order before each epoch to improve generalization. |
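To make the storage layout concrete, the sketch below fills two flat 32-bit float buffers shaped like [ne_datapoint, ndata] and [ne_label, ndata], assuming the first dimension is the contiguous one so that sample i occupies elements i * ne_datapoint up to (i + 1) * ne_datapoint. Python's `array` module stands in for tensor memory. `build_buffers` and `datapoint` are hypothetical helpers and no GGML calls are made.

```python
from array import array


def build_buffers(images: bytes, class_ids: bytes, ne_datapoint: int, ne_label: int):
    """Return (data, labels) as flat float32 buffers, one contiguous run per sample."""
    ndata = len(class_ids)
    if len(images) != ndata * ne_datapoint:
        raise ValueError("image bytes do not match ndata * ne_datapoint")

    # Data buffer: normalized pixel values, sample after sample.
    data = array("f", (b / 255.0 for b in images))

    # Labels buffer: start from all zeros, then set one element per sample.
    labels = array("f", [0.0]) * (ndata * ne_label)
    for i, class_id in enumerate(class_ids):
        if class_id >= ne_label:
            raise ValueError("class index out of range")
        labels[i * ne_label + class_id] = 1.0
    return data, labels


def datapoint(data: array, ne_datapoint: int, i: int) -> array:
    """Slice sample i out of the flat data buffer."""
    return data[i * ne_datapoint:(i + 1) * ne_datapoint]


# Two 4-pixel "images" with class indices 2 and 0, in a 3-class problem.
data, labels = build_buffers(
    images=bytes([0, 255, 255, 0, 255, 255, 0, 0]),
    class_ids=bytes([2, 0]),
    ne_datapoint=4,
    ne_label=3,
)
```

Because each sample is a contiguous run, a shard of consecutive samples can be copied with a single slice.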
Problem Solved
Without a dataset abstraction, each training program would need its own ad-hoc parsing, normalization, and batching logic, scattered throughout the training loop. The Training Data Loading principle centralizes these responsibilities into a single, reusable layer that:
- Reads raw binary data once into efficiently laid-out tensors.
- Normalizes and encodes data at load time, avoiding per-iteration overhead.
- Provides a shard-based iterator that the optimizer consumes without knowledge of the underlying file format.