Principle: TensorFlow.js (tfjs) Training Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Preprocessing, Deep_Learning |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Training data preparation is the process of converting raw data into properly shaped, typed, and normalized tensor representations that a neural network can consume during training.
Description
Neural networks operate exclusively on multi-dimensional numerical arrays (tensors). Raw data — whether it originates from CSV files, databases, user input, sensor readings, or images — must undergo a series of transformations before it can be fed into a model. Data preparation is often the most time-consuming and error-prone step in the machine learning pipeline, and mistakes here propagate silently into poor model performance.
The data preparation pipeline involves four core stages:
- Numeric conversion — All data must be represented as numbers. Categorical variables are one-hot encoded or mapped to integer indices. Text is tokenized and converted to numerical sequences. Images are decomposed into pixel values.
- Tensor shaping — Data must be organized into tensors with the correct dimensionality. For tabular data, this is typically a 2D tensor with shape [numSamples, numFeatures]. For images, it is [numSamples, height, width, channels]. For sequences, it is [numSamples, timesteps, features].
- Normalization/scaling — Raw feature values often span vastly different ranges (e.g., age 0-100 vs. salary 0-1000000). Normalizing all features to a common scale (typically [0, 1] or mean=0, std=1) prevents features with larger magnitudes from dominating the gradient updates.
- Batching and streaming — For datasets that exceed available memory, data must be streamed in batches via a dataset pipeline rather than loaded all at once.
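The four stages can be sketched end-to-end for a tiny tabular dataset. This is a minimal plain-JavaScript illustration with made-up values; in tfjs, the flat arrays produced here would typically be handed to `tf.tensor2d(values, [numSamples, numFeatures])`.

```javascript
// Raw rows: [age, city (categorical), label]
const raw = [
  [25, 'berlin', 0],
  [40, 'tokyo', 1],
  [31, 'berlin', 1],
];

// 1. Numeric conversion: map the categorical column to an integer index.
const cityIndex = { berlin: 0, tokyo: 1 };
const numeric = raw.map(([age, city, label]) => [age, cityIndex[city], label]);

// 2. Tensor shaping: split into features [numSamples, numFeatures] and labels.
const features = numeric.map((row) => row.slice(0, 2));
const labels = numeric.map((row) => row[2]);

// 3. Normalization: min-max scale the age column to [0, 1].
const ages = features.map((f) => f[0]);
const min = Math.min(...ages);
const max = Math.max(...ages);
const scaled = features.map(([age, city]) => [(age - min) / (max - min), city]);

// 4. Batching: group samples into batches of 2.
const batchSize = 2;
const batches = [];
for (let i = 0; i < scaled.length; i += batchSize) {
  batches.push(scaled.slice(i, i + batchSize));
}

console.log(scaled[0]); // youngest sample scales to age 0
console.log(batches.length);
```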
Usage
Data preparation is required before every training run. The specific transformations depend on:
- The data modality (tabular, image, text, time series).
- The model architecture (input shape requirements).
- The data volume (in-memory tensors vs. streaming datasets).
In-Memory vs. Streaming
| Approach | When to Use | Advantages | Disadvantages |
|---|---|---|---|
| In-memory tensors | Dataset fits comfortably in RAM/GPU memory | Simple API; fast access; no I/O overhead | Memory limited; not suitable for large datasets |
| Streaming dataset | Dataset is too large for memory, or data is generated dynamically | Constant memory usage; supports infinite streams | More complex API; potential I/O bottlenecks |
Theoretical Basis
Tensor Dimensions and Shapes
The shape of a tensor defines its dimensionality and the size along each axis. Understanding shapes is critical because shape mismatches are among the most common sources of errors in neural network programming.
| Data Type | Typical Shape | Example |
|---|---|---|
| Scalar | [] | A single loss value |
| Vector | [n] | One feature vector with n features |
| Matrix (2D) | [rows, cols] | Batch of feature vectors [numSamples, numFeatures] |
| 3D tensor | [d1, d2, d3] | Batch of sequences [numSamples, timesteps, features] |
| 4D tensor | [d1, d2, d3, d4] | Batch of images [numSamples, height, width, channels] |
The batch dimension is always the first axis. When defining layer input shapes, the batch dimension is omitted (the framework adds it implicitly). So an inputShape of [784] means each individual sample is a 784-element vector, and the actual tensor fed to the model has shape [batchSize, 784].
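To make the batch-axis convention concrete, the sketch below flattens a batch of fake 28×28 grayscale images into the [batchSize, 784] layout a model declared with inputShape: [784] expects. The pixel data is randomly generated for illustration; in tfjs the flat array would go to `tf.tensor2d(flat, [batchSize, 784])`.

```javascript
const height = 28;
const width = 28;
const batchSize = 3;

// Fake pixel data: each image is a height*width array of values in [0, 255].
const images = Array.from({ length: batchSize }, () =>
  Array.from({ length: height * width }, () => Math.floor(Math.random() * 256))
);

// Flatten the batch row-major: the sample index is the slowest-varying
// (first) axis, so sample i occupies offsets [i*784, (i+1)*784).
const flat = new Float32Array(batchSize * height * width);
images.forEach((img, i) => flat.set(img, i * height * width));

// The model receives shape [batchSize, 784]; inputShape omits the batch axis.
const shape = [batchSize, height * width];
console.log(shape); // [3, 784]
```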
Data Types
Tensors have a fixed data type (dtype). Common types:
| dtype | Description | Use Case |
|---|---|---|
| float32 | 32-bit floating point | Default for features, weights, and most computations |
| int32 | 32-bit integer | Integer labels, indices |
| bool | Boolean | Masks, conditions |
Neural network computations are almost always performed in float32. Input data should be converted to float32 unless there is a specific reason to use another type.
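A minimal sketch of the dtype conventions, using JavaScript typed arrays as stand-ins for tensor storage. The values are invented; note that tfjs also lets the dtype be set at tensor creation, e.g. `tf.tensor1d(rawLabels, 'int32')`.

```javascript
// Features become float32 (here also rescaled), labels stay integer.
const rawFeatures = [3, 7, 255];  // e.g. pixel or sensor readings
const rawLabels = [0, 2, 1];      // class indices

const features = Float32Array.from(rawFeatures, (v) => v / 255); // float32
const labels = Int32Array.from(rawLabels);                       // int32

console.log(features[2]); // 1 (255 / 255)
```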
Normalization Strategies
| Strategy | Formula | Output Range | When to Use |
|---|---|---|---|
| Min-max scaling | (x - min) / (max - min) | [0, 1] | Features with known bounds |
| Z-score (standard) | (x - mean) / std | ~ [-3, 3] | Features with Gaussian distribution |
| Division by max | x / max_value | [0, 1] or [-1, 1] | Pixel values (divide by 255) |
Normalization should be computed on the training set only. The same statistics (min, max, mean, std) must then be applied to validation and test data to prevent data leakage.
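The train-only rule can be sketched as follows: the mean and standard deviation are computed once from the training split, then the same statistics are reused for the test split. The numbers are arbitrary illustration values.

```javascript
const train = [10, 12, 14, 16, 18];
const test = [11, 20];

// Statistics come from the TRAINING split only.
const mean = train.reduce((s, x) => s + x, 0) / train.length;
const variance =
  train.reduce((s, x) => s + (x - mean) ** 2, 0) / train.length;
const std = Math.sqrt(variance);

const normalize = (xs) => xs.map((x) => (x - mean) / std);

const trainNorm = normalize(train); // mean ~0, std ~1 by construction
const testNorm = normalize(test);   // reuses train mean/std: no leakage
```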
Labels and One-Hot Encoding
For classification tasks, labels must match the model's output format:
- Binary classification: Labels are scalars (0 or 1), output activation is sigmoid.
- Multi-class classification with sparse labels: Labels are integers (0, 1, ..., C-1), loss is sparse categorical crossentropy.
- Multi-class classification with one-hot labels: Labels are one-hot vectors (e.g., [0, 0, 1, 0] for class 2 out of 4), loss is categorical crossentropy.
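The label formats above can be illustrated with a small one-hot encoder in plain JavaScript. In tfjs the same result comes from `tf.oneHot(tf.tensor1d(labels, 'int32'), numClasses)`.

```javascript
const numClasses = 4;
const labels = [2, 0, 3]; // sparse integer labels

// Each label becomes a vector with a single 1 at its class index.
const oneHot = labels.map((label) =>
  Array.from({ length: numClasses }, (_, i) => (i === label ? 1 : 0))
);

console.log(oneHot[0]); // [0, 0, 1, 0]
```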
Generator-Based Streaming
When data is too large for memory or is generated dynamically (e.g., data augmentation), a generator function yields individual samples on demand. The framework wraps this generator into a Dataset object that supports batching, shuffling, prefetching, and other pipeline operations. The generator pattern provides:
- Constant memory usage — Only one batch is in memory at a time.
- Lazy evaluation — Data is produced only when consumed.
- Composability — Multiple transformations (map, filter, batch, shuffle) can be chained.
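The generator pattern can be sketched without the library: a `function*` yields one synthetic sample at a time, and a batching wrapper consumes it lazily, so only the current batch is materialized. In tfjs the equivalent pipeline would be `tf.data.generator(gen).batch(batchSize)`.

```javascript
// Yields one synthetic {xs, ys} sample at a time; nothing is precomputed.
function* sampleGenerator(numSamples) {
  for (let i = 0; i < numSamples; i++) {
    yield { xs: [i, i * 2], ys: i % 2 };
  }
}

// Groups samples into arrays of batchSize, emitting a final partial batch.
function* batched(gen, batchSize) {
  let batch = [];
  for (const sample of gen) {
    batch.push(sample);
    if (batch.length === batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch;
}

const batches = [...batched(sampleGenerator(5), 2)];
console.log(batches.length); // 3: two full batches plus one partial
```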