
Principle:Tensorflow Tfjs Training Data Preparation

From Leeroopedia


Knowledge Sources
Domains Data_Preprocessing, Deep_Learning
Last Updated 2026-02-10 00:00 GMT

Overview

Training data preparation is the process of converting raw data into properly shaped, typed, and normalized tensor representations that a neural network can consume during training.

Description

Neural networks operate exclusively on multi-dimensional numerical arrays (tensors). Raw data — whether it originates from CSV files, databases, user input, sensor readings, or images — must undergo a series of transformations before it can be fed into a model. Data preparation is often the most time-consuming and error-prone step in the machine learning pipeline, and mistakes here propagate silently into poor model performance.

The data preparation pipeline involves four core stages:

  1. Numeric conversion — All data must be represented as numbers. Categorical variables are one-hot encoded or mapped to integer indices. Text is tokenized and converted to numerical sequences. Images are decomposed into pixel values.
  2. Tensor shaping — Data must be organized into tensors with the correct dimensionality. For tabular data, this is typically a 2D tensor with shape [numSamples, numFeatures]. For images, it is [numSamples, height, width, channels]. For sequences, it is [numSamples, timesteps, features].
  3. Normalization/scaling — Raw feature values often span vastly different ranges (e.g., age 0-100 vs. salary 0-1000000). Normalizing all features to a common scale (typically [0, 1] or mean=0, std=1) prevents features with larger magnitudes from dominating the gradient updates.
  4. Batching and streaming — For datasets that exceed available memory, data must be streamed in batches via a dataset pipeline rather than loaded all at once.
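The four stages above can be sketched end-to-end without any framework. The snippet below runs the pipeline on a tiny hypothetical tabular dataset; all helper names (`toBatches`, `cityIndex`, etc.) are illustrative, not part of any library API.

```javascript
// Raw rows: [age, city (categorical), salary]
const rawRows = [
  [25, 'paris', 30000],
  [40, 'tokyo', 90000],
  [31, 'paris', 55000],
];

// Stage 1: numeric conversion -- map the categorical city to an integer index.
const cityIndex = { paris: 0, tokyo: 1 };
const numeric = rawRows.map(([age, city, salary]) => [age, cityIndex[city], salary]);

// Stage 2: tensor shaping -- the rows already form [numSamples, numFeatures] = [3, 3].
const shape = [numeric.length, numeric[0].length];

// Stage 3: normalization -- min-max scale each feature column to [0, 1].
const numFeatures = shape[1];
const mins = Array(numFeatures).fill(Infinity);
const maxs = Array(numFeatures).fill(-Infinity);
for (const row of numeric) {
  row.forEach((v, j) => {
    mins[j] = Math.min(mins[j], v);
    maxs[j] = Math.max(maxs[j], v);
  });
}
const normalized = numeric.map(row =>
  row.map((v, j) => (maxs[j] === mins[j] ? 0 : (v - mins[j]) / (maxs[j] - mins[j])))
);

// Stage 4: batching -- split rows into fixed-size batches.
function toBatches(rows, batchSize) {
  const batches = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    batches.push(rows.slice(i, i + batchSize));
  }
  return batches;
}
const preparedBatches = toBatches(normalized, 2);
```

In a real tfjs pipeline the `normalized` rows would then be wrapped in a 2D tensor, but the shape bookkeeping is the same.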

Usage

Data preparation is required before every training run. The specific transformations depend on:

  • The data modality (tabular, image, text, time series).
  • The model architecture (input shape requirements).
  • The data volume (in-memory tensors vs. streaming datasets).

In-Memory vs. Streaming

| Approach | When to Use | Advantages | Disadvantages |
|---|---|---|---|
| In-memory tensors | Dataset fits comfortably in RAM/GPU memory | Simple API; fast access; no I/O overhead | Memory limited; not suitable for large datasets |
| Streaming dataset | Dataset is too large for memory, or data is generated dynamically | Constant memory usage; supports infinite streams | More complex API; potential I/O bottlenecks |

Theoretical Basis

Tensor Dimensions and Shapes

The shape of a tensor defines its dimensionality and the size along each axis. Understanding shapes is critical because shape mismatches are the most common source of errors in neural network programming.

| Data Type | Typical Shape | Example |
|---|---|---|
| Scalar | [] | A single loss value |
| Vector | [n] | One feature vector with n features |
| Matrix (2D) | [rows, cols] | Batch of feature vectors [numSamples, numFeatures] |
| 3D tensor | [d1, d2, d3] | Batch of sequences [numSamples, timesteps, features] |
| 4D tensor | [d1, d2, d3, d4] | Batch of images [numSamples, height, width, channels] |

The batch dimension is always the first axis. When defining layer input shapes, the batch dimension is omitted (the framework adds it implicitly). So an inputShape of [784] means each individual sample is a 784-element vector, and the actual tensor fed to the model has shape [batchSize, 784].
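This relationship between a single sample's shape and the batched tensor's shape can be made concrete with a small helper that infers the shape of a nested array. The `inferShape` function below is a hypothetical illustration, not a framework API (tfjs performs equivalent inference internally when you pass nested arrays to tensor constructors).

```javascript
// Infer a nested array's shape by walking down its first elements,
// assuming the nesting is rectangular (all rows the same length).
function inferShape(arr) {
  const shape = [];
  let cur = arr;
  while (Array.isArray(cur)) {
    shape.push(cur.length);
    cur = cur[0];
  }
  return shape;
}

// One sample: a 4-element feature vector, so inputShape would be [4].
const sample = [0.1, 0.2, 0.3, 0.4];

// A batch of 3 such samples: the tensor fed to the model has shape [3, 4],
// with the batch dimension prepended as the first axis.
const batch = [sample, sample, sample];
```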

Data Types

Tensors have a fixed data type (dtype). Common types:

| dtype | Description | Use Case |
|---|---|---|
| float32 | 32-bit floating point | Default for features, weights, and most computations |
| int32 | 32-bit integer | Integer labels, indices |
| bool | Boolean | Masks, conditions |

Neural network computations are almost always performed in float32. Input data should be converted to float32 unless there is a specific reason to use another type.
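As a quick illustration of why dtype matters: JavaScript numbers are 64-bit floats, but `Float32Array` provides the 32-bit storage that backs float32 tensors, so values are rounded to float32 precision the moment they are stored.

```javascript
// Storing values in a Float32Array rounds them to float32 precision.
const f32 = Float32Array.from([0.1, 1, 255]);

// 0.1 has no exact float32 representation; reading it back yields
// a nearby value (~0.100000001), not the float64 literal 0.1.
const roundTripped = f32[0];

// Pixel-style scaling stays in float32 (map on a typed array
// returns another typed array of the same kind).
const scaled = f32.map(v => v / 255);
```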

Normalization Strategies

| Strategy | Formula | Output Range | When to Use |
|---|---|---|---|
| Min-max scaling | (x - min) / (max - min) | [0, 1] | Features with known bounds |
| Z-score (standard) | (x - mean) / std | ~ [-3, 3] | Features with Gaussian distribution |
| Division by max | x / max_value | [0, 1] or [-1, 1] | Pixel values (divide by 255) |

Normalization should be computed on the training set only. The same statistics (min, max, mean, std) must then be applied to validation and test data to prevent data leakage.
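The fit-on-train, apply-everywhere rule can be sketched as a small closure. `fitStandardizer` is an illustrative helper name; the key point is that the returned function captures the training mean and std, so the same statistics are reused on validation and test data.

```javascript
// Fit z-score statistics on the training split ONLY, and return a
// function that applies those frozen statistics to any value.
function fitStandardizer(trainValues) {
  const mean = trainValues.reduce((a, b) => a + b, 0) / trainValues.length;
  const variance =
    trainValues.reduce((a, v) => a + (v - mean) ** 2, 0) / trainValues.length;
  const std = Math.sqrt(variance) || 1; // guard against constant features
  return x => (x - mean) / std;
}

const trainFeature = [2, 4, 6, 8];          // mean = 5, std = sqrt(5)
const standardize = fitStandardizer(trainFeature);

// Test data is transformed with the TRAINING statistics -- never refit.
const testFeature = [10];
const testStandardized = testFeature.map(standardize);
```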

Labels and One-Hot Encoding

For classification tasks, labels must match the model's output format:

  • Binary classification: Labels are scalars (0 or 1), output activation is sigmoid.
  • Multi-class classification with sparse labels: Labels are integers (0, 1, ..., C-1), loss is sparse categorical crossentropy.
  • Multi-class classification with one-hot labels: Labels are one-hot vectors (e.g., [0, 0, 1, 0] for class 2 out of 4), loss is categorical crossentropy.
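A minimal one-hot encoder for integer class labels can be written in a few lines. The `oneHot` helper below is illustrative (tfjs ships its own one-hot operation); it reproduces the [0, 0, 1, 0] example for class 2 of 4.

```javascript
// Convert a sparse integer label into a one-hot vector of length numClasses.
function oneHot(label, numClasses) {
  const vec = Array(numClasses).fill(0);
  vec[label] = 1;
  return vec;
}

// Sparse labels for a 4-class problem, converted to one-hot form:
const sparseLabels = [2, 0, 3];
const oneHotLabels = sparseLabels.map(l => oneHot(l, 4));
```

With sparse categorical crossentropy this conversion is unnecessary, since the loss accepts the integer labels directly.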

Generator-Based Streaming

When data is too large for memory or is generated dynamically (e.g., data augmentation), a generator function yields individual samples on demand. The framework wraps this generator into a Dataset object that supports batching, shuffling, prefetching, and other pipeline operations. The generator pattern provides:

  • Constant memory usage — Only one batch is in memory at a time.
  • Lazy evaluation — Data is produced only when consumed.
  • Composability — Multiple transformations (map, filter, batch, shuffle) can be chained.
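The generator pattern can be sketched with plain JavaScript generator functions, before any framework wrapping. The names below (`sampleGenerator`, `batched`) are hypothetical; a real tfjs pipeline would hand the sample generator to the framework's dataset constructor instead of batching by hand.

```javascript
// Lazily yield synthetic {xs, ys} samples one at a time -- nothing is
// materialized until a consumer asks for it.
function* sampleGenerator(numSamples) {
  for (let i = 0; i < numSamples; i++) {
    yield { xs: [i, i * 2], ys: i % 2 };
  }
}

// Compose a batching transformation over any sample generator; only the
// current batch is held in memory.
function* batched(gen, batchSize) {
  let batch = [];
  for (const sample of gen) {
    batch.push(sample);
    if (batch.length === batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}

// 5 samples with batch size 2 produce batches of sizes 2, 2, 1.
const streamedBatches = [...batched(sampleGenerator(5), 2)];
```

Further transformations (shuffle, map, prefetch) compose the same way: each is another generator that consumes the previous one lazily.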

Related Pages

Implemented By
