Principle:Microsoft DeepSpeedExamples Baseline PyTorch Training

Metadata

Field	Value
Page Type	Principle
Repository	Microsoft/DeepSpeedExamples
Title	Baseline_PyTorch_Training
Sources	Doc: PyTorch Tutorial: Training a Classifier
Domains	Deep_Learning, Computer_Vision, Training
Related Implementation	Implementation:Microsoft_DeepSpeedExamples_Net_Tutorial

Overview

A standard PyTorch training pattern using manual optimizer management, data loading, and evaluation for image classification.

Description

The baseline PyTorch pattern for training a CIFAR-10 image classifier involves a five-step workflow that serves as the canonical reference implementation before any DeepSpeed migration. This pattern establishes the fundamental building blocks:

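Data Preparation -- Load the CIFAR-10 dataset using torchvision.datasets.CIFAR10 and apply two transforms: transforms.ToTensor() converts each PIL image (8-bit pixel values, 0 to 255) into a float tensor in the range [0, 1], and transforms.Normalize with a per-channel mean and standard deviation of 0.5 rescales it to the range [-1, 1]. A torch.utils.data.DataLoader handles batching, shuffling, and multi-process data loading.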
Model Definition -- Define a simple Convolutional Neural Network (Net) with two convolutional layers followed by three fully connected layers. The architecture maps 3x32x32 CIFAR-10 input images through successive feature extraction and classification stages.
Loss and Optimizer Setup -- Use nn.CrossEntropyLoss for multi-class classification and optim.SGD with learning rate 0.001 and momentum 0.9. These are created as standalone objects and managed manually.
Training Loop -- Iterate over epochs and mini-batches with explicit calls to optimizer.zero_grad(), loss.backward(), and optimizer.step(). This three-call pattern (zero gradients, backward pass, optimizer step) is the fundamental PyTorch training contract.
Evaluation -- Run inference on the test set under torch.no_grad() to compute overall and per-class accuracy.

This explicit control over every training component is the hallmark of standard PyTorch training. Each component (model, loss function, optimizer, data loader) is a separate object that the developer must coordinate manually; this baseline uses no learning-rate scheduler.
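
The value ranges in the data-preparation step can be checked by hand. The sketch below reproduces the arithmetic in plain Python: dividing an 8-bit value by 255 (what ToTensor does to each channel value) and then applying (value - mean) / std with mean 0.5 and std 0.5 (what Normalize does). The helper names are illustrative and are not torchvision APIs.

```python
def to_unit_range(pixel: int) -> float:
    """Mimic ToTensor scaling for one 8-bit channel value: 0..255 -> 0.0..1.0."""
    return pixel / 255.0


def normalize(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Mimic Normalize for one channel value: (value - mean) / std."""
    return (value - mean) / std


# The darkest and brightest pixel values land on the ends of [-1, 1].
for pixel in (0, 255):
    print(pixel, "->", normalize(to_unit_range(pixel)))
```

With a mean and standard deviation of 0.5, the midpoint 0.5 of the [0, 1] range maps to 0, so the inputs are centered around zero.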

Theoretical Basis

Cross-Entropy Loss for Multi-Class Classification

The standard loss function for multi-class classification is the cross-entropy loss:

L = -sum_{i=1}^{C} y_i * log(p_i)

where:

C is the number of classes (10 for CIFAR-10)
y_i is the ground-truth indicator for class i (one-hot encoded: 1 for the true class, 0 otherwise)
p_i is the predicted probability for class i (output of softmax)

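Because y is one-hot, the sum reduces to -log(p_c) for the true class c. In PyTorch, nn.CrossEntropyLoss combines nn.LogSoftmax and nn.NLLLoss into a single operation: it accepts raw logits and integer class indices as targets, so the model needs no softmax layer and the labels need no explicit one-hot encoding. By default it averages the loss over the mini-batch.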
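
To make the formula concrete, here is a minimal plain-Python sketch of the per-sample computation (log-softmax followed by picking the true class). It illustrates the math only and is not PyTorch's implementation; the function name is hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# An untrained 10-class model that outputs equal logits scores ln(10).
uniform_logits = [0.0] * 10
print(cross_entropy(uniform_logits, target=3))

# A confident, correct prediction drives the loss toward zero.
confident_logits = [0.0] * 10
confident_logits[3] = 10.0
print(cross_entropy(confident_logits, target=3))
```

The first value, ln(10) (about 2.30), is a useful sanity check: a CIFAR-10 training loss near that level means the model is still guessing uniformly.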

SGD with Momentum

The optimizer uses Stochastic Gradient Descent with momentum:

v_t = momentum * v_{t-1} + gradient
parameters = parameters - lr * v_t

where:

lr = 0.001 (learning rate)
momentum = 0.9

Momentum accelerates convergence by accumulating a velocity vector along directions in which successive gradients agree, while damping oscillations along directions in which the gradient keeps changing sign.
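
The two update equations can be traced on a single scalar parameter. The sketch below is a plain-Python illustration under a simplifying assumption (a constant gradient of 1.0), not the optim.SGD implementation; the function name is hypothetical.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One update: v_t = momentum * v_{t-1} + grad; param = param - lr * v_t."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


# With a constant gradient, the velocity grows toward grad / (1 - momentum),
# so the effective step becomes up to 10x larger than plain SGD at momentum 0.9.
param, velocity = 0.0, 0.0
for step in range(3):
    param, velocity = sgd_momentum_step(param, grad=1.0, velocity=velocity)
    print(step, round(velocity, 4), round(param, 6))
```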

Training Loop Contract

The standard PyTorch training loop follows a strict three-step contract per mini-batch:

optimizer.zero_grad()   # Clear accumulated gradients from previous step
loss.backward()         # Compute gradients via backpropagation
optimizer.step()        # Update model parameters using computed gradients

This explicit pattern gives the developer full control but requires manual coordination. DeepSpeed migration replaces this with model_engine.backward(loss) and model_engine.step(), absorbing zero_grad() internally.
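
The reason the contract starts with zero_grad() is that backward() adds to any gradient already stored on a parameter. A small sketch of that behavior, using a single scalar parameter w and the loss w * w chosen only for illustration:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.001, momentum=0.9)

(w * w).backward()
first = w.grad.item()        # d(w^2)/dw = 2w = 4.0

(w * w).backward()           # no zero_grad(): the new gradient is added
accumulated = w.grad.item()  # 4.0 + 4.0 = 8.0

optimizer.zero_grad()        # reset before the next backward pass
(w * w).backward()
after_reset = w.grad.item()  # 4.0 again

print(first, accumulated, after_reset)
```

Skipping zero_grad() in the training loop would therefore apply the sum of all past mini-batch gradients at every step.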

Architecture

The baseline CNN architecture for CIFAR-10 is structured as follows:

Layer	Type	Input Shape	Output Shape	Parameters
conv1	Conv2d(3, 6, 5)	(B, 3, 32, 32)	(B, 6, 28, 28)	456
pool	MaxPool2d(2, 2)	(B, 6, 28, 28)	(B, 6, 14, 14)	0
conv2	Conv2d(6, 16, 5)	(B, 6, 14, 14)	(B, 16, 10, 10)	2,416
pool	MaxPool2d(2, 2)	(B, 16, 10, 10)	(B, 16, 5, 5)	0
fc1	Linear(400, 120)	(B, 400)	(B, 120)	48,120
fc2	Linear(120, 84)	(B, 120)	(B, 84)	10,164
fc3	Linear(84, 10)	(B, 84)	(B, 10)	850
Total				62,006
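
The shapes and parameter counts in the table follow from two formulas: an unpadded convolution or pooling layer produces (size - kernel) // stride + 1 outputs per spatial dimension, and each layer holds one weight per input-output connection plus one bias per output. A short plain-Python check of the table (the helper names are illustrative):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(in_ch, out_ch, kernel):
    """Weights (in_ch * out_ch * k * k) plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch


def linear_params(in_features, out_features):
    """Weights (in * out) plus one bias per output feature."""
    return in_features * out_features + out_features


size = 32
size = conv_out(size, kernel=5)            # conv1: 32 -> 28
size = conv_out(size, kernel=2, stride=2)  # pool:  28 -> 14
size = conv_out(size, kernel=5)            # conv2: 14 -> 10
size = conv_out(size, kernel=2, stride=2)  # pool:  10 -> 5
flat_features = 16 * size * size           # 16 channels * 5 * 5 = 400

params = {
    "conv1": conv_params(3, 6, 5),
    "conv2": conv_params(6, 16, 5),
    "fc1": linear_params(flat_features, 120),
    "fc2": linear_params(120, 84),
    "fc3": linear_params(84, 10),
}
total = sum(params.values())
print(params, total)
```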

Key Hyperparameters

Parameter	Value	Purpose
batch_size	4	Mini-batch size for DataLoader
epochs	2	Number of training passes over the dataset
learning_rate	0.001	SGD learning rate
momentum	0.9	SGD momentum coefficient
num_workers	2	Parallel data loading processes

Code Pattern

The complete baseline training pattern:

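import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Step 1: Data preparation
transform = transforms.Compose([
    transforms.ToTensor(),  # PIL image -> float tensor in [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # [0, 1] -> [-1, 1]
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)

# Step 2: Model definition (layers as listed in the Architecture table;
# ReLU activations as in the PyTorch tutorial)
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten everything except the batch dimension
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)       # raw logits; CrossEntropyLoss applies log-softmax

net = Net()
net.to(device)

# Step 3: Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Step 4: Training loop
for epoch in range(2):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 2000 == 1999:  # report the mean loss every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

# Step 5: Evaluation
correct, total = 0, 0
with torch.no_grad():
    for data in testloader:
        # Move the batch to the same device as the model
        images, labels = data[0].to(device), data[1].to(device)
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)  # index of the highest logit
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Test accuracy: {100 * correct / total:.1f} %')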
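
The description mentions per-class accuracy, which the evaluation step above does not compute. A minimal sketch of the extra bookkeeping, with plain Python lists standing in for the predicted and labels tensors; the function name and the toy inputs are illustrative only.

```python
from collections import Counter


def per_class_accuracy(predicted, labels, num_classes=10):
    """Return {class_index: accuracy} for every class that appears in labels."""
    correct = Counter()
    total = Counter()
    for p, y in zip(predicted, labels):
        total[y] += 1
        if p == y:
            correct[y] += 1
    return {c: correct[c] / total[c] for c in range(num_classes) if total[c] > 0}


# Toy example with three classes: class 2 is right once out of two samples.
print(per_class_accuracy([0, 1, 1, 2], [0, 1, 2, 2], num_classes=3))
```

In the real loop, the same counters would be updated from predicted.tolist() and labels.tolist() inside the torch.no_grad() block.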

Migration Points

When migrating from this baseline to DeepSpeed, the following changes are required:

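Optimizer creation is replaced by deepspeed.initialize(), which builds the optimizer internally when one is specified in the DeepSpeed configuration
DataLoader creation is replaced by the distributed data loader that deepspeed.initialize() returns when the training dataset is passed through its training_data argument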
optimizer.zero_grad() is removed (handled internally by the engine)
loss.backward() becomes model_engine.backward(loss)
optimizer.step() becomes model_engine.step()
Device placement is managed by the DeepSpeed engine via local_rank
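
Put together, the migrated loop might look like the sketch below. It is an illustration under stated assumptions, not the repository's implementation: the ds_config values are hypothetical and simply mirror the baseline hyperparameters, the "SGD" optimizer type is assumed to be resolved from torch.optim, the function name train_with_deepspeed is invented for this example, and the script is assumed to be started with the DeepSpeed launcher on at most four processes so that a global batch size of 4 is valid.

```python
# Hypothetical DeepSpeed configuration mirroring the baseline hyperparameters.
ds_config = {
    "train_batch_size": 4,
    "optimizer": {"type": "SGD", "params": {"lr": 0.001, "momentum": 0.9}},
}


def train_with_deepspeed(net, trainset, criterion, epochs=2):
    """Sketch of the training loop after migration (illustrative only)."""
    import deepspeed  # imported lazily so the sketch loads without DeepSpeed

    # initialize() returns the engine, the optimizer, a distributed data
    # loader built from training_data, and a learning-rate scheduler.
    model_engine, optimizer, trainloader, _ = deepspeed.initialize(
        model=net,
        model_parameters=net.parameters(),
        training_data=trainset,
        config=ds_config,
    )

    for epoch in range(epochs):
        for inputs, labels in trainloader:
            # Device placement follows the engine's local rank.
            inputs = inputs.to(model_engine.local_rank)
            labels = labels.to(model_engine.local_rank)
            outputs = model_engine(inputs)
            loss = criterion(outputs, labels)
            model_engine.backward(loss)  # replaces loss.backward()
            model_engine.step()          # replaces optimizer.step(); no zero_grad()
    return model_engine
```

Compared with the baseline, the loop body loses one call and the setup code loses the explicit optimizer and DataLoader, which is the whole of the migration at this level.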

Related Pages

Implementation:Microsoft_DeepSpeedExamples_Net_Tutorial -- Reference PyTorch CNN implementation for CIFAR-10
Principle:Microsoft_DeepSpeedExamples_DeepSpeed_Engine_Init -- DeepSpeed engine initialization that replaces manual setup
Principle:Microsoft_DeepSpeedExamples_DeepSpeed_CLI_Integration -- CLI argument integration for DeepSpeed
Principle:Microsoft_DeepSpeedExamples_Classification_Evaluation -- Evaluation methodology used in both baseline and DeepSpeed versions
