Principle:Microsoft DeepSpeedExamples Baseline PyTorch Training

From Leeroopedia


Metadata

Field | Value
Page Type | Principle
Repository | Microsoft/DeepSpeedExamples
Title | Baseline_PyTorch_Training
Sources | Doc: PyTorch Tutorial: Training a Classifier
Domains | Deep_Learning, Computer_Vision, Training
Related Implementation | Implementation:Microsoft_DeepSpeedExamples_Net_Tutorial

Overview

A standard PyTorch training pattern using manual optimizer management, data loading, and evaluation for image classification.

Description

The baseline PyTorch pattern for training a CIFAR-10 image classifier involves a five-step workflow that serves as the canonical reference implementation before any DeepSpeed migration. This pattern establishes the fundamental building blocks:

  1. Data Preparation -- Load the CIFAR-10 dataset using torchvision.datasets.CIFAR10 and apply transforms that convert PIL images into tensors: ToTensor scales pixel values to [0, 1], and Normalize with mean 0.5 and standard deviation 0.5 per channel maps them to [-1, 1]. A torch.utils.data.DataLoader handles batching, shuffling, and multi-process data loading.
  2. Model Definition -- Define a simple Convolutional Neural Network (Net) with two convolutional layers followed by three fully connected layers. The architecture maps 3x32x32 CIFAR-10 input images through successive feature extraction and classification stages.
  3. Loss and Optimizer Setup -- Use nn.CrossEntropyLoss for multi-class classification and optim.SGD with learning rate 0.001 and momentum 0.9. These are created as standalone objects and managed manually.
  4. Training Loop -- Iterate over epochs and mini-batches with explicit calls to optimizer.zero_grad(), loss.backward(), and optimizer.step(). This three-call pattern (zero gradients, backward pass, optimizer step) is the fundamental PyTorch training contract.
  5. Evaluation -- Run inference on the test set under torch.no_grad() to compute overall and per-class accuracy.

This explicit control over every training component is the hallmark of standard PyTorch training. Each component (model, loss function, optimizer, data loader) is a separate object that the developer must coordinate manually. The baseline uses no learning-rate scheduler; if one were added, it would be one more object to manage.
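The range mapping in step 1 follows from Normalize computing (x - mean) / std per channel. The sketch below reproduces that arithmetic in plain Python for a single pixel value; the helper names to_unit_range and normalize are hypothetical and stand in for what ToTensor and Normalize do to each element.

```python
def to_unit_range(byte_value):
    """Scale an 8-bit pixel value (0..255) to [0, 1], as ToTensor does."""
    return byte_value / 255.0


def normalize(value, mean=0.5, std=0.5):
    """Apply (x - mean) / std, the per-channel formula used by Normalize."""
    return (value - mean) / std


# Darkest and brightest 8-bit pixel values after both transforms.
darkest = normalize(to_unit_range(0))
brightest = normalize(to_unit_range(255))
print(darkest, brightest)
```

With mean 0.5 and std 0.5, the endpoints 0 and 1 land on -1 and 1, and the midpoint 0.5 lands on 0.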

Theoretical Basis

Cross-Entropy Loss for Multi-Class Classification

The standard loss function for multi-class classification is the cross-entropy loss:

L = -sum_{i=1}^{C} y_i * log(p_i)

where:

  • C is the number of classes (10 for CIFAR-10)
  • y_i is the ground truth (one-hot encoded)
  • p_i is the predicted probability for class i (output of softmax)

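In PyTorch, nn.CrossEntropyLoss combines nn.LogSoftmax and nn.NLLLoss into a single operation. It accepts raw logits directly and, in the usage shown here, integer class indices as targets rather than one-hot vectors, so for a sample whose true class is c the sum reduces to -log(p_c). By default the loss is averaged over the mini-batch.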
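As a rough illustration of the formula, the following plain-Python sketch computes the cross-entropy of one sample from raw logits by taking a log-softmax and then picking the entry for the true class. The function names are hypothetical, and this is not how PyTorch implements the loss internally; it only mirrors the arithmetic for a single sample.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target):
    """Loss for one sample: -log(p_target), with target as a class index."""
    return -log_softmax(logits)[target]


# Ten equal logits give a uniform prediction over the 10 CIFAR-10 classes.
uniform_loss = cross_entropy([0.0] * 10, target=3)

# Raising the logit of the true class lowers the loss.
confident_loss = cross_entropy([0.0] * 3 + [5.0] + [0.0] * 6, target=3)
print(uniform_loss, confident_loss)
```

A uniform prediction over C classes yields a loss of log(C), which is a useful sanity check for the first few iterations of an untrained classifier.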

SGD with Momentum

The optimizer uses Stochastic Gradient Descent with momentum:

v_t = momentum * v_{t-1} + gradient
parameters = parameters - lr * v_t

where:

  • lr = 0.001 (learning rate)
  • momentum = 0.9

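Momentum accelerates convergence by accumulating a velocity vector: gradient components that keep pointing in the same direction across steps add up, while components that flip sign from step to step partly cancel, which dampens oscillations.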
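The two update equations can be traced by hand on a one-dimensional problem. The sketch below applies them to f(w) = w^2, whose gradient is 2w, using the baseline's lr and momentum values; the function name sgd_momentum_step and the toy objective are assumptions chosen for illustration.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One update: v_t = momentum * v_{t-1} + grad, then param -= lr * v_t."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


# Minimize f(w) = w**2 starting from w = 1 with zero initial velocity.
w, v = 1.0, 0.0
history = []
for _ in range(3):
    grad = 2.0 * w
    w, v = sgd_momentum_step(w, grad, v)
    history.append((w, v))

for step, (w_t, v_t) in enumerate(history, start=1):
    print(step, w_t, v_t)
```

Because the gradient keeps the same sign here, the velocity grows from step to step, so each update moves the parameter further than plain SGD with the same learning rate would.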

Training Loop Contract

The standard PyTorch training loop follows a strict three-step contract per mini-batch:

optimizer.zero_grad()   # Clear accumulated gradients from previous step
loss.backward()         # Compute gradients via backpropagation
optimizer.step()        # Update model parameters using computed gradients

This explicit pattern gives the developer full control but requires manual coordination. DeepSpeed migration replaces this with model_engine.backward(loss) and model_engine.step(), absorbing zero_grad() internally.
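The reason zero_grad() belongs in the contract is that backward() adds to any gradient already stored on a parameter. The small sketch below, which uses a two-element tensor rather than the CIFAR-10 model, shows the gradient doubling when the clearing step is skipped; the tensor w and the helper loss_fn are illustrative only.

```python
import torch

# A single trainable tensor standing in for the model parameters.
w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.001, momentum=0.9)


def loss_fn():
    return (w ** 2).sum()  # gradient with respect to w is 2 * w


loss_fn().backward()
first_grad = w.grad.clone()

# Second backward pass without zero_grad(): gradients are summed.
loss_fn().backward()
accumulated_grad = w.grad.clone()

# Clearing first restores the expected single-step gradient.
optimizer.zero_grad()
loss_fn().backward()
fresh_grad = w.grad.clone()

print(first_grad, accumulated_grad, fresh_grad)
```

No optimizer.step() is called here, so w stays fixed and the three gradients are directly comparable.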

Architecture

The baseline CNN architecture for CIFAR-10 is structured as follows:

Layer | Type | Input Shape | Output Shape | Parameters
conv1 | Conv2d(3, 6, 5) | (B, 3, 32, 32) | (B, 6, 28, 28) | 456
pool | MaxPool2d(2, 2) | (B, 6, 28, 28) | (B, 6, 14, 14) | 0
conv2 | Conv2d(6, 16, 5) | (B, 6, 14, 14) | (B, 16, 10, 10) | 2,416
pool | MaxPool2d(2, 2) | (B, 16, 10, 10) | (B, 16, 5, 5) | 0
fc1 | Linear(400, 120) | (B, 400) | (B, 120) | 48,120
fc2 | Linear(120, 84) | (B, 120) | (B, 84) | 10,164
fc3 | Linear(84, 10) | (B, 84) | (B, 10) | 850
Total | | | | 62,006
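A module matching the table could look like the sketch below. The layer sizes come from the table; the ReLU activations and the flatten step between the second pooling layer and fc1 follow the PyTorch tutorial and are assumptions here, since the table lists only the convolution, pooling, and linear layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)   # reused after both convolutions
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # (B, 6, 14, 14)
        x = self.pool(F.relu(self.conv2(x)))   # (B, 16, 5, 5)
        x = torch.flatten(x, 1)                # (B, 400)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                     # raw logits, (B, 10)


net = Net()
num_params = sum(p.numel() for p in net.parameters())
out_shape = tuple(net(torch.zeros(2, 3, 32, 32)).shape)
print(num_params, out_shape)
```

Counting the parameters and pushing a dummy batch through the network is a quick way to check a hand-written module against the table.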

Key Hyperparameters

Parameter | Value | Purpose
batch_size | 4 | Mini-batch size for DataLoader
epochs | 2 | Number of training passes over the dataset
learning_rate | 0.001 | SGD learning rate
momentum | 0.9 | SGD momentum coefficient
num_workers | 2 | Parallel data loading processes
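One way to keep these values together is a small configuration object. The sketch below is only an illustration: the BaselineConfig class and its steps_per_epoch helper are hypothetical, and the training-set size of 50,000 images is the standard CIFAR-10 split rather than something stated in the table.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineConfig:
    batch_size: int = 4
    epochs: int = 2
    learning_rate: float = 0.001
    momentum: float = 0.9
    num_workers: int = 2

    def steps_per_epoch(self, num_samples):
        """Mini-batches per epoch when the last partial batch is kept."""
        return math.ceil(num_samples / self.batch_size)


CIFAR10_TRAIN_SIZE = 50_000  # standard CIFAR-10 training split

cfg = BaselineConfig()
steps = cfg.steps_per_epoch(CIFAR10_TRAIN_SIZE)
total_updates = steps * cfg.epochs
print(steps, total_updates)
```

With a batch size of 4, each epoch performs many small optimizer updates, which is part of why the baseline needs only two epochs to show a clear drop in loss.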

Code Pattern

The complete baseline training pattern:

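import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Use the GPU when one is available. Net is the CNN described under Architecture.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Step 1: Data preparation
transform = transforms.Compose([
    transforms.ToTensor(),                                   # PIL image -> float tensor in [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))   # per-channel mapping to [-1, 1]
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)

# Step 2: Model definition
net = Net()
net.to(device)

# Step 3: Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Step 4: Training loop
for epoch in range(2):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()              # clear gradients from the previous step
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()                    # compute gradients via backpropagation
        optimizer.step()                   # update model parameters
        running_loss += loss.item()
        if i % 2000 == 1999:               # report the average loss every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

# Step 5: Evaluation
correct, total = 0, 0
with torch.no_grad():
    for data in testloader:
        # Move the batch to the same device as the model
        images, labels = data[0].to(device), data[1].to(device)
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)   # index of the highest logit per sample
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Accuracy on the test set: {100 * correct / total:.1f}%')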
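The evaluation loop above tracks only overall accuracy, while the description also mentions per-class accuracy. A minimal sketch of the per-class bookkeeping in plain Python follows; the function name is hypothetical, and inside the evaluation loop the predicted and labels tensors would first be converted with .tolist().

```python
def per_class_accuracy(predicted, labels, num_classes):
    """Return one accuracy per class, or None for classes with no samples."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for guess, truth in zip(predicted, labels):
        total[truth] += 1
        if guess == truth:
            correct[truth] += 1
    return [c / t if t else None for c, t in zip(correct, total)]


# Toy example with three classes instead of the ten in CIFAR-10.
example_predicted = [0, 1, 1, 2]
example_labels = [0, 1, 2, 2]
accuracies = per_class_accuracy(example_predicted, example_labels, num_classes=3)
print(accuracies)
```

Per-class figures can reveal classes the model confuses even when the overall accuracy looks acceptable.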

Migration Points

When migrating from this baseline to DeepSpeed, the following changes are required:

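  • Optimizer creation moves into deepspeed.initialize(), which builds the optimizer from the DeepSpeed configuration (or wraps one passed to it) and returns it
  • DataLoader creation is replaced by DeepSpeed's distributed data loader, returned from deepspeed.initialize() when the dataset is passed as training_data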
  • optimizer.zero_grad() is removed (handled internally by the engine)
  • loss.backward() becomes model_engine.backward(loss)
  • optimizer.step() becomes model_engine.step()
  • Device placement is managed by the DeepSpeed engine via local_rank
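Put together, the migrated loop might look roughly like the sketch below. It is modeled on the pattern these bullets describe, not copied from the repository: the ds_config dictionary is a hypothetical configuration that mirrors the baseline hyperparameters for a single process, the function name is an assumption, and the script is assumed to be started with the deepspeed launcher. The import sits inside the function so that the sketch can be loaded on a machine without DeepSpeed installed.

```python
# Hypothetical DeepSpeed configuration mirroring the baseline hyperparameters.
# train_batch_size is the global batch size; 4 assumes a single process.
ds_config = {
    "train_batch_size": 4,
    "optimizer": {
        "type": "SGD",
        "params": {"lr": 0.001, "momentum": 0.9},
    },
}


def train_with_deepspeed(net, trainset, criterion, epochs=2):
    """Sketch of the migrated loop; net, trainset and criterion come from the baseline."""
    import deepspeed  # assumed to be installed in the training environment

    # The engine wraps the model; the optimizer and data loader are returned too.
    model_engine, optimizer, trainloader, _ = deepspeed.initialize(
        model=net,
        model_parameters=net.parameters(),
        training_data=trainset,
        config=ds_config,
    )

    for epoch in range(epochs):
        for data in trainloader:
            # Device placement uses the engine's local rank.
            inputs = data[0].to(model_engine.local_rank)
            labels = data[1].to(model_engine.local_rank)
            outputs = model_engine(inputs)
            loss = criterion(outputs, labels)
            model_engine.backward(loss)   # replaces loss.backward()
            model_engine.step()           # replaces optimizer.step(); no zero_grad() call
    return model_engine
```

Compared with the baseline, the loop body loses one call and the setup code loses the explicit optimizer and DataLoader construction.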
