Principle:Microsoft DeepSpeedExamples Baseline PyTorch Training
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Repository | Microsoft/DeepSpeedExamples |
| Title | Baseline_PyTorch_Training |
| Sources | Doc: PyTorch Tutorial: Training a Classifier |
| Domains | Deep_Learning, Computer_Vision, Training |
| Related Implementation | Implementation:Microsoft_DeepSpeedExamples_Net_Tutorial |
Overview
A standard PyTorch training pattern using manual optimizer management, data loading, and evaluation for image classification.
Description
The baseline PyTorch pattern for training a CIFAR-10 image classifier involves a five-step workflow that serves as the canonical reference implementation before any DeepSpeed migration. This pattern establishes the fundamental building blocks:
- Data Preparation -- Load the CIFAR-10 dataset using torchvision.datasets.CIFAR10 and apply transforms that convert PIL images into tensors (ToTensor scales pixel values to the range [0, 1]) and then normalize them to the range [-1, 1]. A torch.utils.data.DataLoader handles batching, shuffling, and multi-process data loading.
- Model Definition -- Define a simple convolutional neural network (Net) with two convolutional layers followed by three fully connected layers. The architecture maps 3x32x32 CIFAR-10 input images through successive feature extraction and classification stages.
- Loss and Optimizer Setup -- Use nn.CrossEntropyLoss for multi-class classification and optim.SGD with learning rate 0.001 and momentum 0.9. These are created as standalone objects and managed manually.
- Training Loop -- Iterate over epochs and mini-batches with explicit calls to optimizer.zero_grad(), loss.backward(), and optimizer.step(). This three-call pattern (zero gradients, backward pass, optimizer step) is the fundamental PyTorch training contract.
- Evaluation -- Run inference on the test set under torch.no_grad() to compute overall and per-class accuracy.
This explicit control over every training component is the hallmark of standard PyTorch training. Each component (model, loss function, optimizer, data loader) is a separate object that the developer must coordinate manually.
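The normalization in the data preparation step can be checked with a few lines of plain Python. This is a minimal sketch of the per-pixel arithmetic, assuming 8-bit input pixels; to_tensor_scale and normalize are hypothetical helper names that mirror what transforms.ToTensor() and transforms.Normalize(mean, std) compute per value, not torchvision's own code.

```python
def to_tensor_scale(pixel: int) -> float:
    """Scale an 8-bit pixel value (0..255) to [0, 1], as ToTensor does."""
    return pixel / 255.0


def normalize(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Apply (value - mean) / std, the per-channel formula used by Normalize."""
    return (value - mean) / std


# Darkest, mid-range, and brightest pixel values
for pixel in (0, 128, 255):
    scaled = to_tensor_scale(pixel)
    print(pixel, round(scaled, 4), round(normalize(scaled), 4))
```

With mean 0.5 and standard deviation 0.5 on every channel, the darkest pixel maps to -1 and the brightest to 1, which is where the [-1, 1] range comes from.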
Theoretical Basis
Cross-Entropy Loss for Multi-Class Classification
The standard loss function for multi-class classification is the cross-entropy loss:
L = -sum_{i=1}^{C} y_i * log(p_i)
where:
- C is the number of classes (10 for CIFAR-10)
- y_i is the ground truth (one-hot encoded)
- p_i is the predicted probability for class i (output of softmax)
In PyTorch, nn.CrossEntropyLoss combines nn.LogSoftmax and nn.NLLLoss into a single operation, accepting raw logits directly.
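As a worked illustration of the formula, the sketch below computes the loss for a single sample from raw logits in plain Python. The function name cross_entropy_from_logits is hypothetical, and the code only mirrors the log-softmax plus negative log-likelihood combination described above; it is not PyTorch's implementation.

```python
import math


def cross_entropy_from_logits(logits, target):
    """Return -log(softmax(logits)[target]) using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    # With a one-hot y, the sum over classes keeps only the target term.
    return -log_probs[target]


# Uniform logits over 10 classes: every class has probability 1/10.
uniform = [0.0] * 10
print(cross_entropy_from_logits(uniform, target=3))  # ln(10), about 2.3026

# A confident, correct prediction yields a much smaller loss.
confident = [0.0] * 10
confident[3] = 5.0
print(cross_entropy_from_logits(confident, target=3))
```

The uniform case is a useful sanity check: an untrained 10-class classifier should start with a loss near ln(10).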
SGD with Momentum
The optimizer uses Stochastic Gradient Descent with momentum:
v_t = momentum * v_{t-1} + gradient
parameters = parameters - lr * v_t
where:
- lr = 0.001 (learning rate)
- momentum = 0.9
Momentum accelerates convergence by accumulating a velocity vector along directions in which successive gradients agree, while damping oscillations along directions in which they alternate in sign.
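The update rule above can be traced by hand on a single scalar parameter. The following is a minimal sketch in plain Python, assuming a constant gradient of 1.0 purely for illustration; sgd_momentum_step is a hypothetical helper name, not part of torch.optim.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One update: v_t = momentum * v_{t-1} + gradient; param -= lr * v_t."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


param, velocity = 1.0, 0.0
history = []
for step in range(3):
    # Constant gradient of 1.0 (an assumption made for this illustration)
    param, velocity = sgd_momentum_step(param, grad=1.0, velocity=velocity)
    history.append((velocity, param))
    print(step, velocity, param)
```

With a constant gradient the velocity grows as 1.0, 1.9, 2.71 and approaches gradient / (1 - momentum), so momentum 0.9 yields an effective step up to ten times larger than plain SGD in a consistent direction.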
Training Loop Contract
The standard PyTorch training loop follows a strict three-step contract per mini-batch:
optimizer.zero_grad()  # Clear accumulated gradients from previous step
loss.backward()        # Compute gradients via backpropagation
optimizer.step()       # Update model parameters using computed gradients
This explicit pattern gives the developer full control but requires manual coordination. DeepSpeed migration replaces this with model_engine.backward(loss) and model_engine.step(), absorbing zero_grad() internally.
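The reason zero_grad() is part of the contract is that backward() adds to existing gradients rather than overwriting them. The sketch below demonstrates this on a two-element tensor, assuming PyTorch is installed; the tensors and values are made up for illustration.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])
optimizer = torch.optim.SGD([w], lr=0.001, momentum=0.9)

# First backward pass: the gradient of sum(w * x) with respect to w is x.
(w * x).sum().backward()
grad_once = w.grad.clone()

# Second backward pass without zero_grad(): the new gradient is added on top.
(w * x).sum().backward()
grad_twice = w.grad.clone()

# zero_grad() resets the accumulated gradient before the next step.
optimizer.zero_grad()
grad_after_reset = w.grad  # None in recent PyTorch versions, zeros in older ones

print(grad_once, grad_twice, grad_after_reset)
```

Skipping zero_grad() in the baseline loop would therefore make each update use the sum of all gradients seen so far instead of the current mini-batch gradient.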
Architecture
The baseline CNN architecture for CIFAR-10 is structured as follows:
| Layer | Type | Input Shape | Output Shape | Parameters |
|---|---|---|---|---|
| conv1 | Conv2d(3, 6, 5) | (B, 3, 32, 32) | (B, 6, 28, 28) | 456 |
| pool | MaxPool2d(2, 2) | (B, 6, 28, 28) | (B, 6, 14, 14) | 0 |
| conv2 | Conv2d(6, 16, 5) | (B, 6, 14, 14) | (B, 16, 10, 10) | 2,416 |
| pool | MaxPool2d(2, 2) | (B, 16, 10, 10) | (B, 16, 5, 5) | 0 |
| fc1 | Linear(400, 120) | (B, 400) | (B, 120) | 48,120 |
| fc2 | Linear(120, 84) | (B, 120) | (B, 84) | 10,164 |
| fc3 | Linear(84, 10) | (B, 84) | (B, 10) | 850 |
| Total | | | | 62,006 |
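The shapes and parameter counts in the table follow from standard formulas and can be recomputed in plain Python. This is a sketch under the assumptions stated in the table (5x5 kernels, stride 1, no padding, 2x2 pooling with stride 2); the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial output size of max pooling along one dimension."""
    return (size - kernel) // stride + 1


def conv_params(c_in, c_out, kernel):
    """Weights (c_in * c_out * k * k) plus one bias per output channel."""
    return c_in * c_out * kernel * kernel + c_out


def linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


size = pool_out(conv_out(32, 5))    # 32 -> 28 -> 14
size = pool_out(conv_out(size, 5))  # 14 -> 10 -> 5
flat_features = 16 * size * size    # 16 channels * 5 * 5 = 400 inputs to fc1

counts = {
    "conv1": conv_params(3, 6, 5),
    "conv2": conv_params(6, 16, 5),
    "fc1": linear_params(flat_features, 120),
    "fc2": linear_params(120, 84),
    "fc3": linear_params(84, 10),
}
total = sum(counts.values())
print(counts, total)
```

The flattening step between the second pooling layer and fc1 is implicit in the table: the (B, 16, 5, 5) feature map is reshaped to (B, 400) before the first linear layer.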
Key Hyperparameters
| Parameter | Value | Purpose |
|---|---|---|
| batch_size | 4 | Mini-batch size for DataLoader |
| epochs | 2 | Number of training passes over the dataset |
| learning_rate | 0.001 | SGD learning rate |
| momentum | 0.9 | SGD momentum coefficient |
| num_workers | 2 | Parallel data loading processes |
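These values determine how many optimizer updates the baseline performs. The arithmetic below is a small sketch assuming the standard CIFAR-10 training split of 50,000 images and the DataLoader default of keeping the last partial batch.

```python
import math

TRAIN_IMAGES = 50_000  # size of the standard CIFAR-10 training split
batch_size = 4
epochs = 2

# With drop_last left at its default (False), a partial final batch still counts.
steps_per_epoch = math.ceil(TRAIN_IMAGES / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 12500 25000
```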
Code Pattern
The complete baseline training pattern:
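import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Net is the CNN described in the Architecture section; its definition lives in
# Implementation:Microsoft_DeepSpeedExamples_Net_Tutorial.
# Assumed device selection: use the first GPU if available, otherwise the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Step 1: Data preparation
transform = transforms.Compose([
    transforms.ToTensor(),                                  # PIL image -> float tensor in [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # per channel: [0, 1] -> [-1, 1]
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
# Assumed: the test loader is built the same way as the training loader, without shuffling.
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)

# Step 2: Model definition
net = Net()
net.to(device)

# Step 3: Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Step 4: Training loop
for epoch in range(2):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()              # clear gradients from the previous step
        outputs = net(inputs)              # forward pass
        loss = criterion(outputs, labels)
        loss.backward()                    # backpropagate
        optimizer.step()                   # update parameters
        running_loss += loss.item()        # accumulate loss for logging

# Step 5: Evaluation
correct, total = 0, 0
with torch.no_grad():
    for data in testloader:
        # Move the batch to the same device as the model
        images, labels = data[0].to(device), data[1].to(device)
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)  # index of the highest logit per sample
        total += labels.size(0)
        correct += (predicted == labels).sum().item()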
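The evaluation step above tallies overall accuracy only, while the description also mentions per-class accuracy. A minimal sketch of that tally in plain Python follows; per_class_accuracy is a hypothetical helper, and the label and prediction lists are made-up values. In the PyTorch loop, the predicted and labels tensors could be converted with .tolist() before being passed in.

```python
from collections import Counter


def per_class_accuracy(predicted, labels):
    """Return {class_index: fraction of that class predicted correctly}."""
    correct = Counter()
    total = Counter()
    for p, y in zip(predicted, labels):
        total[y] += 1
        if p == y:
            correct[y] += 1
    return {c: correct[c] / total[c] for c in sorted(total)}


# Made-up class indices for illustration
labels = [0, 0, 1, 1, 1, 2]
predicted = [0, 1, 1, 1, 0, 2]
accuracy = per_class_accuracy(predicted, labels)
print(accuracy)
```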
Migration Points
When migrating from this baseline to DeepSpeed, the following changes are required:
- Optimizer creation is replaced by deepspeed.initialize(), which creates the optimizer internally
- DataLoader creation is replaced by DeepSpeed's distributed data loader returned from deepspeed.initialize()
- optimizer.zero_grad() is removed (handled internally by the engine)
- loss.backward() becomes model_engine.backward(loss)
- optimizer.step() becomes model_engine.step()
- Device placement is managed by the DeepSpeed engine via local_rank
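The changes listed above can be sketched as a single per-batch function. This is a hedged illustration rather than the repository's code: deepspeed_train_step is a hypothetical name, the function only relies on the engine being callable and exposing backward(loss) and step(), and the wiring in the trailing comment assumes the deepspeed package, a DeepSpeed configuration, and a parsed args object that are not shown here.

```python
def deepspeed_train_step(model_engine, criterion, inputs, labels):
    """Run one mini-batch through a DeepSpeed-style engine.

    Compared with the baseline loop there is no optimizer.zero_grad() and no
    direct optimizer.step(); both are delegated to the engine.
    """
    outputs = model_engine(inputs)     # forward pass through the wrapped model
    loss = criterion(outputs, labels)
    model_engine.backward(loss)        # replaces loss.backward()
    model_engine.step()                # replaces optimizer.step()
    return loss


# Assumed wiring (requires the deepspeed package and a DeepSpeed config, not shown):
#
#   import deepspeed
#   model_engine, optimizer, trainloader, _ = deepspeed.initialize(
#       args=args, model=net, model_parameters=net.parameters(),
#       training_data=trainset)
#   for inputs, labels in trainloader:
#       inputs = inputs.to(model_engine.local_rank)
#       labels = labels.to(model_engine.local_rank)
#       deepspeed_train_step(model_engine, criterion, inputs, labels)
```

Keeping the step in its own function makes the difference from the baseline easy to see: the three-call contract shrinks to two engine calls.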
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Net_Tutorial -- Reference PyTorch CNN implementation for CIFAR-10
- Principle:Microsoft_DeepSpeedExamples_DeepSpeed_Engine_Init -- DeepSpeed engine initialization that replaces manual setup
- Principle:Microsoft_DeepSpeedExamples_DeepSpeed_CLI_Integration -- CLI argument integration for DeepSpeed
- Principle:Microsoft_DeepSpeedExamples_Classification_Evaluation -- Evaluation methodology used in both baseline and DeepSpeed versions