Implementation:Microsoft DeepSpeedExamples Net Tutorial

Metadata

Field	Value
Page Type	Implementation
Repository	Microsoft/DeepSpeedExamples
Title	Net_Tutorial
Type	Pattern Doc (reference implementation)
Source File	`training/cifar/cifar10_tutorial.py`
Lines	130-197
Implements	Principle:Microsoft_DeepSpeedExamples_Baseline_PyTorch_Training

Overview

Reference PyTorch CNN implementation for CIFAR-10 that serves as the baseline before DeepSpeed migration.

Description

The Net class in cifar10_tutorial.py is the baseline CNN for the CIFAR-10 Getting Started workflow. It is a small convolutional network: two convolutional layers, each followed by ReLU and 2x2 max pooling, then three fully connected layers. Together with the surrounding training loop, optimizer setup, and evaluation code, it forms the complete baseline that the DeepSpeed migration builds upon.

The implementation follows standard PyTorch conventions:

Layers are defined in __init__ as module attributes
The forward method chains these layers with activation functions and pooling
The model is instantiated and moved to a device explicitly
An external optimizer (optim.SGD) and loss function (nn.CrossEntropyLoss) are created separately

The training loop at lines 173-197 uses the explicit three-call pattern (optimizer.zero_grad() / loss.backward() / optimizer.step()). The DeepSpeed version replaces these calls with model_engine.backward(loss) and model_engine.step().

Code Reference

File: training/cifar/cifar10_tutorial.py, Lines 130-197

Model Definition (Lines 130-147)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
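
As a quick sanity check on the layer definitions above, the trainable parameter count can be worked out by hand. The sketch below is illustrative only: the helper names `conv2d_params` and `linear_params` are hypothetical and not part of the tutorial, and it assumes the default `bias=True` for every layer.

```python
# Hand count of trainable parameters for the layers in Net.
# Conv2d(in, out, k): out * in * k * k weights plus out biases.
# Linear(in, out):    out * in weights plus out biases.

def conv2d_params(in_channels, out_channels, kernel_size):
    """Parameter count of a square-kernel Conv2d layer with bias."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features, out_features):
    """Parameter count of a Linear layer with bias."""
    return out_features * in_features + out_features


layer_params = {
    "conv1": conv2d_params(3, 6, 5),
    "conv2": conv2d_params(6, 16, 5),
    "fc1": linear_params(16 * 5 * 5, 120),
    "fc2": linear_params(120, 84),
    "fc3": linear_params(84, 10),
}
total_params = sum(layer_params.values())
print(layer_params)
print("total:", total_params)
```

The total comes to 62,006 parameters, most of them in fc1. With PyTorch available, `sum(p.numel() for p in Net().parameters())` should report the same figure.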

Optimizer and Loss Setup (Lines 160-163)

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
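
To make the `lr=0.001, momentum=0.9` setting concrete, the following is a minimal scalar sketch of the momentum update as PyTorch's SGD formulates it, assuming the defaults used here (no dampening, no Nesterov, no weight decay). The function `sgd_momentum_step` is a hypothetical illustration, not code from the tutorial.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update on a single scalar parameter.

    velocity accumulates past gradients: v = momentum * v + grad
    the parameter then moves against it:  p = p - lr * v
    """
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


# Two steps with a constant gradient of 0.5, starting from zero velocity.
p, v = 1.0, 0.0
for g in (0.5, 0.5):
    p, v = sgd_momentum_step(p, g, v)
print(p, v)
```

With a constant gradient the velocity grows from step to step, so the second update is larger than the first. This is the effect momentum is meant to provide along consistent descent directions.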

Training Loop (Lines 173-197)

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].to(device), data[1].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

Signature

class Net(nn.Module):
    def __init__(self) -> None:
        # Conv2d(3, 6, 5) -> MaxPool2d(2, 2) -> Conv2d(6, 16, 5) -> MaxPool2d(2, 2)
        # FC(16*5*5, 120) -> FC(120, 84) -> FC(84, 10)
        ...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ...

I/O Contract

Direction	Name	Type	Description
Input	x	`torch.Tensor` (B, 3, 32, 32)	Batch of normalized CIFAR-10 images, range [-1, 1]
Output	logits	`torch.Tensor` (B, 10)	Raw class scores (logits) for 10 CIFAR-10 classes

Training I/O:

Component	Type	Configuration
trainloader	`DataLoader`	batch_size=4, shuffle=True, num_workers=2
testloader	`DataLoader`	batch_size=4, shuffle=False, num_workers=2
criterion	`nn.CrossEntropyLoss`	Default (mean reduction)
optimizer	`optim.SGD`	lr=0.001, momentum=0.9
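
The model returns raw logits because `nn.CrossEntropyLoss` applies log-softmax internally and, by default, averages over the batch. The pure-Python sketch below illustrates that computation under those assumptions; `cross_entropy_mean` is a hypothetical helper written for this page, not the library implementation.

```python
import math


def cross_entropy_mean(logits_batch, labels):
    """Mean-reduced cross-entropy over a batch of raw logit rows.

    For each row: loss = log(sum(exp(logits))) - logits[label]
    """
    losses = []
    for logits, label in zip(logits_batch, labels):
        m = max(logits)  # subtract the max before exp for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum_exp - logits[label])
    return sum(losses) / len(losses)


# A batch of 4 rows with identical logits for all 10 classes.
uniform_logits = [[0.0] * 10 for _ in range(4)]
loss = cross_entropy_mean(uniform_logits, [3, 1, 4, 1])
print(round(loss, 3))
```

When every class gets the same score, the loss equals ln(10), about 2.303. A freshly initialized 10-class model typically starts near this value, which is a useful reference point when reading the running loss printed by the training loop.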

Architecture Diagram

Input: (B, 3, 32, 32)
        |
   [Conv2d(3, 6, 5)]  --> ReLU --> [MaxPool2d(2, 2)]
        |                              Output: (B, 6, 14, 14)
   [Conv2d(6, 16, 5)] --> ReLU --> [MaxPool2d(2, 2)]
        |                              Output: (B, 16, 5, 5)
   [Flatten]                           Output: (B, 400)
        |
   [Linear(400, 120)]  --> ReLU        Output: (B, 120)
        |
   [Linear(120, 84)]   --> ReLU        Output: (B, 84)
        |
   [Linear(84, 10)]                    Output: (B, 10) -- raw logits
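
The shapes in the diagram follow from the standard output-size formula for unpadded convolution and pooling. The sketch below traces them without PyTorch; `conv_out` and `pool_out` are hypothetical helper names, and it assumes stride 1 and zero padding for the convolutions, which are the `nn.Conv2d` defaults.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Spatial output size of max pooling along one dimension."""
    return (size - kernel) // stride + 1


side = 32                       # CIFAR-10 images are 32x32
side = conv_out(side, 5)        # Conv2d(3, 6, 5): 32 -> 28
side = pool_out(side, 2, 2)     # MaxPool2d(2, 2): 28 -> 14
after_block1 = (6, side, side)

side = conv_out(side, 5)        # Conv2d(6, 16, 5): 14 -> 10
side = pool_out(side, 2, 2)     # MaxPool2d(2, 2): 10 -> 5
after_block2 = (16, side, side)

flat_features = 16 * side * side  # input width of fc1
print(after_block1, after_block2, flat_features)
```

This is where the `16 * 5 * 5` in `fc1` and in the `view` call comes from. Changing the input resolution or kernel sizes would require recomputing that number.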

Usage Example

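# Run the baseline tutorial directly (shell command, from training/cifar/):
#   python cifar10_tutorial.py

# Programmatic usage (assumes the Net class from the Model Definition
# section above is already defined in the current scope)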
import torch
import torch.nn as nn
import torch.nn.functional as F

net = Net()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net.to(device)

# Single forward pass
sample_input = torch.randn(4, 3, 32, 32).to(device)
logits = net(sample_input)  # Shape: (4, 10)
predicted_classes = torch.argmax(logits, dim=1)  # Shape: (4,)

Key Differences from DeepSpeed Version

Aspect	Baseline (Net_Tutorial)	DeepSpeed (Net_DeepSpeed)
Constructor	`__init__(self)` -- no arguments	`__init__(self, args)` -- accepts argument namespace
Final layer	`self.fc3 = nn.Linear(84, 10)`	Optionally replaced with MoE layer + `fc4`
Optimizer	Manual `optim.SGD`	Created internally by `deepspeed.initialize()`
DataLoader	Manual `DataLoader`	Created by DeepSpeed with distributed sampling
Training loop	`zero_grad() / backward() / step()`	`model_engine.backward(loss) / model_engine.step()`
Device management	Manual `.to(device)`	Managed by DeepSpeed engine via `local_rank`
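
For comparison with the baseline loop, the following is a minimal sketch of how the DeepSpeed-side calls in the table fit together. It assumes `model_engine` is the engine object returned by `deepspeed.initialize()` and that `criterion` is still created by the caller; the function name `train_one_epoch` is hypothetical and does not appear in the repository.

```python
def train_one_epoch(model_engine, trainloader, criterion):
    """Run one pass over trainloader using the DeepSpeed engine calls.

    There is no explicit optimizer.zero_grad() or optimizer.step():
    the engine owns the optimizer and is driven through backward/step.
    """
    running_loss = 0.0
    for inputs, labels in trainloader:
        # Place the batch on the device assigned to this process.
        inputs = inputs.to(model_engine.local_rank)
        labels = labels.to(model_engine.local_rank)

        outputs = model_engine(inputs)       # forward pass through the wrapped model
        loss = criterion(outputs, labels)

        model_engine.backward(loss)          # replaces loss.backward()
        model_engine.step()                  # replaces optimizer.step()

        running_loss += loss.item()
    return running_loss
```

The structure mirrors the baseline loop line for line, which is the point of keeping the baseline simple: only the backward and step calls and the device placement change.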

Related Pages

Principle:Microsoft_DeepSpeedExamples_Baseline_PyTorch_Training -- The principle this implementation realizes
Implementation:Microsoft_DeepSpeedExamples_Net_DeepSpeed -- DeepSpeed-enhanced version of the same model
Implementation:Microsoft_DeepSpeedExamples_Test_Function_CIFAR -- Evaluation function used after training
Environment:Microsoft_DeepSpeedExamples_CIFAR10_Training_Environment
