Implementation:Microsoft DeepSpeedExamples Net DeepSpeed
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation |
| Repository | Microsoft/DeepSpeedExamples |
| Title | Net_DeepSpeed |
| Type | Class Doc |
| Source File | training/cifar/cifar10_deepspeed.py |
| Lines | 166-209 |
| Implements | Principle:Microsoft_DeepSpeedExamples_DeepSpeed_MoE_Training |
Overview
Concrete tool for defining a CNN with optional DeepSpeed MoE layers for CIFAR-10 training.
Description
The Net class in cifar10_deepspeed.py extends the baseline CIFAR-10 CNN with optional Mixture of Experts (MoE) support. It shares the same convolutional backbone as the tutorial version (two conv layers + pooling + two FC layers) but conditionally replaces the final classification layer with a DeepSpeed MoE layer followed by a separate output projection.
The key architectural decision is controlled by args.moe:
- When args.moe=False (default): The model is identical to the baseline; fc3 = nn.Linear(84, 10) maps directly to class logits. This is the standard dense path.
- When args.moe=True: The final layer is replaced with:
  - One or more deepspeed.moe.layer.MoE layers, each wrapping an nn.Linear(84, 84) expert network
  - A final fc4 = nn.Linear(84, 10) projection to class logits
The MoE layers are stored in an nn.ModuleList to support multiple MoE layers in sequence (controlled by the --num-experts list argument). Each entry in args.num_experts creates one MoE layer with that many experts.
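How a list-valued flag turns into a stack of MoE layers can be sketched with a small parser. This is a minimal, hypothetical reconstruction that covers only the fields Net reads. The flag names follow the usage examples on this page, but the `build_parser` helper and all default values are assumptions, not the script's actual definitions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the MoE-related fields read by Net."""
    parser = argparse.ArgumentParser(description="CIFAR-10 MoE flags (sketch)")
    parser.add_argument("--moe", action="store_true", help="enable MoE layers")
    # nargs="+" turns `--num-experts 2 4` into the list [2, 4].
    parser.add_argument("--num-experts", type=int, nargs="+", default=[1])
    # Defaults below are assumptions made for this sketch.
    parser.add_argument("--ep-world-size", type=int, default=1)
    parser.add_argument("--mlp-type", type=str, default="standard")
    parser.add_argument("--top-k", type=int, default=1)
    parser.add_argument("--min-capacity", type=int, default=0)
    parser.add_argument("--noisy-gate-policy", type=str, default=None)
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(
    ["--moe", "--num-experts", "2", "4", "--mlp-type", "residual"]
)

# One MoE layer is built per entry in args.num_experts when --moe is set.
moe_layer_count = len(args.num_experts) if args.moe else 0
print(args.num_experts, moe_layer_count)
```

With these arguments, Net would build two MoE layers, the first with 2 experts and the second with 4.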
Code Reference
File: training/cifar/cifar10_deepspeed.py, Lines 166-209
class Net(nn.Module):
    def __init__(self, args):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.moe = args.moe
        if self.moe:
            fc3 = nn.Linear(84, 84)
            self.moe_layer_list = []
            for n_e in args.num_experts:
                # Create moe layers based on the number of experts.
                self.moe_layer_list.append(
                    deepspeed.moe.layer.MoE(
                        hidden_size=84,
                        expert=fc3,
                        num_experts=n_e,
                        ep_size=args.ep_world_size,
                        use_residual=args.mlp_type == "residual",
                        k=args.top_k,
                        min_capacity=args.min_capacity,
                        noisy_gate_policy=args.noisy_gate_policy,
                    )
                )
            self.moe_layer_list = nn.ModuleList(self.moe_layer_list)
            self.fc4 = nn.Linear(84, 10)
        else:
            self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        if self.moe:
            for layer in self.moe_layer_list:
                x, _, _ = layer(x)
            x = self.fc4(x)
        else:
            x = self.fc3(x)
        return x
Signature
class Net(nn.Module):
    def __init__(self, args: argparse.Namespace):
        """CNN for CIFAR-10 with optional MoE layers.

        Args:
            args: Parsed arguments. Relevant fields:
                - args.moe (bool): Enable MoE layers
                - args.num_experts (list[int]): Number of experts per MoE layer
                - args.ep_world_size (int): Expert parallel world size
                - args.mlp_type (str): "standard" or "residual"
                - args.top_k (int): Top-k gating (1 or 2)
                - args.min_capacity (int): Minimum expert capacity
                - args.noisy_gate_policy (str or None): Noise policy for gating
        """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Args:
            x: Input tensor of shape (B, 3, 32, 32).

        Returns:
            Logits tensor of shape (B, 10).
        """
I/O Contract
| Direction | Name | Type | Shape | Description |
|---|---|---|---|---|
| Input | x | torch.Tensor | (B, 3, 32, 32) | Batch of normalized CIFAR-10 images |
| Output | logits | torch.Tensor | (B, 10) | Raw class scores for 10 classes |
Constructor Input:
| Parameter | Type | Description |
|---|---|---|
| args.moe | bool | Whether to enable MoE layers |
| args.num_experts | list[int] | Number of experts for each MoE layer (e.g., [2] or [2, 4]) |
| args.ep_world_size | int | Expert parallel group size |
| args.mlp_type | str | "standard" or "residual" MoE mode |
| args.top_k | int | Number of experts activated per input (1 or 2) |
| args.min_capacity | int | Minimum expert capacity |
| args.noisy_gate_policy | str or None | Noise policy: None, "RSample", or "Jitter" |
Architecture Diagrams
Dense Path (args.moe=False)
Input: (B, 3, 32, 32)
|
[Conv2d(3,6,5)] --> ReLU --> [MaxPool2d(2,2)] (B, 6, 14, 14)
|
[Conv2d(6,16,5)] --> ReLU --> [MaxPool2d(2,2)] (B, 16, 5, 5)
|
[Flatten] (B, 400)
|
[Linear(400,120)] --> ReLU (B, 120)
|
[Linear(120,84)] --> ReLU (B, 84)
|
[Linear(84,10)] -- fc3 (B, 10) logits
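The spatial sizes in the diagram follow from the standard output-size rule for unpadded convolutions and pooling. The sketch below recomputes them in plain Python. The helper names (`conv_out`, `pool_out`, `backbone_shapes`) are hypothetical and exist only for this illustration.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Side length after a square convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int, stride: int) -> int:
    """Side length after a square max-pool without padding."""
    return (size - kernel) // stride + 1


def backbone_shapes(side: int = 32):
    """Per-sample shapes after each conv+pool stage, plus the flattened width."""
    s = pool_out(conv_out(side, 5), 2, 2)   # Conv2d(3, 6, 5) then MaxPool2d(2, 2)
    first = (6, s, s)
    s = pool_out(conv_out(s, 5), 2, 2)      # Conv2d(6, 16, 5) then MaxPool2d(2, 2)
    second = (16, s, s)
    flat = 16 * s * s                       # input width expected by fc1
    return first, second, flat


print(backbone_shapes())
```

For 32x32 inputs this gives 6x14x14, then 16x5x5, which flattens to the 400 features that fc1 expects.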
MoE Path (args.moe=True, num_experts=[2])
Input: (B, 3, 32, 32)
|
[Conv2d(3,6,5)] --> ReLU --> [MaxPool2d(2,2)] (B, 6, 14, 14)
|
[Conv2d(6,16,5)] --> ReLU --> [MaxPool2d(2,2)] (B, 16, 5, 5)
|
[Flatten] (B, 400)
|
[Linear(400,120)] --> ReLU (B, 120)
|
[Linear(120,84)] --> ReLU (B, 84)
|
[MoE Layer] (B, 84)
|--- Gate: softmax(W_g * x) --> top-k routing
|--- Expert 0: Linear(84, 84)
|--- Expert 1: Linear(84, 84)
|--- Output: weighted sum of top-k expert outputs
|
[Linear(84,10)] -- fc4 (B, 10) logits
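The routing step in the diagram can be illustrated for a single token in plain Python. This is a simplified sketch, not DeepSpeed's gate: it ignores expert capacity, gate noise, and the auxiliary loss, and renormalising the weights over the selected experts is an assumption of the sketch. The names `softmax` and `top_k_route` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, expert_outputs, k=1):
    """Mix the k highest-probability experts for one token.

    gate_logits: one gate score per expert.
    expert_outputs: one output vector per expert (already computed).
    Returns the chosen expert indices and the mixed output vector.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)  # renormalise over chosen experts (assumption)
    width = len(expert_outputs[0])
    mixed = [0.0] * width
    for i in chosen:
        weight = probs[i] / norm
        for j in range(width):
            mixed[j] += weight * expert_outputs[i][j]
    return chosen, mixed


# Two experts, top-1 routing: the token goes to the expert with the larger score.
chosen, mixed = top_k_route([0.1, 2.0], [[1.0, 1.0], [3.0, 5.0]], k=1)
print(chosen, mixed)
```

In the real layer only the selected experts are evaluated. The sketch takes precomputed expert outputs to keep the mixing step visible.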
Pyramid Residual MoE (num_experts=[2, 4])
...
[Linear(120,84)] --> ReLU (B, 84)
|
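[MoE Layer 0] -- 2 experts, residual (B, 84)
|--- c_moe * MoE output + c_dense * Dense(x), with (c_moe, c_dense) = softmax(coefficient(x))
|
[MoE Layer 1] -- 4 experts, residual (B, 84)
|--- c_moe * MoE output + c_dense * Dense(x), with (c_moe, c_dense) = softmax(coefficient(x))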
|
[Linear(84,10)] -- fc4 (B, 10) logits
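The residual combination in each layer of the diagram can be sketched for one token as follows. It assumes, based on a reading of the DeepSpeed MoE layer that this page does not itself confirm, that a two-way coefficient is computed from the input, passed through softmax, and used to blend the MoE output with a dense branch. The function name `residual_combine` is hypothetical.

```python
import math


def residual_combine(moe_out, dense_out, coef_logits):
    """Blend MoE and dense outputs with softmax-normalised coefficients.

    moe_out, dense_out: output vectors of equal length for one token.
    coef_logits: two raw scores (MoE branch, dense branch); in the assumed
    design these come from a small linear layer applied to the input.
    """
    m = max(coef_logits)
    e_moe, e_dense = (math.exp(c - m) for c in coef_logits)
    c_moe = e_moe / (e_moe + e_dense)
    c_dense = e_dense / (e_moe + e_dense)
    return [c_moe * a + c_dense * b for a, b in zip(moe_out, dense_out)]


# Equal scores give an even blend of the two branches.
print(residual_combine([2.0, 4.0], [0.0, 0.0], [0.0, 0.0]))
```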
MoE Layer Configuration
| Parameter | Value | Source |
|---|---|---|
| hidden_size | 84 | Matches fc2 output dimension |
| expert | nn.Linear(84, 84) | Expert network with matching input and output width (84 -> 84) |
| num_experts | args.num_experts[i] | Per-layer expert count from CLI |
| ep_size | args.ep_world_size | Expert parallel group size |
| use_residual | args.mlp_type == "residual" | Residual MoE toggle |
| k | args.top_k | Top-k gating (1 or 2) |
| min_capacity | args.min_capacity | Minimum tokens per expert |
| noisy_gate_policy | args.noisy_gate_policy | None, RSample, or Jitter |
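Two of these parameters interact in ways the table does not show: the expert count is spread across the expert-parallel group, and min_capacity puts a floor under the number of tokens each expert can accept. The sketch below is an illustration under stated assumptions. The divisibility requirement and the capacity rule (ceiling of tokens per expert times a capacity factor, floored at min_capacity, for top-1 gating) reflect my understanding of DeepSpeed's behaviour. The helper names and the `capacity_factor` default are hypothetical.

```python
import math


def local_experts(num_experts: int, ep_size: int) -> int:
    """Experts hosted on each rank of the expert-parallel group (assumed rule)."""
    if num_experts % ep_size != 0:
        raise ValueError(
            f"num_experts ({num_experts}) must be divisible by ep_size ({ep_size})"
        )
    return num_experts // ep_size


def expert_capacity(num_tokens: int, num_experts: int, min_capacity: int,
                    capacity_factor: float = 1.0) -> int:
    """Token slots per expert for top-1 gating (assumed rule)."""
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    return max(capacity, min_capacity)


# `--num-experts 2 4 --ep-world-size 2`: experts per rank for each MoE layer.
print([local_experts(n, 2) for n in [2, 4]])

# A batch of 16 tokens (one per image here) shared by 4 experts.
print(expert_capacity(num_tokens=16, num_experts=4, min_capacity=0))
```

Under these assumptions, a small batch combined with many experts can push the computed capacity very low, which is where a non-zero min_capacity matters.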
MoE Forward Pass Details
The MoE layer forward call returns a 3-tuple:
# In forward():
if self.moe:
    for layer in self.moe_layer_list:
        x, _, _ = layer(x)  # Returns (output, gate_loss, expert_count)
    x = self.fc4(x)
The gate_loss (auxiliary load balancing loss) and expert_count (expert utilization stats) are discarded in this example. In production MoE training, the gate loss is typically added to the main training loss to encourage balanced expert utilization.
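Keeping the gate loss instead of discarding it could look roughly like the sketch below. It does not use DeepSpeed: `ToyMoE` is a hypothetical stand-in that only mimics the 3-tuple return value, its auxiliary loss is a placeholder rather than a real load-balancing term, and `aux_coef` is an assumed weighting that the example script does not define.

```python
import torch
import torch.nn as nn


class ToyMoE(nn.Module):
    """Hypothetical stand-in that mimics the 3-tuple returned by a MoE layer."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.expert = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        out = self.expert(x)
        l_aux = out.pow(2).mean()                 # placeholder, not a real balance loss
        exp_counts = torch.tensor([x.shape[0]])   # placeholder utilisation stats
        return out, l_aux, exp_counts


def forward_with_aux(layers, head, x):
    """Run the MoE stack and return (logits, summed auxiliary loss)."""
    aux_total = x.new_zeros(())
    for layer in layers:
        x, l_aux, _ = layer(x)
        aux_total = aux_total + l_aux
    return head(x), aux_total


torch.manual_seed(0)
layers = nn.ModuleList([ToyMoE(84), ToyMoE(84)])
head = nn.Linear(84, 10)                          # plays the role of fc4
features = torch.randn(4, 84)                     # stand-in for the fc2 activations
labels = torch.tensor([0, 1, 2, 3])

logits, aux = forward_with_aux(layers, head, features)
aux_coef = 0.01                                   # assumed weighting
loss = nn.functional.cross_entropy(logits, labels) + aux_coef * aux
loss.backward()
```

The design point is that forward() would need to return the auxiliary term alongside the logits, so the training loop can add it before calling backward.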
Usage Examples
# Standard dense mode (no MoE)
deepspeed cifar10_deepspeed.py --deepspeed
# MoE with 2 experts, top-1 gating
deepspeed --num_gpus=2 cifar10_deepspeed.py --deepspeed \
--moe --num-experts 2 --top-k 1 --ep-world-size 2 --moe-param-group
# Pyramid Residual MoE with 2 and 4 experts
deepspeed --num_gpus=2 cifar10_deepspeed.py --deepspeed \
--moe --num-experts 2 4 --top-k 1 --mlp-type residual \
--ep-world-size 2 --noisy-gate-policy RSample --moe-param-group
Training Loop Integration
After initialization, the training loop using the MoE model is identical to the dense case:
for epoch in range(args.epochs):
    running_loss = 0.0
    for i, data in enumerate(trainloader):
        inputs, labels = data[0].to(local_device), data[1].to(local_device)
        if target_dtype is not None:
            inputs = inputs.to(target_dtype)
        outputs = model_engine(inputs)   # MoE routing happens inside forward()
        loss = criterion(outputs, labels)
        model_engine.backward(loss)      # Gradients flow through MoE + experts
        model_engine.step()              # Updates expert + non-expert params
The DeepSpeed engine handles the AllToAll communication for expert dispatch and the separate gradient reduction for expert vs non-expert parameters.
Related Pages
- Principle:Microsoft_DeepSpeedExamples_DeepSpeed_MoE_Training -- The principle this implementation realizes
- Implementation:Microsoft_DeepSpeedExamples_Net_Tutorial -- Baseline version without MoE
- Implementation:Microsoft_DeepSpeedExamples_Add_Argument_CIFAR -- CLI arguments that configure MoE
- Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Initialize_CIFAR -- Engine initialization with MoE param groups
- Implementation:Microsoft_DeepSpeedExamples_Test_Function_CIFAR -- Evaluating the MoE model
- Environment:Microsoft_DeepSpeedExamples_CIFAR10_Training_Environment