Implementation:Microsoft DeepSpeedExamples Net DeepSpeed

Metadata

Field	Value
Page Type	Implementation
Repository	Microsoft/DeepSpeedExamples
Title	Net_DeepSpeed
Type	Class Doc
Source File	`training/cifar/cifar10_deepspeed.py`
Lines	166-209
Implements	Principle:Microsoft_DeepSpeedExamples_DeepSpeed_MoE_Training

Overview

Defines a CNN for CIFAR-10 training whose classification head can optionally use DeepSpeed Mixture of Experts (MoE) layers.

Description

The Net class in cifar10_deepspeed.py extends the baseline CIFAR-10 CNN with optional Mixture of Experts (MoE) support. It shares the convolutional backbone of the tutorial version (two convolutional layers, each followed by max pooling, then two fully connected layers) but, when MoE is enabled, replaces the final classification layer with one or more DeepSpeed MoE layers followed by a separate output projection.

The key architectural decision is controlled by args.moe:

- When args.moe=False (default): the model is identical to the baseline. fc3 = nn.Linear(84, 10) maps directly to class logits. This is the standard dense path.
- When args.moe=True: the final layer is replaced with:
  - One or more deepspeed.moe.layer.MoE layers, each wrapping an nn.Linear(84, 84) expert network
  - A final fc4 = nn.Linear(84, 10) projection to class logits

The MoE layers are stored in an nn.ModuleList to support multiple MoE layers in sequence (controlled by the --num-experts list argument). Each entry in args.num_experts creates one MoE layer with that many experts.
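
The branching described above can be summarized as a small planning function. This is only a framework-free sketch: plan_head is a hypothetical helper that does not exist in the repository, and it returns text descriptions of the layers rather than building real modules.

```python
from types import SimpleNamespace


def plan_head(args):
    """Describe the classification head that Net would build for these args.

    Hypothetical helper for illustration; it mirrors the if/else in
    Net.__init__ but only returns human-readable layer descriptions.
    """
    if not args.moe:
        # Dense path: a single projection straight to the 10 class logits.
        return ["fc3: Linear(84, 10)"]
    # MoE path: one MoE layer per entry in num_experts, then fc4.
    layers = [
        f"moe_layer_list[{i}]: MoE(hidden_size=84, num_experts={n})"
        for i, n in enumerate(args.num_experts)
    ]
    layers.append("fc4: Linear(84, 10)")
    return layers


dense_head = plan_head(SimpleNamespace(moe=False, num_experts=[1]))
pyramid_head = plan_head(SimpleNamespace(moe=True, num_experts=[2, 4]))
print(dense_head)
print(pyramid_head)
```

With num_experts=[2, 4] the plan contains two MoE layers in sequence followed by fc4, which matches the order in which forward() applies them.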

Code Reference

File: training/cifar/cifar10_deepspeed.py, Lines 166-209

class Net(nn.Module):
    def __init__(self, args):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.moe = args.moe
        if self.moe:
            fc3 = nn.Linear(84, 84)
            self.moe_layer_list = []
            for n_e in args.num_experts:
                # Create moe layers based on the number of experts.
                self.moe_layer_list.append(
                    deepspeed.moe.layer.MoE(
                        hidden_size=84,
                        expert=fc3,
                        num_experts=n_e,
                        ep_size=args.ep_world_size,
                        use_residual=args.mlp_type == "residual",
                        k=args.top_k,
                        min_capacity=args.min_capacity,
                        noisy_gate_policy=args.noisy_gate_policy,
                    )
                )
            self.moe_layer_list = nn.ModuleList(self.moe_layer_list)
            self.fc4 = nn.Linear(84, 10)
        else:
            self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        if self.moe:
            for layer in self.moe_layer_list:
                x, _, _ = layer(x)
            x = self.fc4(x)
        else:
            x = self.fc3(x)
        return x

Signature

class Net(nn.Module):
    def __init__(self, args: argparse.Namespace):
        """CNN for CIFAR-10 with optional MoE layers.

        Args:
            args: Parsed arguments. Relevant fields:
                - args.moe (bool): Enable MoE layers
                - args.num_experts (list[int]): Number of experts per MoE layer
                - args.ep_world_size (int): Expert parallel world size
                - args.mlp_type (str): "standard" or "residual"
                - args.top_k (int): Top-k gating (1 or 2)
                - args.min_capacity (int): Minimum expert capacity
                - args.noisy_gate_policy (str or None): Noise policy for gating
        """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Args:
            x: Input tensor of shape (B, 3, 32, 32).

        Returns:
            Logits tensor of shape (B, 10).
        """

I/O Contract

Direction	Name	Type	Shape	Description
Input	x	`torch.Tensor`	(B, 3, 32, 32)	Batch of normalized CIFAR-10 images
Output	logits	`torch.Tensor`	(B, 10)	Raw class scores for 10 classes

Constructor Input:

Parameter	Type	Description
args.moe	`bool`	Whether to enable MoE layers
args.num_experts	`list[int]`	Number of experts for each MoE layer (e.g., [2] or [2, 4])
args.ep_world_size	`int`	Expert parallel group size
args.mlp_type	`str`	"standard" or "residual" MoE mode
args.top_k	`int`	Number of experts activated per input (1 or 2)
args.min_capacity	`int`	Minimum expert capacity
args.noisy_gate_policy	`str` or `None`	Noise policy: None, "RSample", or "Jitter"

Architecture Diagrams

Dense Path (args.moe=False)

Input: (B, 3, 32, 32)
    |
[Conv2d(3,6,5)] --> ReLU --> [MaxPool2d(2,2)]     (B, 6, 14, 14)
    |
[Conv2d(6,16,5)] --> ReLU --> [MaxPool2d(2,2)]    (B, 16, 5, 5)
    |
[Flatten]                                          (B, 400)
    |
[Linear(400,120)] --> ReLU                         (B, 120)
    |
[Linear(120,84)] --> ReLU                          (B, 84)
    |
[Linear(84,10)]  -- fc3                            (B, 10) logits
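
The spatial sizes in the diagram follow from the usual output-size formula for an unpadded convolution or pooling window, floor((size + 2*padding - kernel) / stride) + 1. The sketch below reproduces them in plain Python; conv_out and backbone_shapes are illustrative helper names, not functions from the repository.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def backbone_shapes(side: int = 32):
    """Trace (channels, height, width) through the two conv + pool stages."""
    shapes = []
    side = conv_out(side, kernel=5)            # Conv2d(3, 6, 5):   32 -> 28
    side = conv_out(side, kernel=2, stride=2)  # MaxPool2d(2, 2):   28 -> 14
    shapes.append((6, side, side))
    side = conv_out(side, kernel=5)            # Conv2d(6, 16, 5):  14 -> 10
    side = conv_out(side, kernel=2, stride=2)  # MaxPool2d(2, 2):   10 -> 5
    shapes.append((16, side, side))
    return shapes


shapes = backbone_shapes()
channels, height, width = shapes[-1]
flat_features = channels * height * width  # input width of fc1
print(shapes, flat_features)
```

The flattened size of 400 is why fc1 is declared as nn.Linear(16 * 5 * 5, 120) and why forward() reshapes with x.view(-1, 16 * 5 * 5).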

MoE Path (args.moe=True, num_experts=[2])

Input: (B, 3, 32, 32)
    |
[Conv2d(3,6,5)] --> ReLU --> [MaxPool2d(2,2)]     (B, 6, 14, 14)
    |
[Conv2d(6,16,5)] --> ReLU --> [MaxPool2d(2,2)]    (B, 16, 5, 5)
    |
[Flatten]                                          (B, 400)
    |
[Linear(400,120)] --> ReLU                         (B, 120)
    |
[Linear(120,84)] --> ReLU                          (B, 84)
    |
[MoE Layer]                                        (B, 84)
    |--- Gate: softmax(W_g * x) --> top-k routing
    |--- Expert 0: Linear(84, 84)
    |--- Expert 1: Linear(84, 84)
    |--- Output: weighted sum of top-k expert outputs
    |
[Linear(84,10)]  -- fc4                            (B, 10) logits
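
The gate and routing steps in the diagram can be illustrated for a single input vector without any framework. This is a deliberately simplified sketch: it omits expert capacity limits, gate noise, and any renormalization of the selected weights, all of which the real DeepSpeed gate may apply. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, expert_outputs, k=1):
    """Combine the outputs of the k highest-scoring experts for one input.

    gate_logits: one score per expert (stands in for W_g * x).
    expert_outputs: one output vector per expert, all of equal length.
    """
    probs = softmax(gate_logits)
    # Indices of the k experts with the highest gate probability.
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    combined = [0.0] * len(expert_outputs[0])
    for e in chosen:
        for j, value in enumerate(expert_outputs[e]):
            combined[j] += probs[e] * value  # weight by raw gate probability
    return chosen, combined


# Two experts with toy 2-dimensional outputs; expert 0 has the higher score.
chosen, combined = route_top_k([2.0, 0.5], [[1.0, 0.0], [0.0, 1.0]], k=1)
print(chosen, combined)
```

With k=1 only expert 0 contributes, so the second output component stays at zero; with k=2 both experts would be mixed according to their gate probabilities.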

Pyramid Residual MoE (num_experts=[2, 4])

...
[Linear(120,84)] --> ReLU                          (B, 84)
    |
[MoE Layer 0] -- 2 experts, residual               (B, 84)
    |--- coef[0] * MoE output + coef[1] * Dense(x), coef = softmax over 2 learned logits
    |
[MoE Layer 1] -- 4 experts, residual               (B, 84)
    |--- coef[0] * MoE output + coef[1] * Dense(x), coef = softmax over 2 learned logits
    |
[Linear(84,10)]  -- fc4                            (B, 10) logits
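
The residual combination can be sketched for one input vector as follows. The sketch assumes the layer mixes the MoE branch and the dense branch with a two-way softmax over learned coefficient logits; residual_combine is a hypothetical name, and the logits are passed in directly rather than computed from x.

```python
import math


def residual_combine(moe_out, dense_out, coef_logits):
    """Mix the MoE and dense branch outputs with softmax coefficients.

    coef_logits: two scores (assumed to come from a small learned
    projection of the input); softmax turns them into mixing weights.
    """
    peak = max(coef_logits)
    exps = [math.exp(v - peak) for v in coef_logits]
    total = sum(exps)
    w_moe, w_dense = (e / total for e in exps)
    return [w_moe * a + w_dense * b for a, b in zip(moe_out, dense_out)]


# Equal logits give equal weights, so the result is the element-wise mean.
mixed = residual_combine([1.0, 2.0], [3.0, 4.0], coef_logits=[0.0, 0.0])
print(mixed)
```

Because the weights sum to one, the dense branch acts as a fallback whose share grows whenever its coefficient logit is larger than the MoE branch's.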

MoE Layer Configuration

Parameter	Value	Source
hidden_size	84	Matches fc2 output dimension
expert	nn.Linear(84, 84)	Expert module; input and output are both 84-dimensional
num_experts	args.num_experts[i]	Per-layer expert count from CLI
ep_size	args.ep_world_size	Expert parallel group size
use_residual	args.mlp_type == "residual"	Residual MoE toggle
k	args.top_k	Top-k gating (1 or 2)
min_capacity	args.min_capacity	Minimum tokens per expert
noisy_gate_policy	args.noisy_gate_policy	None, RSample, or Jitter
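
The interaction between num_experts and ep_size can be checked ahead of a launch with a few lines of arithmetic. The sketch assumes that each MoE layer's expert count must be divisible by ep_size and that the world size must be divisible by ep_size; DeepSpeed versions may differ in how strictly they enforce or adjust this. expert_layout is a hypothetical helper.

```python
def expert_layout(num_experts: int, ep_size: int, world_size: int) -> dict:
    """Return how experts would be spread across ranks for one MoE layer."""
    if num_experts % ep_size != 0:
        raise ValueError(
            f"num_experts={num_experts} is not divisible by ep_size={ep_size}"
        )
    if world_size % ep_size != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by ep_size={ep_size}"
        )
    return {
        # Experts held by each rank inside an expert-parallel group.
        "local_experts": num_experts // ep_size,
        # Number of expert-parallel groups, i.e. copies of each expert.
        "expert_replicas": world_size // ep_size,
    }


# Pyramid configuration from the examples: --num-experts 2 4 --ep-world-size 2
layouts = [expert_layout(n, ep_size=2, world_size=2) for n in (2, 4)]
print(layouts)
```

For the two-GPU pyramid configuration, each rank would hold one expert of the first MoE layer and two experts of the second, with a single replica of each expert.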

MoE Forward Pass Details

The MoE layer forward call returns a 3-tuple:

# In forward():
if self.moe:
    for layer in self.moe_layer_list:
        x, _, _ = layer(x)  # Returns (output, gate_loss, expert_count)
    x = self.fc4(x)

The gate_loss (auxiliary load balancing loss) and expert_count (expert utilization stats) are discarded in this example. In production MoE training, the gate loss is typically added to the main training loss to encourage balanced expert utilization.
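
To show what keeping the gate loss could look like, the sketch below computes a common load-balancing term (number of experts times the sum, over experts, of mean gate probability multiplied by the fraction of inputs routed to that expert) and adds it to a task loss. The formula is the widely used top-1 balancing loss and is not claimed to match DeepSpeed's exact implementation; aux_coef=0.01 is an arbitrary illustrative weight.

```python
def load_balance_loss(gate_probs):
    """Top-1 load-balancing loss for a batch of gate probability rows.

    gate_probs: one list of expert probabilities per input (rows sum to 1).
    Returns 1.0 for perfectly balanced routing and grows as routing skews.
    """
    num_inputs = len(gate_probs)
    num_experts = len(gate_probs[0])
    # Mean gate probability assigned to each expert.
    mean_prob = [
        sum(row[e] for row in gate_probs) / num_inputs for e in range(num_experts)
    ]
    # Fraction of inputs whose top-1 choice is each expert.
    counts = [0] * num_experts
    for row in gate_probs:
        counts[row.index(max(row))] += 1
    frac = [c / num_inputs for c in counts]
    return num_experts * sum(m * f for m, f in zip(mean_prob, frac))


def total_loss(task_loss, gate_losses, aux_coef=0.01):
    """Add the weighted gate losses of all MoE layers to the task loss."""
    return task_loss + aux_coef * sum(gate_losses)


balanced = load_balance_loss([[0.9, 0.1], [0.1, 0.9]])  # one input per expert
skewed = load_balance_loss([[0.9, 0.1], [0.8, 0.2]])    # both go to expert 0
print(balanced, skewed, total_loss(2.0, [balanced]))
```

In Net.forward() this would mean collecting the second element of each layer's return value instead of discarding it, and returning or storing the sum so the training loop can add it to the criterion output.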

Usage Examples

# Standard dense mode (no MoE)
deepspeed cifar10_deepspeed.py --deepspeed

# MoE with 2 experts, top-1 gating
deepspeed --num_gpus=2 cifar10_deepspeed.py --deepspeed \
    --moe --num-experts 2 --top-k 1 --ep-world-size 2 --moe-param-group

# Pyramid Residual MoE with 2 and 4 experts
deepspeed --num_gpus=2 cifar10_deepspeed.py --deepspeed \
    --moe --num-experts 2 4 --top-k 1 --mlp-type residual \
    --ep-world-size 2 --noisy-gate-policy RSample --moe-param-group
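
The flags in these commands map onto the args fields that Net reads. The parser below is a sketch of that mapping using only argparse; the flag names come from the commands above, while the defaults and types are assumptions and may not match the script exactly.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the MoE-related CLI flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="MoE flags sketch")
    parser.add_argument("--moe", action="store_true")
    # nargs="+" turns "--num-experts 2 4" into the list [2, 4].
    parser.add_argument("--num-experts", type=int, nargs="+", default=[1])
    parser.add_argument("--ep-world-size", type=int, default=1)
    parser.add_argument("--mlp-type", type=str, default="standard")
    parser.add_argument("--top-k", type=int, default=1)
    parser.add_argument("--min-capacity", type=int, default=0)
    parser.add_argument("--noisy-gate-policy", type=str, default=None)
    parser.add_argument("--moe-param-group", action="store_true")
    return parser


# Flags from the Pyramid Residual MoE command, parsed from a string.
args = build_parser().parse_args(
    "--moe --num-experts 2 4 --top-k 1 --mlp-type residual "
    "--ep-world-size 2 --noisy-gate-policy RSample --moe-param-group".split()
)
use_residual = args.mlp_type == "residual"  # same test Net.__init__ applies
print(args.num_experts, use_residual)
```

argparse converts the dashes in flag names to underscores, which is why --num-experts is read as args.num_experts inside the model.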

Training Loop Integration

After initialization, the training loop using the MoE model is identical to the dense case:

for epoch in range(args.epochs):
    running_loss = 0.0
    for i, data in enumerate(trainloader):
        inputs, labels = data[0].to(local_device), data[1].to(local_device)
        if target_dtype is not None:
            inputs = inputs.to(target_dtype)

        outputs = model_engine(inputs)       # MoE routing happens inside forward()
        loss = criterion(outputs, labels)

        model_engine.backward(loss)          # Gradients flow through MoE + experts
        model_engine.step()                  # Updates expert + non-expert params

Expert dispatch uses all-to-all communication inside each MoE layer's forward pass, and the DeepSpeed engine handles the separate gradient reduction for expert and non-expert parameters.

Related Pages

Principle:Microsoft_DeepSpeedExamples_DeepSpeed_MoE_Training -- The principle this implementation realizes
Implementation:Microsoft_DeepSpeedExamples_Net_Tutorial -- Baseline version without MoE
Implementation:Microsoft_DeepSpeedExamples_Add_Argument_CIFAR -- CLI arguments that configure MoE
Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Initialize_CIFAR -- Engine initialization with MoE param groups
Implementation:Microsoft_DeepSpeedExamples_Test_Function_CIFAR -- Evaluating the MoE model
Environment:Microsoft_DeepSpeedExamples_CIFAR10_Training_Environment
