Implementation:Microsoft DeepSpeedExamples Net DeepSpeed

From Leeroopedia
Revision as of 15:41, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_DeepSpeedExamples_Net_DeepSpeed.md)


Metadata

Field       | Value
Page Type   | Implementation
Repository  | Microsoft/DeepSpeedExamples
Title       | Net_DeepSpeed
Type        | Class Doc
Source File | training/cifar/cifar10_deepspeed.py
Lines       | 166-209
Implements  | Principle:Microsoft_DeepSpeedExamples_DeepSpeed_MoE_Training

Overview

Net is a small CNN for CIFAR-10 classification whose final layer can optionally be replaced by one or more DeepSpeed MoE layers.

Description

The Net class in cifar10_deepspeed.py extends the baseline CIFAR-10 CNN with optional Mixture of Experts (MoE) support. It shares the same convolutional backbone as the tutorial version (two conv layers + pooling + two FC layers) but conditionally replaces the final classification layer with a DeepSpeed MoE layer followed by a separate output projection.

The key architectural decision is controlled by args.moe:

  • When args.moe=False (default): The model is identical to the baseline -- fc3 = nn.Linear(84, 10) maps directly to class logits. This is the standard dense path.
  • When args.moe=True: The final layer is replaced with:
    • One or more deepspeed.moe.layer.MoE layers, each wrapping an nn.Linear(84, 84) expert network
    • A final fc4 = nn.Linear(84, 10) projection to class logits

The MoE layers are stored in an nn.ModuleList to support multiple MoE layers in sequence (controlled by the --num-experts list argument). Each entry in args.num_experts creates one MoE layer with that many experts.
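The flags that drive this construction could be declared roughly as follows. This is a minimal sketch, not the script's actual parser: the flag names --moe, --num-experts, --top-k, --ep-world-size, --mlp-type, and --noisy-gate-policy come from the usage examples on this page, while --min-capacity, all defaults, and the choices lists are assumptions made for illustration.

```python
import argparse


def build_moe_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the MoE-related fields read by Net."""
    parser = argparse.ArgumentParser(description="CIFAR-10 MoE options (sketch)")
    parser.add_argument("--moe", action="store_true", help="enable MoE layers")
    # nargs="+" turns `--num-experts 2 4` into [2, 4]: one MoE layer per entry.
    parser.add_argument("--num-experts", type=int, nargs="+", default=[1])
    parser.add_argument("--ep-world-size", type=int, default=1)
    parser.add_argument("--mlp-type", choices=["standard", "residual"], default="standard")
    parser.add_argument("--top-k", type=int, default=1)
    # Flag name and default assumed by analogy with the other flags.
    parser.add_argument("--min-capacity", type=int, default=0)
    parser.add_argument("--noisy-gate-policy", choices=["RSample", "Jitter"], default=None)
    return parser


args = build_moe_parser().parse_args(
    ["--moe", "--num-experts", "2", "4", "--mlp-type", "residual"]
)
print(args.moe, args.num_experts, args.mlp_type)  # True [2, 4] residual
```

Because --num-experts is parsed as a list, the loop in Net.__init__ creates exactly one MoE layer per value given on the command line.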

Code Reference

File: training/cifar/cifar10_deepspeed.py, Lines 166-209

class Net(nn.Module):
    def __init__(self, args):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.moe = args.moe
        if self.moe:
            fc3 = nn.Linear(84, 84)
            self.moe_layer_list = []
            for n_e in args.num_experts:
                # Create moe layers based on the number of experts.
                self.moe_layer_list.append(
                    deepspeed.moe.layer.MoE(
                        hidden_size=84,
                        expert=fc3,
                        num_experts=n_e,
                        ep_size=args.ep_world_size,
                        use_residual=args.mlp_type == "residual",
                        k=args.top_k,
                        min_capacity=args.min_capacity,
                        noisy_gate_policy=args.noisy_gate_policy,
                    )
                )
            self.moe_layer_list = nn.ModuleList(self.moe_layer_list)
            self.fc4 = nn.Linear(84, 10)
        else:
            self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        if self.moe:
            for layer in self.moe_layer_list:
                x, _, _ = layer(x)
            x = self.fc4(x)
        else:
            x = self.fc3(x)
        return x

Signature

class Net(nn.Module):
    def __init__(self, args: argparse.Namespace):
        """CNN for CIFAR-10 with optional MoE layers.

        Args:
            args: Parsed arguments. Relevant fields:
                - args.moe (bool): Enable MoE layers
                - args.num_experts (list[int]): Number of experts per MoE layer
                - args.ep_world_size (int): Expert parallel world size
                - args.mlp_type (str): "standard" or "residual"
                - args.top_k (int): Top-k gating (1 or 2)
                - args.min_capacity (int): Minimum expert capacity
                - args.noisy_gate_policy (str or None): Noise policy for gating
        """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Args:
            x: Input tensor of shape (B, 3, 32, 32).

        Returns:
            Logits tensor of shape (B, 10).
        """

I/O Contract

Direction | Name   | Type         | Shape          | Description
Input     | x      | torch.Tensor | (B, 3, 32, 32) | Batch of normalized CIFAR-10 images
Output    | logits | torch.Tensor | (B, 10)        | Raw class scores for 10 classes

Constructor Input:

Parameter              | Type        | Description
args.moe               | bool        | Whether to enable MoE layers
args.num_experts       | list[int]   | Number of experts for each MoE layer (e.g., [2] or [2, 4])
args.ep_world_size     | int         | Expert parallel group size
args.mlp_type          | str         | "standard" or "residual" MoE mode
args.top_k             | int         | Number of experts activated per input (1 or 2)
args.min_capacity      | int         | Minimum expert capacity
args.noisy_gate_policy | str or None | Noise policy: None, "RSample", or "Jitter"

Architecture Diagrams

Dense Path (args.moe=False)

Input: (B, 3, 32, 32)
    |
[Conv2d(3,6,5)] --> ReLU --> [MaxPool2d(2,2)]     (B, 6, 14, 14)
    |
[Conv2d(6,16,5)] --> ReLU --> [MaxPool2d(2,2)]    (B, 16, 5, 5)
    |
[Flatten]                                          (B, 400)
    |
[Linear(400,120)] --> ReLU                         (B, 120)
    |
[Linear(120,84)] --> ReLU                          (B, 84)
    |
[Linear(84,10)]  -- fc3                            (B, 10) logits
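The spatial sizes in the diagram follow from the standard output-size formula for unpadded convolutions and pooling windows. The sketch below reproduces that arithmetic in plain Python; the helper name conv_out is hypothetical and is not part of the source file.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after a conv or pooling window (floor division, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32                                   # CIFAR-10 images are 32x32
size = conv_out(size, kernel=5)             # conv1: 32 -> 28
size = conv_out(size, kernel=2, stride=2)   # pool:  28 -> 14
after_pool1 = size
size = conv_out(size, kernel=5)             # conv2: 14 -> 10
size = conv_out(size, kernel=2, stride=2)   # pool:  10 -> 5
flat_features = 16 * size * size            # 16 channels * 5 * 5

print(after_pool1, size, flat_features)  # 14 5 400
```

The final value, 400, is the input width of fc1 and the reason forward() reshapes with x.view(-1, 16 * 5 * 5).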

MoE Path (args.moe=True, num_experts=[2])

Input: (B, 3, 32, 32)
    |
[Conv2d(3,6,5)] --> ReLU --> [MaxPool2d(2,2)]     (B, 6, 14, 14)
    |
[Conv2d(6,16,5)] --> ReLU --> [MaxPool2d(2,2)]    (B, 16, 5, 5)
    |
[Flatten]                                          (B, 400)
    |
[Linear(400,120)] --> ReLU                         (B, 120)
    |
[Linear(120,84)] --> ReLU                          (B, 84)
    |
[MoE Layer]                                        (B, 84)
    |--- Gate: softmax(W_g * x) --> top-k routing
    |--- Expert 0: Linear(84, 84)
    |--- Expert 1: Linear(84, 84)
    |--- Output: weighted sum of top-k expert outputs
    |
[Linear(84,10)]  -- fc4                            (B, 10) logits
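The gate-then-combine step in the diagram can be illustrated with a toy, pure-Python version for a single input vector. This is a sketch of the general top-k routing idea only, not DeepSpeed's implementation: it takes the gate logits as given (in the real layer they come from a learned projection of x), uses the raw softmax probabilities as combine weights, and ignores capacity limits, noise policies, and expert parallelism. All function names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Return (expert_index, gate_probability) pairs for the k most probable experts."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


def moe_forward(x, gate_logits, experts, k=1):
    """Weighted sum of the selected experts' outputs for a single input vector."""
    out = [0.0] * len(x)
    for index, weight in top_k_route(gate_logits, k):
        expert_out = experts[index](x)
        out = [acc + weight * value for acc, value in zip(out, expert_out)]
    return out


# Two toy experts standing in for the Linear(84, 84) experts in the diagram.
experts = [lambda v: [2.0 * a for a in v], lambda v: [a + 1.0 for a in v]]
routes = top_k_route([0.1, 2.0], k=1)   # expert 1 has the larger logit
y = moe_forward([1.0, 2.0], [0.1, 2.0], experts, k=1)
```

With k=1 only expert 1 runs, so y is that expert's output scaled by its gate probability. Input and output keep the same length, which is why the MoE layers can be stacked before fc4.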

Pyramid Residual MoE (num_experts=[2, 4])

...
[Linear(120,84)] --> ReLU                          (B, 84)
    |
[MoE Layer 0] -- 2 experts, residual               (B, 84)
    |--- MoE output + coefficient * Dense(x)
    |
[MoE Layer 1] -- 4 experts, residual               (B, 84)
    |--- MoE output + coefficient * Dense(x)
    |
[Linear(84,10)]  -- fc4                            (B, 10) logits
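The residual blend in each layer can be sketched as a learned two-way weighting between the MoE output and a dense branch. The version below assumes the two weights come from a softmax over a pair of coefficient logits, which is one reading of DeepSpeed's residual mode; treat the exact formula as an assumption and check the DeepSpeed source before relying on it. The function name residual_combine is hypothetical.

```python
import math


def residual_combine(moe_out, dense_out, coef_logits):
    """Blend MoE and dense outputs with a 2-way softmax over coef_logits."""
    peak = max(coef_logits)
    exps = [math.exp(v - peak) for v in coef_logits]
    total = sum(exps)
    w_moe, w_dense = (e / total for e in exps)
    return [w_moe * m + w_dense * d for m, d in zip(moe_out, dense_out)]


# Equal logits give equal weights, so the result is the element-wise mean.
blended = residual_combine([1.0, 3.0], [3.0, 5.0], coef_logits=[0.0, 0.0])
print(blended)  # [2.0, 4.0]
```

Under this assumption both branches are rescaled, so the dense branch is not simply added on top of an unscaled MoE output.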

MoE Layer Configuration

Parameter         | Value                       | Source
hidden_size       | 84                          | Matches fc2 output dimension
expert            | nn.Linear(84, 84)           | Identity-dimension expert
num_experts       | args.num_experts[i]         | Per-layer expert count from CLI
ep_size           | args.ep_world_size          | Expert parallel group size
use_residual      | args.mlp_type == "residual" | Residual MoE toggle
k                 | args.top_k                  | Top-k gating (1 or 2)
min_capacity      | args.min_capacity           | Minimum tokens per expert
noisy_gate_policy | args.noisy_gate_policy      | None, RSample, or Jitter
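The mapping in the table can be expressed as a small helper that builds one keyword-argument dictionary per MoE layer. This is an illustrative sketch: moe_layer_kwargs is a hypothetical name, the expert module is left out because it requires PyTorch, and the divisibility check reflects an assumed DeepSpeed requirement that the expert count split evenly across the expert-parallel group.

```python
from types import SimpleNamespace


def moe_layer_kwargs(args, hidden_size=84):
    """Build one kwargs dict per MoE layer, mirroring the configuration table."""
    layers = []
    for n_e in args.num_experts:
        # Assumption: experts must divide evenly across the expert-parallel group.
        if n_e % args.ep_world_size != 0:
            raise ValueError(
                f"num_experts={n_e} is not divisible by ep_size={args.ep_world_size}"
            )
        layers.append(dict(
            hidden_size=hidden_size,
            num_experts=n_e,
            ep_size=args.ep_world_size,
            use_residual=args.mlp_type == "residual",
            k=args.top_k,
            min_capacity=args.min_capacity,
            noisy_gate_policy=args.noisy_gate_policy,
        ))
    return layers


# Values matching the Pyramid Residual MoE usage example; min_capacity is assumed.
args = SimpleNamespace(num_experts=[2, 4], ep_world_size=2, mlp_type="residual",
                       top_k=1, min_capacity=0, noisy_gate_policy="RSample")
configs = moe_layer_kwargs(args)
```

Each dictionary could then be passed to the MoE constructor together with expert=fc3, as the loop in Net.__init__ does inline.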

MoE Forward Pass Details

The MoE layer forward call returns a 3-tuple:

# In forward():
if self.moe:
    for layer in self.moe_layer_list:
        x, _, _ = layer(x)  # Returns (output, gate_loss, expert_count)
    x = self.fc4(x)

The gate_loss (auxiliary load balancing loss) and expert_count (expert utilization stats) are discarded in this example. In production MoE training, the gate loss is typically added to the main training loss to encourage balanced expert utilization.
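If the gate losses were kept rather than discarded, they could be folded into the objective as in the sketch below. This is not what the example script does. The helper name total_loss and the coefficient value 0.01 are hypothetical, and the function uses plain arithmetic, so it is shown here with floats.

```python
def total_loss(task_loss, gate_losses, aux_coef=0.01):
    """Add the summed auxiliary gate losses, scaled by aux_coef, to the task loss."""
    return task_loss + aux_coef * sum(gate_losses)


# In forward(), each layer's second return value would be collected, e.g.:
#     x, l_aux, _ = layer(x)
#     gate_losses.append(l_aux)
loss = total_loss(task_loss=1.25, gate_losses=[0.5, 1.5])
print(round(loss, 4))  # 1.27
```

Adopting this would also require forward() to return the collected gate losses alongside the logits, which changes the model's output contract.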

Usage Examples

# Standard dense mode (no MoE)
deepspeed cifar10_deepspeed.py --deepspeed

# MoE with 2 experts, top-1 gating
deepspeed --num_gpus=2 cifar10_deepspeed.py --deepspeed \
    --moe --num-experts 2 --top-k 1 --ep-world-size 2 --moe-param-group

# Pyramid Residual MoE with 2 and 4 experts
deepspeed --num_gpus=2 cifar10_deepspeed.py --deepspeed \
    --moe --num-experts 2 4 --top-k 1 --mlp-type residual \
    --ep-world-size 2 --noisy-gate-policy RSample --moe-param-group

Training Loop Integration

After initialization, the training loop using the MoE model is identical to the dense case:

for epoch in range(args.epochs):
    running_loss = 0.0
    for i, data in enumerate(trainloader):
        inputs, labels = data[0].to(local_device), data[1].to(local_device)
        if target_dtype is not None:
            inputs = inputs.to(target_dtype)

        outputs = model_engine(inputs)       # MoE routing happens inside forward()
        loss = criterion(outputs, labels)

        model_engine.backward(loss)          # Gradients flow through MoE + experts
        model_engine.step()                  # Updates expert + non-expert params

The all-to-all communication that dispatches inputs to experts runs inside each MoE layer's forward pass, while the DeepSpeed engine handles gradient reduction separately for expert and non-expert parameters.
