
Principle:Microsoft DeepSpeedExamples DeepSpeed MoE Training

From Leeroopedia


Metadata

Field | Value
Page Type | Principle
Repository | Microsoft/DeepSpeedExamples
Title | DeepSpeed_MoE_Training
Sources | Paper: Switch Transformer: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity; Paper: GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding; Doc: DeepSpeed MoE Tutorial
Domains | Deep_Learning, Model_Architecture, Distributed_Training
Related Implementation | Implementation:Microsoft_DeepSpeedExamples_Net_DeepSpeed

Overview

A model architecture technique that replaces dense feed-forward layers with Mixture of Experts (MoE) layers to increase model capacity without a proportional increase in compute.

Description

Mixture of Experts (MoE) is a conditional computation technique that scales model capacity (parameter count) while keeping the per-sample computational cost approximately constant. Instead of routing every input through every parameter, MoE uses a learned gating mechanism to route each input to a subset of specialized "expert" sub-networks.

In the CIFAR-10 DeepSpeed example, MoE layers are inserted into the fully connected head of the classifier: each expert is an nn.Linear(84, 84), and a final linear layer maps the 84-dimensional MoE output to the 10 classes. This demonstrates the core MoE integration pattern:

  1. Expert Networks -- Multiple copies of a sub-network (e.g., nn.Linear(84, 84)) serve as experts. Each expert learns to specialize on different parts of the input distribution.
  2. Gating Network -- A learned router (typically a linear layer + softmax) assigns input tokens to experts based on a routing score.
  3. Top-K Selection -- Only the top-k scoring experts process each input, keeping expert compute per input roughly constant regardless of the total number of experts.
  4. Expert Parallelism -- Experts can be distributed across GPUs (expert parallel world), allowing the model to have more experts than fit on a single device.

DeepSpeed provides deepspeed.moe.layer.MoE as a drop-in replacement for dense layers, handling gating, expert dispatch, load balancing, and expert parallelism transparently.

Theoretical Basis

Gating Mechanism

The gating function computes routing scores for each expert:

G(x) = TopK(softmax(W_g * x + noise))

where:

  • x is the input tensor
  • W_g is the learnable gating weight matrix (shape: [hidden_size, num_experts])
  • noise is optional noise injected for load balancing (controlled by noisy_gate_policy)
  • TopK selects the k highest-scoring experts
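
The gating formula above can be sketched in plain Python. This is an illustration only, not DeepSpeed's implementation: the function names `softmax` and `top_k_gate` and the `noise` argument are hypothetical, and the sketch follows the formula literally (noise is added to the logits, then softmax, then top-k).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(x, w_g, k=1, noise=None):
    """Return {expert_index: gate_weight} for the k highest-scoring experts.

    x     : input vector, a list of hidden_size floats
    w_g   : gating weights, hidden_size rows by num_experts columns
    noise : optional per-expert noise added to the logits (hypothetical hook)
    """
    num_experts = len(w_g[0])
    # logits[e] = sum_h x[h] * W_g[h][e]
    logits = [sum(x[h] * w_g[h][e] for h in range(len(x)))
              for e in range(num_experts)]
    if noise is not None:
        logits = [l + n for l, n in zip(logits, noise)]
    probs = softmax(logits)
    # Keep only the k experts with the highest routing probability.
    top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:k]
    return {e: probs[e] for e in top}

x = [1.0, 2.0]
w_g = [[0.5, -0.5, 0.0],
       [0.1, 0.3, -0.2]]
gates = top_k_gate(x, w_g, k=2)  # selects experts 0 and 1
```

Passing a large noise value for one expert changes which expert is selected, which is the exploration effect that noisy gating relies on.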

Expert Output Computation

The final output is a weighted sum of the selected expert outputs:

y = sum_{i in TopK} G(x)_i * E_i(x)

where:

  • G(x)_i is the gating weight for expert i
  • E_i(x) is the output of expert i applied to input x
  • The sum is only over the top-k experts (typically k=1 or k=2)
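
As a minimal sketch of the weighted sum above, assuming experts are plain Python callables that preserve the input dimension and that the gate weights have already been computed (the name `moe_output` is hypothetical):

```python
def moe_output(x, experts, gates):
    """Compute y = sum_i gates[i] * experts[i](x) over the selected experts.

    x       : input vector (list of floats)
    experts : list of callables, each mapping a vector to a vector of equal length
    gates   : {expert_index: gate_weight} for the top-k experts only
    """
    y = [0.0] * len(x)
    for i, g in gates.items():
        out = experts[i](x)  # only selected experts are evaluated
        y = [acc + g * o for acc, o in zip(y, out)]
    return y

experts = [
    lambda v: [2.0 * t for t in v],  # expert 0
    lambda v: [t + 1.0 for t in v],  # expert 1
    lambda v: [-t for t in v],       # expert 2, not selected below
]
gates = {0: 0.75, 1: 0.25}
y = moe_output([1.0, 2.0], experts, gates)
# y == [2.0, 3.75]: 0.75 * [2, 4] + 0.25 * [2, 3]
```

Expert 2 is never called, which is where the compute saving comes from.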

Load Balancing

Without intervention, MoE models tend to collapse -- routing most inputs to a small subset of experts. Several strategies address this:

Strategy | Description | DeepSpeed Parameter
Noisy Gating (RSample) | Add Gumbel noise to the gating logits when selecting experts, encouraging exploration | noisy_gate_policy='RSample'
Noisy Gating (Jitter) | Multiply inputs by uniform noise before gating | noisy_gate_policy='Jitter'
Capacity Factor | Limit the maximum number of tokens each expert can process; min_capacity sets a lower bound on that limit | capacity_factor, min_capacity
Auxiliary Loss | Add a loss term that penalizes unbalanced expert utilization | Computed inside the gate and returned by the layer as gate_loss
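
Two of these strategies reduce to short formulas. The sketch below assumes a common capacity rule (tokens per expert, scaled by a capacity factor, with min_capacity as a floor) and a Switch Transformer-style auxiliary loss. The function names are hypothetical, and the exact formulas inside DeepSpeed may differ in detail.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.0, min_capacity=0):
    """Maximum tokens one expert may accept in a batch (assumed rule)."""
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    return max(capacity, min_capacity)

def balance_loss(gate_probs, assignments, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_e(mean_prob_e * token_fraction_e).

    gate_probs  : per-token list of routing probabilities over experts
    assignments : per-token index of the expert actually chosen (top-1)
    """
    n = len(gate_probs)
    mean_prob = [sum(p[e] for p in gate_probs) / n for e in range(num_experts)]
    fraction = [sum(1 for a in assignments if a == e) / n for e in range(num_experts)]
    return num_experts * sum(m * f for m, f in zip(mean_prob, fraction))

# Balanced routing gives the minimum value of 1.0.
balanced = balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
# Collapsed routing (every token sent to expert 0) is penalized.
collapsed = balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
```

With 8 tokens and 4 experts, `expert_capacity(8, 4, 1.25)` allows 3 tokens per expert; tokens beyond an expert's capacity are typically dropped or passed through unchanged.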

Residual MoE

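The residual MoE variant (mlp_type='residual' in the example, passed to the layer as use_residual=True) runs a dense layer in parallel with the MoE layer and mixes the two outputs:

y = c_moe(x) * MoE(x) + c_dense(x) * Dense(x)

This provides a fallback path through a dense layer, improving training stability when expert routing is suboptimal. The coefficients are computed per input as a two-way softmax over a learned linear projection of x, so c_moe(x) + c_dense(x) = 1.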
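
A minimal sketch of one way to realize this combination, assuming the two mixing weights come from a softmax over a learned linear projection of the input. The projection parameters `w_c` and `b_c` and the function name `residual_moe` are hypothetical, and the MoE and dense outputs are taken as given.

```python
import math

def residual_moe(x, moe_out, dense_out, w_c, b_c):
    """Mix MoE and dense outputs with input-dependent weights that sum to 1.

    w_c : 2 x hidden_size projection (row 0 scores the MoE path, row 1 the dense path)
    b_c : 2 biases
    """
    logits = [sum(w * t for w, t in zip(row, x)) + b for row, b in zip(w_c, b_c)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    c_moe, c_dense = (e / total for e in exps)
    return [c_moe * a + c_dense * b for a, b in zip(moe_out, dense_out)]

# With a zero projection both paths get weight 0.5, so y is their average.
y = residual_moe(
    x=[1.0, 2.0],
    moe_out=[4.0, 0.0],
    dense_out=[0.0, 2.0],
    w_c=[[0.0, 0.0], [0.0, 0.0]],
    b_c=[0.0, 0.0],
)
```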

Expert Parallelism

Expert parallelism distributes experts across GPUs within a group:

GPU 0: Expert 0         GPU 1: Expert 1
    \                      /
     +--- AllToAll comm ---+
    /                      \
GPU 0: Process tokens    GPU 1: Process tokens
     for Expert 0              for Expert 1

The ep_world_size argument (passed to the layer as ep_size) controls the size of the expert parallel group, and num_experts must be divisible by it. With ep_world_size=2 and 2 experts, each GPU holds one expert. With ep_world_size=1, every GPU holds a replica of all experts.
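
The placement described above can be sketched as a small helper. It assumes experts are assigned to ranks in contiguous blocks and that a rank's position inside its expert parallel group is `rank % ep_size`. Both are illustrative assumptions, and `local_experts` is a hypothetical name.

```python
def local_experts(rank, num_experts, ep_size):
    """Return the expert indices held by a given global rank (assumed layout)."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size  # experts stored on each GPU
    ep_rank = rank % ep_size           # position within the expert parallel group
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))

# ep_size=2 with 2 experts: one expert per GPU.
placement = [local_experts(r, num_experts=2, ep_size=2) for r in range(2)]
# ep_size=1: every GPU holds all experts.
replicated = local_experts(3, num_experts=4, ep_size=1)
```

When the world size exceeds ep_size, ranks with the same position in different groups hold copies of the same experts and act as data parallel replicas for them.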

MoE vs Dense Comparison

Aspect | Dense Network | MoE Network
Parameters | Fixed (all active) | Scaled by num_experts (sparse activation)
FLOPs per input | Proportional to total parameters | Proportional to k experts only
Memory | All parameters on each GPU | Distributed via expert parallelism
Training complexity | Standard backprop | Requires load balancing + expert routing
Communication | AllReduce gradients | AllToAll for expert dispatch + AllReduce for non-expert params
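
To make the parameter and compute rows concrete, the arithmetic below counts parameters for experts shaped like nn.Linear(84, 84). It assumes a bias-free linear gate of shape [hidden_size, num_experts], and the helper names are hypothetical.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a linear layer."""
    return in_features * out_features + (out_features if bias else 0)

def moe_param_summary(hidden_size, num_experts, k):
    """Total vs. per-input active parameters for a linear-expert MoE layer."""
    per_expert = linear_params(hidden_size, hidden_size)
    gate = linear_params(hidden_size, num_experts, bias=False)  # assumed gate shape
    return {
        "per_expert": per_expert,
        "total": num_experts * per_expert + gate,
        "active_per_input": k * per_expert + gate,
    }

summary = moe_param_summary(hidden_size=84, num_experts=4, k=1)
# {'per_expert': 7140, 'total': 28896, 'active_per_input': 7476}
```

Going from 4 to 8 experts roughly doubles `total` while `active_per_input` grows only by the extra gate columns.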

DeepSpeed MoE Layer API

import torch.nn as nn
import deepspeed

moe_layer = deepspeed.moe.layer.MoE(
    hidden_size=84,                          # Input/output dimension
    expert=nn.Linear(84, 84),               # Expert module (cloned for each expert)
    num_experts=4,                           # Total number of experts
    ep_size=2,                               # Expert parallel world size (must divide num_experts)
    use_residual=False,                      # Enable residual MoE
    k=1,                                     # Top-k routing
    min_capacity=0,                          # Minimum expert capacity
    noisy_gate_policy='RSample',             # Gating noise policy
)

The MoE layer's forward pass returns a tuple (output, gate_loss, expert_count) where:

  • output -- The routed expert output tensor
  • gate_loss -- Auxiliary load balancing loss (can be added to training loss)
  • expert_count -- Tensor of per-expert token counts, useful for monitoring utilization

MoE Parameter Groups

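When using ZeRO optimization with MoE, expert parameters and non-expert parameters must be placed in separate optimizer parameter groups. Expert parameters are replicated only across the expert data parallel group, so their gradients are reduced over a different set of ranks than non-expert parameters, which are reduced across all data parallel ranks: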

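from deepspeed.moe.utils import split_params_into_different_moe_groups_for_optimizer

def create_moe_param_groups(model):
    """Split model parameters into expert and non-expert optimizer groups."""
    # Start from a single group holding every parameter; the helper moves
    # expert parameters into their own groups.
    parameters = {"params": [p for p in model.parameters()], "name": "parameters"}
    return split_params_into_different_moe_groups_for_optimizer(parameters)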

This is controlled by the --moe-param-group CLI flag.
