Principle:Microsoft DeepSpeedExamples DeepSpeed MoE Training

Metadata

Field	Value
Page Type	Principle
Repository	Microsoft/DeepSpeedExamples
Title	DeepSpeed_MoE_Training
Sources	Paper: Switch Transformer: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, Paper: GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Doc: DeepSpeed MoE Tutorial
Domains	Deep_Learning, Model_Architecture, Distributed_Training
Related Implementation	Implementation:Microsoft_DeepSpeedExamples_Net_DeepSpeed

Overview

DeepSpeed MoE training is a model-architecture technique that replaces dense feed-forward layers with Mixture of Experts (MoE) layers, increasing model capacity without a proportional increase in compute.

Description

Mixture of Experts (MoE) is a conditional computation technique that scales model capacity (parameter count) while keeping the per-sample computational cost approximately constant. Instead of routing every input through every parameter, MoE uses a learned gating mechanism to route each input to a subset of specialized "expert" sub-networks.

In the CIFAR-10 DeepSpeed example, the dense classifier head is replaced by one or more MoE layers (expert networks plus a gating function) that operate on the 84-dimensional hidden features, followed by a final linear layer that maps to the 10 classes. This demonstrates the core MoE integration pattern:

Expert Networks -- Multiple copies of a sub-network (e.g., nn.Linear(84, 84)) serve as experts. Each expert learns to specialize on different parts of the input distribution.
Gating Network -- A learned router (typically a linear layer + softmax) assigns input tokens to experts based on a routing score.
Top-K Selection -- Only the top-k scoring experts process each input, so expert compute per input stays roughly constant as the total number of experts grows (only the small gating computation scales with the expert count).
Expert Parallelism -- Experts can be distributed across GPUs (expert parallel world), allowing the model to have more experts than fit on a single device.

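DeepSpeed provides deepspeed.moe.layer.MoE as a near drop-in replacement for dense layers. It handles gating, expert dispatch, load balancing, and expert parallelism internally; the only change for calling code is that the layer returns a tuple rather than a single tensor.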

Theoretical Basis

Gating Mechanism

The gating function computes routing scores for each expert:

G(x) = TopK(softmax(x * W_g + noise))

where:

x is the input tensor (one row per token)
W_g is the learnable gating weight matrix (shape: [hidden_size, num_experts])
noise is optional noise injected for load balancing (controlled by noisy_gate_policy)
TopK selects the k highest-scoring experts
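
The gating step above can be sketched in plain Python. This is a minimal illustration, not DeepSpeed's implementation: the helper names softmax and top_k_gate are hypothetical, the optional noise term is omitted, and the gate weights are the raw softmax probabilities of the selected experts.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(x, w_g, k=1):
    """Return {expert_index: gate_weight} for the k highest-scoring experts.

    x   : input vector of length hidden_size
    w_g : gating weights, shape [hidden_size][num_experts]
    """
    num_experts = len(w_g[0])
    # logits[e] = sum over h of x[h] * w_g[h][e]  (i.e. x * W_g)
    logits = [sum(x[h] * w_g[h][e] for h in range(len(x)))
              for e in range(num_experts)]
    probs = softmax(logits)
    # Keep only the k experts with the highest routing probability.
    top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:k]
    return {e: probs[e] for e in top}

x = [1.0, 0.0]
w_g = [[0.0, 1.0, 2.0],   # hidden unit 0 -> logits for experts 0, 1, 2
       [5.0, 0.0, 0.0]]   # hidden unit 1 (inactive for this x)
gates = top_k_gate(x, w_g, k=2)
```

With this input the logits are [0, 1, 2], so experts 2 and 1 are selected and expert 0 is skipped entirely.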

Expert Output Computation

The final output is a weighted sum of the selected expert outputs:

y = sum_{i in TopK} G(x)_i * E_i(x)

where:

G(x)_i is the gating weight for expert i
E_i(x) is the output of expert i applied to input x
The sum is only over the top-k experts (typically k=1 or k=2)
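
The weighted sum can be written out directly. The sketch below assumes experts are plain Python callables and that gates maps each selected expert index to its weight; both are illustrative choices, not part of the DeepSpeed API.

```python
def moe_output(x, gates, experts):
    """Compute y = sum_i G(x)_i * E_i(x) over the selected experts only."""
    y = None
    for idx, weight in gates.items():
        out = experts[idx](x)                 # unselected experts never run
        scaled = [weight * v for v in out]
        y = scaled if y is None else [a + b for a, b in zip(y, scaled)]
    return y

# Three toy experts; only experts 0 and 1 are selected below.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
gates = {0: 0.75, 1: 0.25}
y = moe_output([1.0, 2.0], gates, experts)
```

Here E_0(x) = [2, 4] and E_1(x) = [2, 3], so y = 0.75 * [2, 4] + 0.25 * [2, 3] = [2.0, 3.75].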

Load Balancing

Without intervention, MoE models tend to collapse -- routing most inputs to a small subset of experts. Several strategies address this:

Strategy	Description	DeepSpeed Parameter
Noisy Gating (RSample)	Add sampled Gumbel noise to the gating logits when selecting experts, encouraging exploration	`noisy_gate_policy='RSample'`
Noisy Gating (Jitter)	Multiply inputs by uniform noise before gating	`noisy_gate_policy='Jitter'`
Capacity Factor	Limit the maximum number of tokens each expert can process; `min_capacity` sets a lower bound on that limit	`capacity_factor`, `min_capacity`
Auxiliary Loss	Add a loss term that penalizes unbalanced expert utilization	Computed inside the MoE layer and returned as its second output
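
Two of these strategies reduce to short formulas. The sketch below assumes top-1 routing and uses the auxiliary loss form from the Switch Transformer paper (number of experts times the sum over experts of token fraction times mean router probability); the function names and argument names are hypothetical and the code is not taken from DeepSpeed.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.0, min_capacity=0):
    """Maximum tokens one expert accepts in a batch, with a lower bound."""
    cap = math.ceil(num_tokens / num_experts * capacity_factor)
    return max(cap, min_capacity)

def load_balance_loss(probs):
    """Switch-style auxiliary loss for top-1 routing.

    probs[t][e] is the router probability of expert e for token t.
    The loss equals 1.0 under perfectly uniform routing and grows as
    routing concentrates on fewer experts.
    """
    num_tokens = len(probs)
    num_experts = len(probs[0])
    counts = [0] * num_experts
    for p in probs:
        counts[max(range(num_experts), key=lambda e: p[e])] += 1
    frac_tokens = [c / num_tokens for c in counts]
    mean_prob = [sum(p[e] for p in probs) / num_tokens
                 for e in range(num_experts)]
    return num_experts * sum(f * m for f, m in zip(frac_tokens, mean_prob))

balanced = load_balance_loss([[0.9, 0.1], [0.1, 0.9]])    # one token per expert
collapsed = load_balance_loss([[0.9, 0.1], [0.9, 0.1]])   # all tokens to expert 0
```

The collapsed routing yields a larger loss than the balanced one, which is what pushes the router toward even utilization during training.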

Residual MoE

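The residual MoE variant (mlp_type='residual' in the example's CLI, use_residual=True in the layer API) runs a dense MLP in parallel with the MoE layer and blends the two outputs:

y = c_1(x) * MoE(x) + c_2(x) * Dense(x)

This provides a fallback path through a dense layer, improving training stability when expert routing is suboptimal. The coefficients (c_1, c_2) are learned rather than fixed: they come from a softmax over the output of a small linear layer applied to x, so the blend can differ per input.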
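
A minimal sketch of blending a routed output with a dense fallback. It assumes the two mixing weights come from a softmax over two learned logits per input, which is an illustrative choice; the function name residual_moe and the coef_logits argument are hypothetical.

```python
import math

def residual_moe(moe_out, dense_out, coef_logits):
    """Blend MoE and dense outputs with softmax-normalized weights.

    coef_logits holds two values: the first weights the MoE branch,
    the second weights the dense branch.
    """
    m = max(coef_logits)
    e = [math.exp(v - m) for v in coef_logits]
    w_moe, w_dense = e[0] / sum(e), e[1] / sum(e)
    return [w_moe * a + w_dense * b for a, b in zip(moe_out, dense_out)]

# Equal logits give each branch a weight of 0.5.
y = residual_moe([2.0, 4.0], [0.0, 2.0], [0.0, 0.0])
```

Because the weights sum to 1, the dense branch takes over smoothly whenever the learned logits favor it.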

Expert Parallelism

Expert parallelism distributes experts across GPUs within a group:

GPU 0: Expert 0         GPU 1: Expert 1
    \                      /
     +--- AllToAll comm ---+
    /                      \
GPU 0: Process tokens    GPU 1: Process tokens
     for Expert 0              for Expert 1

The ep_world_size setting (passed to the MoE layer as ep_size) controls the size of the expert parallel group. With ep_world_size=2 and 2 experts, each GPU holds one expert. With ep_world_size=1, all experts are replicated on every GPU.
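
The placement and dispatch logic can be simulated without any GPUs. The sketch below assumes experts are assigned to ranks in contiguous blocks and that the expert count divides evenly by the group size; the function names are hypothetical, and the exchange is an in-process stand-in for the AllToAll collective.

```python
def experts_on_rank(rank, num_experts, ep_size):
    """Indices of the experts owned by `rank` in an expert-parallel group."""
    assert num_experts % ep_size == 0, "experts must divide evenly across ranks"
    per_rank = num_experts // ep_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))

def all_to_all_dispatch(tokens_per_rank, num_experts, ep_size):
    """Simulate AllToAll: send each (token, expert) pair to the owning rank.

    tokens_per_rank[r] is the list of (token, expert_index) pairs held by
    rank r before the exchange. Returns received[r], the pairs rank r must
    process afterwards.
    """
    per_rank = num_experts // ep_size
    received = [[] for _ in range(ep_size)]
    for src in range(ep_size):
        for token, expert in tokens_per_rank[src]:
            received[expert // per_rank].append((token, expert))
    return received

tokens = [[("a", 0), ("b", 1)],   # held by rank 0
          [("c", 1), ("d", 0)]]   # held by rank 1
received = all_to_all_dispatch(tokens, num_experts=2, ep_size=2)
```

After the exchange, rank 0 holds every token routed to expert 0 and rank 1 holds every token routed to expert 1, regardless of where the tokens started. A second exchange in the opposite direction returns the expert outputs to their source ranks.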

MoE vs Dense Comparison

Aspect	Dense Network	MoE Network
Parameters	Fixed (all active)	Scaled by num_experts (sparse activation)
FLOPs per input	Proportional to total parameters	Proportional to k experts only
Memory	All parameters on each GPU	Distributed via expert parallelism
Training complexity	Standard backprop	Requires load balancing + expert routing
Communication	AllReduce gradients	AllToAll for expert dispatch + AllReduce for non-expert params
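
The parameter and compute rows can be made concrete with the 84-wide linear experts used on this page. The arithmetic below is a rough sketch: it assumes a bias-free gate of shape [hidden, num_experts] and uses parameter count as a proxy for per-input FLOPs; the function names are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully-connected layer."""
    return in_features * out_features + out_features

def moe_param_summary(hidden, num_experts, k):
    """Compare a dense hidden->hidden layer with its MoE counterpart."""
    expert = linear_params(hidden, hidden)
    gate = hidden * num_experts          # assumed bias-free gating matrix
    return {
        "dense": expert,                             # one always-active layer
        "moe_total": num_experts * expert + gate,    # parameters stored
        "moe_active": k * expert + gate,             # parameters used per input
    }

summary = moe_param_summary(hidden=84, num_experts=4, k=1)
```

With 4 experts and k=1, the layer stores roughly four times the parameters of the dense version while each input touches only slightly more than one expert's worth.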

DeepSpeed MoE Layer API

import torch.nn as nn
import deepspeed

moe_layer = deepspeed.moe.layer.MoE(
    hidden_size=84,                          # Input/output dimension
    expert=nn.Linear(84, 84),               # Expert module (cloned for each expert)
    num_experts=4,                           # Total number of experts
    ep_size=2,                               # Expert parallel world size
    use_residual=False,                      # True enables residual MoE
    k=1,                                     # Top-k routing
    min_capacity=0,                          # Minimum expert capacity
    noisy_gate_policy='RSample',             # Gating noise policy (None, 'Jitter', or 'RSample')
)

The MoE layer returns a tuple (output, gate_loss, expert_count) where:

output -- The routed expert output tensor
gate_loss -- Auxiliary load balancing loss (can be added to training loss)
expert_count -- Tensor of per-expert token counts, useful for tracking expert utilization

MoE Parameter Groups

When using ZeRO optimization with MoE, expert parameters and non-expert parameters must be placed in separate optimizer parameter groups. Expert parameters are sharded across the expert parallel group, so their gradients are reduced over a different set of ranks than the AllReduce used for non-expert parameters:

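from deepspeed.moe.utils import split_params_into_different_moe_groups_for_optimizer

def create_moe_param_groups(model):
    """Create separate parameter groups for each expert."""
    # Start from a single group holding every parameter; the helper then
    # splits expert parameters out into their own group(s).
    parameters = {"params": [p for p in model.parameters()], "name": "parameters"}
    return split_params_into_different_moe_groups_for_optimizer(parameters)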

This is controlled by the --moe-param-group CLI flag.
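
The idea behind the split can be illustrated without DeepSpeed. The sketch below is not the library's implementation: the Param class, the is_expert flag, and the "experts" and "moe" keys are hypothetical stand-ins chosen for this example.

```python
class Param:
    """Stand-in for a tensor parameter; `is_expert` is a hypothetical flag."""
    def __init__(self, name, is_expert=False):
        self.name = name
        self.is_expert = is_expert

def split_moe_groups(group):
    """Split one optimizer group into a non-expert group and an expert group."""
    dense = [p for p in group["params"] if not p.is_expert]
    expert = [p for p in group["params"] if p.is_expert]
    # The non-expert group keeps every other setting of the original group.
    groups = [{**group, "params": dense}]
    if expert:
        groups.append({**group, "params": expert, "name": "experts", "moe": True})
    return groups

params = [Param("fc1.weight"),
          Param("moe.expert0.weight", is_expert=True),
          Param("fc4.weight")]
groups = split_moe_groups({"params": params, "name": "parameters"})
```

Keeping expert parameters in their own group lets the optimizer wrapper apply a different reduction scope to them while hyperparameters such as the learning rate are copied unchanged.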

Related Pages

Implementation:Microsoft_DeepSpeedExamples_Net_DeepSpeed -- CNN with optional MoE layer for CIFAR-10
Principle:Microsoft_DeepSpeedExamples_DeepSpeed_Engine_Init -- Engine initialization that manages MoE training
Principle:Microsoft_DeepSpeedExamples_DeepSpeed_CLI_Integration -- MoE-related CLI arguments
Principle:Microsoft_DeepSpeedExamples_Classification_Evaluation -- Evaluating MoE models
