Principle:Microsoft DeepSpeedExamples DeepSpeed MoE Training
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Repository | Microsoft/DeepSpeedExamples |
| Title | DeepSpeed_MoE_Training |
| Sources | Paper: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, Paper: GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Doc: DeepSpeed MoE Tutorial |
| Domains | Deep_Learning, Model_Architecture, Distributed_Training |
| Related Implementation | Implementation:Microsoft_DeepSpeedExamples_Net_DeepSpeed |
Overview
A model architecture technique that replaces dense feed-forward layers with Mixture of Experts (MoE) layers to increase model capacity without a proportional increase in compute.
Description
Mixture of Experts (MoE) is a conditional computation technique that scales model capacity (parameter count) while keeping the per-sample computational cost approximately constant. Instead of routing every input through every parameter, MoE uses a learned gating mechanism to route each input to a subset of specialized "expert" sub-networks.
In the CIFAR-10 DeepSpeed example, MoE replaces the final fully-connected classification layer with one or more MoE layers (multiple expert networks plus a gating function) followed by a new linear classification layer. This demonstrates the core MoE integration pattern:
- Expert Networks -- Multiple copies of a sub-network (e.g., nn.Linear(84, 84)) serve as experts. Each expert learns to specialize on different parts of the input distribution.
- Gating Network -- A learned router (typically a linear layer + softmax) assigns input tokens to experts based on a routing score.
- Top-K Selection -- Only the top-k scoring experts process each input, keeping compute constant regardless of the total number of experts.
- Expert Parallelism -- Experts can be distributed across GPUs (expert parallel world), allowing the model to have more experts than fit on a single device.
DeepSpeed provides deepspeed.moe.layer.MoE as a drop-in replacement for dense layers, handling gating, expert dispatch, load balancing, and expert parallelism transparently.
Theoretical Basis
Gating Mechanism
The gating function computes routing scores for each expert:
G(x) = TopK(softmax(W_g * x + noise))
where:
- x is the input tensor
- W_g is the learnable gating weight matrix (shape: [num_experts, hidden_size], so that W_g * x yields one score per expert)
- noise is optional noise injected for load balancing (controlled by noisy_gate_policy)
- TopK selects the k highest-scoring experts
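A minimal sketch of the gating computation above, in plain Python with the optional noise term omitted. The helper names (softmax, top_k_gate) and the toy weights are illustrative assumptions, not DeepSpeed's implementation. W_g is stored here with one row per expert, so each row dotted with x gives that expert's logit.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(x, w_g, k=1):
    """Return {expert_index: gate_weight} for the k highest-scoring experts."""
    # One logit per expert: dot product of that expert's weight row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_g]
    probs = softmax(logits)
    # Keep only the k experts with the largest routing probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return {i: probs[i] for i in top}


x = [1.0, 2.0]                                  # toy input, hidden_size = 2
w_g = [[0.5, 0.0], [0.0, 1.0], [-1.0, 0.0]]     # 3 experts
gates = top_k_gate(x, w_g, k=2)                 # experts 1 and 0 are selected
```

With k=2, the two selected gate weights do not sum to 1 because the softmax runs over all experts. Some implementations renormalize the selected weights, which this sketch does not do.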
Expert Output Computation
The final output is a weighted sum of the selected expert outputs:
y = sum_{i in TopK} G(x)_i * E_i(x)
where:
- G(x)_i is the gating weight for expert i
- E_i(x) is the output of expert i applied to input x
- The sum is only over the top-k experts (typically k=1 or k=2)
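The weighted sum above can be sketched in a few lines of plain Python. The toy experts and the gate weights are hypothetical values chosen for illustration. In a real layer the experts are neural sub-networks and the gate weights come from the gating function.

```python
def moe_output(x, experts, gates):
    """Compute y = sum_i G(x)_i * E_i(x) over the selected experts only."""
    y = None
    for i, gate_weight in gates.items():
        out = experts[i](x)              # E_i(x): only selected experts run
        if y is None:
            y = [0.0] * len(out)
        y = [acc + gate_weight * o for acc, o in zip(y, out)]
    return y


# Three toy "experts"; only experts 0 and 1 are selected below.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
gates = {0: 0.25, 1: 0.75}               # hypothetical top-2 gate weights
y = moe_output([1.0, 2.0], experts, gates)
```

Expert 2 is never called, which is where the compute saving comes from: cost scales with k, not with the total number of experts.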
Load Balancing
Without intervention, MoE models tend to collapse -- routing most inputs to a small subset of experts. Several strategies address this:
| Strategy | Description | DeepSpeed Parameter |
|---|---|---|
| Noisy Gating (RSample) | Add Gumbel noise to gating logits before expert selection, encouraging exploration | noisy_gate_policy='RSample' |
| Noisy Gating (Jitter) | Multiply inputs by uniform noise before gating | noisy_gate_policy='Jitter' |
| Capacity Factor | Limit the maximum number of tokens each expert can process | capacity_factor, eval_capacity_factor (with min_capacity as a lower bound) |
| Auxiliary Loss | Add a loss term that penalizes unbalanced expert utilization | Internal to DeepSpeed MoE |
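To make the capacity and auxiliary-loss rows concrete, here is a small sketch in plain Python. It assumes the formulation commonly used in the GShard and Switch Transformers papers: capacity is the per-expert share of tokens scaled by a factor, and the auxiliary loss is num_experts times the sum over experts of (mean gate probability) x (fraction of tokens routed). The function and argument names are illustrative and the code is not taken from DeepSpeed.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor=1.0, min_capacity=0):
    """Maximum tokens one expert may accept in a batch (illustrative formula)."""
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    return max(capacity, min_capacity)


def load_balance_loss(gate_probs, assignments, num_experts):
    """Auxiliary loss: equals 1.0 when routing is perfectly uniform, larger when skewed."""
    n = len(gate_probs)
    # Mean routing probability assigned to each expert.
    mean_prob = [sum(p[e] for p in gate_probs) / n for e in range(num_experts)]
    # Fraction of tokens actually dispatched to each expert.
    frac = [sum(1 for a in assignments if a == e) / n for e in range(num_experts)]
    return num_experts * sum(m * f for m, f in zip(mean_prob, frac))


balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
```

The collapsed routing yields a larger loss than the balanced one, so minimizing this term pushes the router toward even expert utilization.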
Residual MoE
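The residual MoE variant (mlp_type='residual' in the example, which sets use_residual=True on the MoE layer) adds a dense path alongside the MoE layer:
y = c_1 * MoE(x) + c_2 * Dense(x)
This provides a fallback path through a dense layer, improving training stability when expert routing is suboptimal. The coefficients c_1 and c_2 are computed per input by a learned linear layer followed by a softmax, so they are non-negative and sum to 1.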
Expert Parallelism
Expert parallelism distributes experts across GPUs within a group:
GPU 0: Expert 0          GPU 1: Expert 1
        \                      /
         +--- AllToAll comm ---+
        /                      \
GPU 0: Process tokens    GPU 1: Process tokens
       for Expert 0             for Expert 1
The ep_world_size parameter (passed to the MoE layer as ep_size) controls the size of the expert parallel group. With ep_world_size=2 and 2 experts, each GPU holds one expert. With ep_world_size=1, every GPU holds a full copy of all experts.
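The placement rule can be sketched as a small helper. This assumes experts are split evenly and assigned to ranks in contiguous blocks, which is an illustrative assumption. The function name local_expert_ids is hypothetical, and ep_size stands for the expert parallel world size.

```python
def local_expert_ids(num_experts, ep_size, ep_rank):
    """Return the expert indices held by one rank of an expert parallel group."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank must be in [0, ep_size)")
    num_local = num_experts // ep_size          # experts stored on each rank
    start = ep_rank * num_local
    return list(range(start, start + num_local))


two_gpu_layout = [local_expert_ids(2, 2, rank) for rank in range(2)]
replicated_layout = local_expert_ids(4, 1, 0)
```

With 2 experts and a group of size 2, each rank holds a single expert. With a group of size 1, the one rank in each group holds every expert.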
MoE vs Dense Comparison
| Aspect | Dense Network | MoE Network |
|---|---|---|
| Parameters | Fixed (all active) | Scaled by num_experts (sparse activation) |
| FLOPs per input | Proportional to total parameters | Proportional to k experts only |
| Memory | All parameters on each GPU | Distributed via expert parallelism |
| Training complexity | Standard backprop | Requires load balancing + expert routing |
| Communication | AllReduce gradients | AllToAll for expert dispatch + AllReduce for non-expert params |
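The parameter and FLOPs rows can be checked with simple arithmetic for the nn.Linear(84, 84) expert used in this example. The sketch assumes each expert has a bias term and that the gate is a bias-free linear layer from hidden_size to num_experts. Both are assumptions made for the calculation.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a fully-connected layer."""
    return in_features * out_features + (out_features if bias else 0)


hidden_size = 84
num_experts = 4
k = 1

per_expert = linear_params(hidden_size, hidden_size)               # one expert
gate = linear_params(hidden_size, num_experts, bias=False)         # router
total_expert_params = num_experts * per_expert                     # stored
active_expert_params = k * per_expert                              # used per input

print(f"stored: {total_expert_params + gate}, active per input: {active_expert_params + gate}")
```

Stored parameters grow linearly with num_experts, while the parameters touched per input depend only on k plus the small gating layer.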
DeepSpeed MoE Layer API
import torch.nn as nn
import deepspeed

moe_layer = deepspeed.moe.layer.MoE(
    hidden_size=84,                # Input/output dimension
    expert=nn.Linear(84, 84),      # Expert module (cloned for each expert)
    num_experts=4,                 # Total number of experts
    ep_size=2,                     # Expert parallel world size
    use_residual=False,            # Enable residual MoE
    k=1,                           # Top-k routing
    min_capacity=0,                # Minimum expert capacity
    noisy_gate_policy='RSample',   # Gating noise policy
)
The MoE layer returns a tuple (output, gate_loss, expert_count) where:
- output -- The routed expert output tensor
- gate_loss -- Auxiliary load balancing loss (can be added to training loss)
- expert_count -- Tensor of per-expert token counts, useful for monitoring expert utilization
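A hedged sketch of how the returned tuple might be consumed when several MoE layers are stacked. stub_moe_layer is a hypothetical stand-in that only mimics the three-element return value, so that the example runs without DeepSpeed. The aux_weight coefficient is likewise an assumption, not a DeepSpeed setting.

```python
def forward_with_aux(x, moe_layers):
    """Run x through MoE-style layers, summing their auxiliary gate losses."""
    aux_total = 0.0
    for layer in moe_layers:
        # Each layer returns (output, gate_loss, expert_count).
        x, gate_loss, _expert_count = layer(x)
        aux_total = aux_total + gate_loss
    return x, aux_total


def stub_moe_layer(x):
    """Hypothetical stand-in mimicking the tuple returned by an MoE layer."""
    return [2.0 * v for v in x], 0.5, [len(x)]


output, aux = forward_with_aux([1.0, 3.0], [stub_moe_layer, stub_moe_layer])
task_loss = 2.0       # placeholder for e.g. a cross-entropy value
aux_weight = 0.01     # hypothetical weighting coefficient
loss = task_loss + aux_weight * aux
```

If the auxiliary term is left out of the loss, the router receives no explicit balancing signal, so whether to include it is a training design choice.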
MoE Parameter Groups
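When using ZeRO optimization with MoE, expert parameters and non-expert parameters must be placed in separate optimizer parameter groups. Expert parameters are replicated only across the expert data parallel group, so their gradients are reduced over a different set of ranks than those of non-expert parameters, which are reduced across the full data parallel group: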
from deepspeed.moe.utils import split_params_into_different_moe_groups_for_optimizer


def create_moe_param_groups(model):
    """Create separate optimizer parameter groups for expert and non-expert parameters."""
    parameters = {"params": [p for p in model.parameters()], "name": "parameters"}
    return split_params_into_different_moe_groups_for_optimizer(parameters)
This is controlled by the --moe-param-group CLI flag.
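To illustrate the idea of the split, and not DeepSpeed's actual logic, the sketch below partitions one parameter group into a dense group and an expert group. The is_expert attribute, the "_experts" name suffix, and the "moe" key are hypothetical markers chosen for this example. The real helper relies on its own internal bookkeeping.

```python
from types import SimpleNamespace


def split_expert_and_dense(group):
    """Split one optimizer param group into dense and expert groups (illustrative)."""
    dense = [p for p in group["params"] if not getattr(p, "is_expert", False)]
    expert = [p for p in group["params"] if getattr(p, "is_expert", False)]
    groups = [{"name": group["name"], "params": dense}]
    if expert:
        # Expert parameters get their own group so they can be handled separately.
        groups.append({"name": group["name"] + "_experts", "params": expert, "moe": True})
    return groups


# Stand-ins for model parameters; real code would use torch.nn.Parameter objects.
params = [
    SimpleNamespace(name="fc1.weight"),
    SimpleNamespace(name="experts.0.weight", is_expert=True),
    SimpleNamespace(name="experts.1.weight", is_expert=True),
    SimpleNamespace(name="fc4.weight"),
]
groups = split_expert_and_dense({"name": "parameters", "params": params})
```

The resulting list has the same shape an optimizer expects for per-group settings, with expert weights isolated from the rest.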
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Net_DeepSpeed -- CNN with optional MoE layer for CIFAR-10
- Principle:Microsoft_DeepSpeedExamples_DeepSpeed_Engine_Init -- Engine initialization that manages MoE training
- Principle:Microsoft_DeepSpeedExamples_DeepSpeed_CLI_Integration -- MoE-related CLI arguments
- Principle:Microsoft_DeepSpeedExamples_Classification_Evaluation -- Evaluating MoE models