Implementation:FMInference FlexLLMGen DeepSpeed Sharded MoE
| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen, Upstream: DeepSpeed |
| Domains | Distributed_Training, Mixture_Of_Experts |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Vendored DeepSpeed module implementing sharded Mixture of Experts (MoE) layers with Top-K gating, token routing, and expert-parallel All-to-All communication.
Description
The sharded_moe.py file (581 lines) is a vendored copy of DeepSpeed's MoE implementation, which is itself adapted from Facebook's fairscale library. It provides the core components for building Mixture of Experts transformer models with expert-parallel distributed execution.
Key components include:
- TopKGate -- A gating module that implements Top-1 and Top-2 expert selection as described in the GShard paper. It computes a linear projection to produce expert logits, applies softmax to get gate probabilities, and selects the top-k experts per token. Supports noisy gating policies (Jitter for multiplicative noise, RSample for Gumbel noise) and configurable capacity factors.
- MOELayer -- The main MoE layer that orchestrates token dispatch to experts and result combination. It uses _AllToAll for expert-parallel communication, dispatching tokens to the appropriate expert's GPU and gathering results back.
- top1gating / top2gating -- Standalone gating functions that compute dispatch masks and combine weights for Top-1 and Top-2 routing respectively, with support for token dropping when expert capacity is exceeded.
- _AllToAll -- A custom autograd function wrapping dist.all_to_all_single. Its backward pass applies the same all-to-all exchange to the incoming gradients, so gradients flow back to the ranks that sent the tokens.
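The Top-1 routing described above can be illustrated with a framework-free sketch. This is not the vendored code: the function names `softmax` and `top1_gating_sketch` are hypothetical, plain nested lists stand in for tensors, tokens are assigned to capacity slots in arrival order (the real module can also use random token selection), and counting `exp_counts` before dropping is an assumption of the sketch.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def top1_gating_sketch(logits, capacity):
    """Route each token to its highest-probability expert.

    logits: S x E nested list (S tokens, E experts).
    Returns (l_aux, combine_weights, dispatch_mask, exp_counts); the two
    masks are S x E x C nested lists, where C is the expert capacity.
    """
    num_tokens, num_experts = len(logits), len(logits[0])
    gates = [softmax(row) for row in logits]
    chosen = [g.index(max(g)) for g in gates]

    # Tokens routed to each expert, counted before any are dropped
    # (an assumption of this sketch).
    exp_counts = [chosen.count(e) for e in range(num_experts)]

    # Load-balancing loss: mean gate probability times the fraction of
    # tokens per expert, summed and scaled by the number of experts.
    me = [sum(g[e] for g in gates) / num_tokens for e in range(num_experts)]
    ce = [count / num_tokens for count in exp_counts]
    l_aux = num_experts * sum(m * c for m, c in zip(me, ce))

    combine_weights = [[[0.0] * capacity for _ in range(num_experts)]
                       for _ in range(num_tokens)]
    dispatch_mask = [[[False] * capacity for _ in range(num_experts)]
                     for _ in range(num_tokens)]
    next_slot = [0] * num_experts
    for s, e in enumerate(chosen):
        slot = next_slot[e]
        next_slot[e] += 1
        if slot < capacity:  # tokens beyond capacity are dropped
            combine_weights[s][e][slot] = gates[s][e]
            dispatch_mask[s][e][slot] = True
    return l_aux, combine_weights, dispatch_mask, exp_counts
```

With four tokens, two experts, and a capacity of 2, three tokens that all prefer expert 0 compete for two slots, so the third is dropped and contributes nothing to the combined output.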
Helper functions include multiplicative_jitter (multiplicative noise intended to make the model more resilient to bfloat16 rounding errors), gumbel_rsample (Gumbel distribution sampling for noisy gating), and optimized einsum rewrites for common MoE tensor operations.
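The two noise helpers can be approximated without a tensor library. The sketch below is illustrative only: the `*_sketch` names are hypothetical, the default `epsilon` of 1e-2 is an assumption, and Gumbel samples are drawn by inverse-CDF sampling rather than through a framework's distribution class.

```python
import math
import random


def multiplicative_jitter_sketch(values, epsilon=1e-2, rng=random):
    """Scale each value by a random factor in [1 - epsilon, 1 + epsilon]."""
    return [v * rng.uniform(1.0 - epsilon, 1.0 + epsilon) for v in values]


def gumbel_sample_sketch(rng=random):
    """Draw one Gumbel(0, 1) sample via the inverse CDF: -log(-log(u))."""
    # Clamp u away from 0 and 1 so both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def noisy_argmax_sketch(logits, rng=random):
    """Pick an expert after adding independent Gumbel noise to each logit."""
    noisy = [x + gumbel_sample_sketch(rng) for x in logits]
    return noisy.index(max(noisy))
```

Jitter perturbs values multiplicatively and keeps their sign and rough scale, whereas Gumbel noise is additive and makes the argmax behave like a sample from the softmax distribution over experts.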
Usage
The MoE components are used within DeepSpeed's training engine when building models with expert parallelism. In FlexLLMGen's benchmark suite, the file ships as part of the vendored DeepSpeed package, which serves as a baseline for performance evaluation.
Code Reference
| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | benchmark/third_party/DeepSpeed/deepspeed/moe/sharded_moe.py |
| Lines | 1-581 |
| Type | AUTO_KEEP (vendored dependency) |
Key class signatures:
    class TopKGate(Module):
        def __init__(self, model_dim: int, num_experts: int, k: int = 1,
                     capacity_factor: float = 1.0, eval_capacity_factor: float = 1.0,
                     min_capacity: int = 8, noisy_gate_policy: Optional[str] = None,
                     drop_tokens: bool = True, use_rts: bool = True) -> None:
            ...
        def forward(self, input: torch.Tensor, used_token: torch.Tensor = None,
                    use_tutel: bool = False) -> Tuple[Tensor, Tensor, Tensor]:
            ...

    class MOELayer(Base):
        # Orchestrates token dispatch, expert computation, and result combination
        ...
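The exchange that MOELayer performs through _AllToAll can be pictured as a transpose of chunks across ranks. The following is a single-process simulation under stated assumptions, not the distributed call itself: `all_to_all_sketch` is a hypothetical name, every rank is assumed to hold one equally sized chunk per destination rank, and strings stand in for tensor slices.

```python
def all_to_all_sketch(per_rank_chunks):
    """Simulate an all-to-all exchange with equal splits.

    per_rank_chunks[src][dst] is the chunk that rank `src` sends to rank
    `dst`. The result is indexed as result[dst][src]: what rank `dst`
    received from rank `src`.
    """
    world_size = len(per_rank_chunks)
    return [[per_rank_chunks[src][dst] for src in range(world_size)]
            for dst in range(world_size)]
```

Because the exchange is a transpose, applying it twice restores the original layout. That symmetry is why the same operation can be reused to route results (and gradients) back to the ranks the tokens came from.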
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_dim | int | Yes | Embedding dimension size of the model |
| num_experts | int | Yes | Number of experts in the MoE layer |
| k | int | No | Number of experts per token (1 or 2, default: 1) |
| capacity_factor | float | No | Capacity factor for training (default: 1.0) |
| eval_capacity_factor | float | No | Capacity factor for evaluation (default: 1.0) |
| min_capacity | int | No | Minimum expert capacity (default: 8) |
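| noisy_gate_policy | Optional[str] | No | Noise policy: 'Jitter', 'RSample', or None (default: None) |
| drop_tokens | bool | No | Whether to drop tokens that exceed expert capacity (default: True) |
| use_rts | bool | No | Whether to use random token selection when choosing which tokens to keep (default: True) |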
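How the capacity parameters interact is easiest to see with numbers. The sketch below assumes the usual rule (tokens per expert scaled by the capacity factor, rounded up, with `min_capacity` as a floor); `expert_capacity_sketch` is a hypothetical name and not a function in the vendored file.

```python
import math


def expert_capacity_sketch(num_tokens, num_experts,
                           capacity_factor=1.0, min_capacity=8):
    """Number of token slots each expert accepts in one forward pass."""
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    # Never go below the configured floor, even for tiny batches.
    return max(capacity, min_capacity)
```

For example, 1024 tokens over 8 experts with a factor of 1.0 yields 128 slots per expert, while a batch of only 16 tokens falls back to the minimum of 8.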
Outputs
| Output | Type | Description |
|---|---|---|
| l_aux | Tensor | Auxiliary load-balancing loss for training |
| combine_weights | Tensor | Weights for combining expert outputs (shape: [S, E, C], i.e. tokens x experts x capacity slots) |
| dispatch_mask | Tensor | Boolean mask for dispatching tokens to experts (same [S, E, C] layout as combine_weights) |
| exp_counts | Tensor | Token counts per expert for monitoring |
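To show how these outputs are consumed, the sketch below spells out the two contractions that a dispatch/combine step performs, written as explicit loops over nested lists. Here S is the number of tokens, E the number of experts, C the capacity, and M the model dimension; the function names are hypothetical, and the einsum patterns in the docstrings describe the intended contraction rather than quoting the vendored code.

```python
def dispatch_sketch(dispatch_mask, tokens):
    """Equivalent of einsum('sec,sm->ecm'): copy tokens into expert slots."""
    S, M = len(tokens), len(tokens[0])
    E, C = len(dispatch_mask[0]), len(dispatch_mask[0][0])
    out = [[[0.0] * M for _ in range(C)] for _ in range(E)]
    for s in range(S):
        for e in range(E):
            for c in range(C):
                if dispatch_mask[s][e][c]:
                    for m in range(M):
                        out[e][c][m] += tokens[s][m]
    return out


def combine_sketch(combine_weights, expert_out):
    """Equivalent of einsum('sec,ecm->sm'): weighted sum of expert outputs."""
    S = len(combine_weights)
    E, C, M = len(expert_out), len(expert_out[0]), len(expert_out[0][0])
    out = [[0.0] * M for _ in range(S)]
    for s in range(S):
        for e in range(E):
            for c in range(C):
                w = combine_weights[s][e][c]
                if w:
                    for m in range(M):
                        out[s][m] += w * expert_out[e][c][m]
    return out
```

A dropped token has an all-zero row in both tensors, so it is never sent to an expert and its combined output is a zero vector.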