Implementation:FMInference FlexLLMGen DeepSpeed Sharded MoE

Field	Value
Sources	Repo: FlexLLMGen, Upstream: DeepSpeed
Domains	Distributed_Training, Mixture_Of_Experts
Last Updated	2026-02-09 00:00 GMT

Overview

Vendored DeepSpeed module implementing sharded Mixture of Experts (MoE) layers with Top-K gating, token routing, and expert-parallel All-to-All communication.

Description

The sharded_moe.py file (581 lines) is a vendored copy of DeepSpeed's MoE implementation, which DeepSpeed in turn adapted from Facebook's fairscale library. It provides the core components for building Mixture of Experts transformer models with expert-parallel distributed execution.

Key components include:

TopKGate -- A gating module that implements Top-1 and Top-2 expert selection as described in the GShard paper. It computes a linear projection to produce expert logits, applies softmax to get gate probabilities, and selects the top-k experts per token. Supports noisy gating policies (Jitter for multiplicative noise, RSample for Gumbel noise) and configurable capacity factors.
MOELayer -- The main MoE layer that orchestrates token dispatch to experts and result combination. It uses _AllToAll for expert-parallel communication, dispatching tokens to the appropriate expert's GPU and gathering results back.
top1gating / top2gating -- Standalone gating functions that compute dispatch masks and combine weights for Top-1 and Top-2 routing respectively, with support for token dropping when expert capacity is exceeded.
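_AllToAll -- A custom autograd function wrapping dist.all_to_all_single. Its backward pass applies the same all-to-all exchange to the incoming gradients, so gradients flow back to the ranks that originally held the tokens.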
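The Top-1 routing described above can be illustrated with a small, dependency-free sketch. This is not the vendored code: `top1_gating_sketch` and `softmax` are hypothetical names, plain Python lists stand in for tensors, and tokens claim capacity slots in arrival order (an assumption made for simplicity; the real module can also select tokens randomly when `use_rts` is enabled). The auxiliary loss is written in the GShard style, as the number of experts times the sum over experts of mean gate probability times fraction of tokens routed.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def top1_gating_sketch(logits, capacity):
    """Illustrative Top-1 gating over an S x E nested list of logits.

    Returns (l_aux, combine_weights, dispatch_mask, exp_counts), where
    combine_weights and dispatch_mask have shape [S][E][C].
    """
    num_tokens = len(logits)
    num_experts = len(logits[0])
    gates = [softmax(row) for row in logits]

    # Each token picks the expert with the highest gate probability.
    choice = [max(range(num_experts), key=row.__getitem__) for row in gates]

    # Tokens routed to each expert, counted before any dropping.
    exp_counts = [choice.count(e) for e in range(num_experts)]

    # Load-balancing loss: E * sum_e(mean gate prob_e * fraction routed_e).
    me = [sum(g[e] for g in gates) / num_tokens for e in range(num_experts)]
    ce = [exp_counts[e] / num_tokens for e in range(num_experts)]
    l_aux = num_experts * sum(m * c for m, c in zip(me, ce))

    # Assign capacity slots in arrival order; tokens beyond capacity are dropped.
    combine = [[[0.0] * capacity for _ in range(num_experts)]
               for _ in range(num_tokens)]
    filled = [0] * num_experts
    for s, e in enumerate(choice):
        slot = filled[e]
        if slot < capacity:
            combine[s][e][slot] = gates[s][e]
            filled[e] += 1

    # The dispatch mask is the boolean view of the combine weights.
    dispatch = [[[w != 0.0 for w in slots] for slots in per_expert]
                for per_expert in combine]
    return l_aux, combine, dispatch, exp_counts


# Four tokens, two experts, two slots per expert: the third token overflows expert 0.
example_logits = [[2.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.0, 1.0]]
l_aux, combine, dispatch, exp_counts = top1_gating_sketch(example_logits, capacity=2)
print(exp_counts)  # [3, 1]
```

In this example the third token has an all-False dispatch mask, which is what token dropping looks like from the caller's side.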

Helper functions include multiplicative_jitter (multiplies values by a random factor between 1-epsilon and 1+epsilon, which makes the model more resilient to bfloat16 rounding errors), gumbel_rsample (Gumbel distribution sampling for noisy gating), and optimized einsum rewrites for common MoE tensor operations.
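The two noise helpers can be sketched without torch as follows. The function names, the `rng` parameter, and the default `epsilon=1e-2` are assumptions made for illustration; the vendored versions operate on tensors and take a device argument. The Gumbel sampler uses the standard inverse-CDF identity: if U is uniform on (0, 1), then -log(-log(U)) follows a Gumbel(0, 1) distribution.

```python
import math
import random


def multiplicative_jitter_sketch(values, epsilon=1e-2, rng=random):
    """Scale each value by an independent factor drawn from [1 - eps, 1 + eps]."""
    if epsilon == 0:
        return list(values)
    return [v * rng.uniform(1.0 - epsilon, 1.0 + epsilon) for v in values]


def gumbel_sample_sketch(n, rng=random):
    """Draw n samples from Gumbel(0, 1) via inverse-CDF sampling."""
    samples = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:  # log(0) is undefined, so redraw in that rare case
            u = rng.random()
        samples.append(-math.log(-math.log(u)))
    return samples


# Jitter perturbs the values slightly; Gumbel noise is additive on logits.
rng = random.Random(0)
jittered = multiplicative_jitter_sketch([1.0, -2.0, 4.0], epsilon=0.01, rng=rng)
noisy_logits = [x + g for x, g in zip([0.5, 0.1, -0.3], gumbel_sample_sketch(3, rng))]
```

Taking the argmax of logits with added Gumbel noise is equivalent to sampling an expert from the softmax distribution, which is why this kind of noise suits stochastic Top-1 selection.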

Usage

The MoE components are used within DeepSpeed's training engine when building models with expert parallelism. In FlexLLMGen's benchmark suite, the file ships as part of the vendored DeepSpeed package, which is used for baseline performance evaluation.

Code Reference

Field	Value
Repository	FlexLLMGen
File	benchmark/third_party/DeepSpeed/deepspeed/moe/sharded_moe.py
Lines	1-581
Type	AUTO_KEEP (vendored dependency)

Key class signatures:

class TopKGate(Module):
    def __init__(self, model_dim: int, num_experts: int, k: int = 1,
                 capacity_factor: float = 1.0, eval_capacity_factor: float = 1.0,
                 min_capacity: int = 8, noisy_gate_policy: Optional[str] = None,
                 drop_tokens: bool = True, use_rts: bool = True) -> None:
        ...
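    # Note: the return annotation is a 3-tuple, but the Outputs table in the
    # I/O Contract lists four values (l_aux, combine_weights, dispatch_mask, exp_counts).
    def forward(self, input: torch.Tensor, used_token: torch.Tensor = None,
                use_tutel: bool = False) -> Tuple[Tensor, Tensor, Tensor]: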
        ...

class MOELayer(Base):
    # Orchestrates token dispatch, expert computation, and result combination
    ...

I/O Contract

Inputs

Parameter	Type	Required	Description
model_dim	int	Yes	Embedding dimension size of the model
num_experts	int	Yes	Number of experts in the MoE layer
k	int	No	Number of experts per token (1 or 2, default: 1)
capacity_factor	float	No	Capacity factor for training (default: 1.0)
eval_capacity_factor	float	No	Capacity factor for evaluation (default: 1.0)
min_capacity	int	No	Minimum expert capacity (default: 8)
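noisy_gate_policy	str	No	Noise policy: 'Jitter', 'RSample', or None (default: None)
drop_tokens	bool	No	Whether to drop tokens that exceed expert capacity (default: True)
use_rts	bool	No	Whether to use random token selection when assigning capacity slots (default: True)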
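How capacity_factor and min_capacity interact is easiest to see with a small calculation. The sketch below assumes the usual rule that each expert accepts ceil(tokens / experts * capacity_factor) tokens, floored at min_capacity; `expert_capacity` is a hypothetical helper name, not a function exported by the module.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor=1.0, min_capacity=8):
    """Per-expert token budget: a scaled even share, never below min_capacity."""
    capacity = math.ceil((num_tokens / num_experts) * capacity_factor)
    return max(capacity, min_capacity)


print(expert_capacity(1024, 8))                        # 128
print(expert_capacity(1024, 8, capacity_factor=1.25))  # 160
print(expert_capacity(32, 8))                          # 8 (even share is 4; min_capacity wins)
```

A capacity factor above 1.0 leaves headroom for imbalanced routing at the cost of larger dispatch buffers, while min_capacity guards against very small budgets on short batches.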

Outputs

Output	Type	Description
l_aux	Tensor	Auxiliary load-balancing loss for training
combine_weights	Tensor	Weights for combining expert outputs (shape: [S, E, C])
dispatch_mask	Tensor	Boolean mask for dispatching tokens to experts
exp_counts	Tensor	Token counts per expert for monitoring
