Implementation:FMInference FlexLLMGen DeepSpeed Sharded MoE

From Leeroopedia


| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen, Upstream: DeepSpeed |
| Domains | Distributed_Training, Mixture_Of_Experts |
| Last Updated | 2026-02-09 00:00 GMT |

Overview

Vendored DeepSpeed module implementing sharded Mixture of Experts (MoE) layers with Top-K gating, token routing, and expert-parallel All-to-All communication.

Description

The sharded_moe.py file (581 lines) is a vendored copy of DeepSpeed's MoE implementation, which is itself adapted from Facebook's fairscale library. It provides the core components for building Mixture of Experts transformer models whose experts are sharded across devices and reached through expert-parallel communication.

Key components include:

  • TopKGate -- A gating module that implements Top-1 and Top-2 expert selection as described in the GShard paper. It applies a linear projection to produce expert logits, applies softmax to obtain gate probabilities, and selects the top-k experts per token. It supports noisy gating policies (Jitter for multiplicative noise, RSample for Gumbel noise) and configurable capacity factors.
  • MOELayer -- The main MoE layer, which orchestrates token dispatch to experts and the combination of their results. It uses _AllToAll for expert-parallel communication, sending each token to the GPU that hosts its selected expert and gathering the results back.
  • top1gating / top2gating -- Standalone gating functions that compute dispatch masks and combine weights for Top-1 and Top-2 routing respectively, with support for dropping tokens when an expert's capacity is exceeded.
  • _AllToAll -- A custom autograd function that wraps dist.all_to_all_single and defines a backward pass, so gradients flow through the communication step.
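
The Top-1 routing described in the list above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the vendored code: it uses lists instead of torch tensors, the name top1_route_sketch is hypothetical, tokens are admitted to an expert in arrival order (use_rts and noisy gating are not modelled), and the auxiliary loss is assumed to take the GShard-style form num_experts * sum(mean gate probability * fraction of tokens routed).

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route_sketch(logits, capacity):
    """Hypothetical helper: send each token to its highest-probability expert.

    Returns (assignments, l_aux). assignments[s] is (expert, slot, gate_prob),
    or None when the chosen expert is already full and the token is dropped.
    """
    num_tokens = len(logits)
    num_experts = len(logits[0])
    gates = [softmax(row) for row in logits]
    # Index of the largest gate probability per token (first index wins ties).
    choice = [max(range(num_experts), key=g.__getitem__) for g in gates]

    # Load-balancing loss, computed before any token is dropped.
    mean_gate = [sum(g[e] for g in gates) / num_tokens for e in range(num_experts)]
    routed_frac = [choice.count(e) / num_tokens for e in range(num_experts)]
    l_aux = num_experts * sum(m * c for m, c in zip(mean_gate, routed_frac))

    # Fill expert slots in arrival order; overflow tokens are dropped.
    fill = [0] * num_experts
    assignments = []
    for s, e in enumerate(choice):
        if fill[e] < capacity:
            assignments.append((e, fill[e], gates[s][e]))
            fill[e] += 1
        else:
            assignments.append(None)
    return assignments, l_aux


demo_logits = [[2.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 1.0]]
demo_assignments, demo_l_aux = top1_route_sketch(demo_logits, capacity=2)
```

In this demo three tokens prefer expert 0, which has only two slots, so the last of them is dropped (its entry in demo_assignments is None).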

Helper functions include multiplicative_jitter (multiplies its input by random factors close to 1, which makes the model more resilient to bfloat16 rounding errors), gumbel_rsample (samples from a Gumbel distribution for noisy gating), and optimized einsum rewrites for common MoE tensor operations.
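
The two noise sources can be illustrated with the standard library alone. The sketch below is an assumption-laden illustration rather than the vendored helpers: the function names, the list-based interface, and the default epsilon of 1e-2 are hypothetical, and it shows only the distributions involved (uniform scaling in [1 - epsilon, 1 + epsilon] and standard Gumbel samples via -log(-log(U))), not where the file applies them.

```python
import math
import random


def multiplicative_jitter_sketch(values, epsilon=1e-2, rng=random):
    """Scale each value by an independent factor drawn from [1 - eps, 1 + eps]."""
    if epsilon == 0:
        return list(values)
    return [v * rng.uniform(1.0 - epsilon, 1.0 + epsilon) for v in values]


def gumbel_sample_sketch(n, rng=random):
    """Draw n samples from the standard Gumbel(0, 1) distribution.

    Uses inverse-CDF sampling: if U ~ Uniform(0, 1), then -log(-log(U)) is Gumbel.
    """
    samples = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:  # random() may return 0.0, where log is undefined
            u = rng.random()
        samples.append(-math.log(-math.log(u)))
    return samples


demo_rng = random.Random(0)
demo_values = [1.0, -2.0, 4.0]
demo_jittered = multiplicative_jitter_sketch(demo_values, 0.01, demo_rng)
demo_gumbel = gumbel_sample_sketch(1000, demo_rng)
```

Adding Gumbel noise to logits before taking the argmax is a common way to sample an expert in proportion to its softmax probability, which is the usual motivation for this kind of noisy gating.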

Usage

The MoE components are used within DeepSpeed's training engine when building models with expert parallelism. In FlexLLMGen's benchmark suite, the file ships as part of the vendored DeepSpeed package, which is used for baseline performance evaluation.

Code Reference

| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | benchmark/third_party/DeepSpeed/deepspeed/moe/sharded_moe.py |
| Lines | 1-581 |
| Type | AUTO_KEEP (vendored dependency) |

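Key class signatures:

from typing import Optional, Tuple

import torch
from torch import Tensor
from torch.nn import Module

Base = Module  # assumption: stands in for the base-class alias defined in the source file

class TopKGate(Module):
    def __init__(self, model_dim: int, num_experts: int, k: int = 1,
                 capacity_factor: float = 1.0, eval_capacity_factor: float = 1.0,
                 min_capacity: int = 8, noisy_gate_policy: Optional[str] = None,
                 drop_tokens: bool = True, use_rts: bool = True) -> None:
        ...
    # Annotated as a 3-tuple here; the Outputs table lists four returned values.
    def forward(self, input: torch.Tensor, used_token: torch.Tensor = None,
                use_tutel: bool = False) -> Tuple[Tensor, Tensor, Tensor]:
        ...

class MOELayer(Base):
    # Orchestrates token dispatch, expert computation, and result combination
    ...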
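
The expert-parallel exchange that MOELayer relies on can be simulated without any distributed runtime. The sketch below is a single-process illustration, assuming each rank holds one chunk per destination rank; all_to_all_sketch is a hypothetical name and does not call dist.all_to_all_single. Because the exchange is a transpose of chunks across ranks, applying it a second time restores the original layout, which is one way to see why the backward pass of an all-to-all can itself be an all-to-all on the gradients.

```python
def all_to_all_sketch(send):
    """Simulate an all-to-all exchange in one process.

    send[src][dst] is the chunk that rank `src` sends to rank `dst`.
    Returns recv, where recv[dst][src] is the chunk `dst` received from `src`.
    """
    world_size = len(send)
    return [[send[src][dst] for src in range(world_size)]
            for dst in range(world_size)]


# Three ranks, each holding one chunk addressed to every rank.
send_buffers = [
    ["a0", "a1", "a2"],  # rank 0 sends a0 to rank 0, a1 to rank 1, a2 to rank 2
    ["b0", "b1", "b2"],
    ["c0", "c1", "c2"],
]
recv_buffers = all_to_all_sketch(send_buffers)

# A second exchange undoes the first.
round_trip = all_to_all_sketch(recv_buffers)
```

After the exchange, rank 0 holds the first chunk from every rank, rank 1 the second, and so on. In an MoE layer this is the step that moves tokens to the device hosting their expert and later brings the expert outputs back.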

I/O Contract

Inputs

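| Parameter | Type | Required | Description |
|---|---|---|---|
| model_dim | int | Yes | Embedding dimension size of the model |
| num_experts | int | Yes | Number of experts in the MoE layer |
| k | int | No | Number of experts selected per token (1 or 2, default: 1) |
| capacity_factor | float | No | Capacity factor used during training (default: 1.0) |
| eval_capacity_factor | float | No | Capacity factor used during evaluation (default: 1.0) |
| min_capacity | int | No | Minimum expert capacity (default: 8) |
| noisy_gate_policy | Optional[str] | No | Noise policy: 'Jitter', 'RSample', or None (default: None) |
| drop_tokens | bool | No | Whether to drop tokens that exceed an expert's capacity (default: True) |
| use_rts | bool | No | Whether to use Random Token Selection when choosing which tokens to keep (default: True) |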
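
How the capacity parameters interact can be shown with a small calculation. This is a minimal sketch assuming the common formula capacity = ceil(num_tokens / num_experts * capacity_factor), floored at min_capacity; expert_capacity_sketch is a hypothetical name, and how k enters the calculation (each token occupies two slots under Top-2 routing) should be checked against the vendored source.

```python
import math


def expert_capacity_sketch(num_tokens, num_experts,
                           capacity_factor=1.0, min_capacity=8):
    """Number of token slots each expert gets, under the assumed formula."""
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    # Never let a small batch shrink an expert below the configured minimum.
    return max(capacity, min_capacity)


balanced = expert_capacity_sketch(1024, 8)                         # even split
with_slack = expert_capacity_sketch(1024, 8, capacity_factor=1.25)  # 25% headroom
tiny_batch = expert_capacity_sketch(32, 8)                          # min_capacity applies
```

A capacity factor above 1.0 trades extra memory and computation for fewer dropped tokens when routing is unbalanced.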

Outputs

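| Output | Type | Description |
|---|---|---|
| l_aux | Tensor | Auxiliary load-balancing loss for training |
| combine_weights | Tensor | Weights for combining expert outputs (shape: [S, E, C]) |
| dispatch_mask | Tensor | Boolean mask for dispatching tokens to experts (shape: [S, E, C]) |
| exp_counts | Tensor | Token counts per expert for monitoring |

Here S is the number of tokens, E the number of experts, and C the per-expert capacity.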
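
How a consumer uses dispatch_mask and combine_weights can be sketched with nested lists. The example assumes the usual GShard-style contractions (tokens are placed into expert slots as in einsum "sec,sm->ecm", and expert outputs are merged back as in einsum "sec,ecm->sm") and assumes the mask is simply the non-zero pattern of the weights. The helper names and the toy experts are hypothetical.

```python
def dispatch_sketch(dispatch_mask, tokens):
    """Place tokens into expert slots, in the spirit of einsum('sec,sm->ecm')."""
    num_experts = len(dispatch_mask[0])
    capacity = len(dispatch_mask[0][0])
    model_dim = len(tokens[0])
    out = [[[0.0] * model_dim for _ in range(capacity)]
           for _ in range(num_experts)]
    for s, token in enumerate(tokens):
        for e in range(num_experts):
            for c in range(capacity):
                if dispatch_mask[s][e][c]:
                    for m in range(model_dim):
                        out[e][c][m] += token[m]
    return out


def combine_sketch(combine_weights, expert_out):
    """Weighted sum of expert outputs per token, as in einsum('sec,ecm->sm')."""
    model_dim = len(expert_out[0][0])
    combined = []
    for weights in combine_weights:
        row = [0.0] * model_dim
        for e, slots in enumerate(weights):
            for c, w in enumerate(slots):
                for m in range(model_dim):
                    row[m] += w * expert_out[e][c][m]
        combined.append(row)
    return combined


# S=3 tokens, E=2 experts, C=1 slot each, model_dim=2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Token 0 -> expert 0, token 1 -> expert 1, token 2 dropped (all-zero weights).
combine_weights = [[[0.9], [0.0]], [[0.0], [0.6]], [[0.0], [0.0]]]
dispatch_mask = [[[w != 0.0 for w in slots] for slots in per_token]
                 for per_token in combine_weights]

dispatched = dispatch_sketch(dispatch_mask, tokens)
# Toy experts: expert e multiplies its input by (e + 1).
expert_out = [[[(e + 1) * x for x in slot] for slot in slots]
              for e, slots in enumerate(dispatched)]
combined = combine_sketch(combine_weights, expert_out)
```

A dropped token has all-zero combine weights, so its combined output is a zero vector; in a transformer block the residual connection typically carries such tokens through unchanged.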
