Principle:FMInference FlexLLMGen Sharded Mixture Of Experts
| Field | Value |
|---|---|
| Sources | Paper: GShard, Upstream: DeepSpeed |
| Domains | Distributed_Training, Mixture_Of_Experts |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A distributed computation pattern where tokens are dynamically routed to a subset of specialized expert networks distributed across multiple GPUs, enabling conditional computation that scales model capacity without proportionally scaling compute.
Description
Sharded Mixture of Experts (MoE) is a technique for building very large neural networks that remain computationally efficient by activating only a small fraction of parameters for each input token. The key idea is that each layer contains multiple "expert" sub-networks, and a learned gating function decides which expert(s) process each token.
The sharded aspect means experts are distributed across multiple GPUs (expert parallelism). Each GPU holds a subset of experts, and tokens must be communicated between GPUs via All-to-All collective operations.
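The exchange pattern can be illustrated without any GPUs. The following is a minimal sketch, assuming each rank hosts exactly one expert and that tokens are plain labeled strings; the helper `all_to_all` is a hypothetical stand-in for a real collective, not DeepSpeed's API. It shows that All-to-All behaves like a transpose of the per-rank send buffers, and that applying it a second time returns every token to its originating rank.

```python
from typing import List

def all_to_all(send: List[List[list]]) -> List[List[list]]:
    """Simulate an All-to-All exchange in a single process.

    send[src][dst] is the chunk that rank `src` sends to rank `dst`.
    The result recv satisfies recv[dst][src] == send[src][dst].
    """
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]

# Two ranks, one expert per rank (assumption made for illustration).
# send[r][e] holds the tokens on rank r that the gate routed to expert e.
send = [
    [["r0_t0"], ["r0_t1", "r0_t2"]],   # rank 0: one token for expert 0, two for expert 1
    [["r1_t0", "r1_t1"], ["r1_t2"]],   # rank 1: two tokens for expert 0, one for expert 1
]

# Dispatch: after the exchange, rank e holds every token routed to expert e.
recv = all_to_all(send)

# Combine: a second exchange sends expert outputs back to their source ranks.
returned = all_to_all(recv)
```

In a real system the chunks have a fixed size set by the expert capacity, so that every rank sends and receives equally sized buffers; the variable-length lists here are only for readability.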
The core mechanisms include:
- Top-K gating -- A linear projection followed by softmax produces per-token probabilities over experts. Top-1 selects the single highest-probability expert; Top-2 selects the two highest-probability experts and combines their outputs, weighted by the normalized gate probabilities. The gating function is learned end-to-end.
- Capacity factor -- Each expert has a fixed buffer capacity proportional to (num_tokens / num_experts) * capacity_factor. Tokens exceeding this capacity are dropped, preventing load imbalance from overwhelming individual experts.
- Load balancing loss -- An auxiliary loss (l_aux) encourages uniform expert utilization, preventing "expert collapse" where only a few experts are selected.
- Noisy gating -- Adding noise (multiplicative jitter or Gumbel sampling) to gate logits during training improves exploration and helps maintain balanced expert usage.
- Token dropping -- When an expert's capacity is exceeded, excess tokens are dropped to maintain computational efficiency, though this can be disabled at the cost of variable compute.
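The interaction between gating, capacity, and token dropping listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the DeepSpeed implementation: it covers Top-1 routing only, omits noise and the auxiliary loss, rounds the capacity up with `ceil`, and processes tokens in order so that later tokens are the ones dropped. The function names `expert_capacity` and `top1_route` are hypothetical.

```python
import math
from typing import List, Optional, Tuple

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Buffer slots per expert: (num_tokens / num_experts) * capacity_factor, rounded up."""
    return math.ceil(num_tokens / num_experts * capacity_factor)

def top1_route(
    gate_logits: List[List[float]], capacity_factor: float = 1.0
) -> Tuple[List[Optional[Tuple[int, float]]], List[int]]:
    """Assign each token to its highest-probability expert, dropping overflow.

    Returns (assignments, load): assignments[t] is (expert, gate_prob),
    or None if token t was dropped; load[e] counts tokens kept by expert e.
    """
    num_tokens = len(gate_logits)
    num_experts = len(gate_logits[0])
    capacity = expert_capacity(num_tokens, num_experts, capacity_factor)
    load = [0] * num_experts
    assignments: List[Optional[Tuple[int, float]]] = []
    for logits in gate_logits:
        probs = softmax(logits)
        expert = max(range(num_experts), key=lambda e: probs[e])
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append((expert, probs[expert]))
        else:
            assignments.append(None)  # expert buffer is full: token is dropped
    return assignments, load

# Four tokens, two experts; the first three tokens all prefer expert 0.
logits = [[2.0, 0.0], [3.0, 1.0], [1.0, 0.0], [0.0, 2.0]]
assignments, load = top1_route(logits, capacity_factor=1.0)
# Capacity is 2, so the third token is dropped and load is [2, 1].
relaxed_assignments, relaxed_load = top1_route(logits, capacity_factor=2.0)
# Capacity is 4, so nothing is dropped and load is [3, 1].
```

Raising the capacity factor trades memory and compute for fewer dropped tokens, which is the tension the capacity and token-dropping bullets describe.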
Usage
MoE is used when scaling model capacity beyond what is feasible with dense models. It enables trillion-parameter models that use the same compute per token as much smaller dense models. The approach is relevant to FlexLLMGen's benchmarking of large-scale model inference.
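A rough parameter count makes the "same compute per token" claim concrete. The sketch below assumes each expert is a two-matrix feed-forward block of shape d_model x d_expert and d_expert x d_model, ignores biases and the comparatively small gate projection, and uses hypothetical dimensions chosen only for illustration.

```python
from typing import Tuple

def moe_ffn_params(d_model: int, d_expert: int, num_experts: int, top_k: int) -> Tuple[int, int]:
    """Return (total, active) expert parameters for one MoE feed-forward layer.

    total  : parameters stored across all experts
    active : parameters actually used to process a single token
    Biases and the gate projection are ignored (assumption).
    """
    per_expert = 2 * d_model * d_expert  # up-projection plus down-projection
    total = num_experts * per_expert
    active = top_k * per_expert
    return total, active

# Hypothetical sizes: growing from 8 to 64 experts with Top-1 gating.
total_8, active_8 = moe_ffn_params(d_model=1024, d_expert=4096, num_experts=8, top_k=1)
total_64, active_64 = moe_ffn_params(d_model=1024, d_expert=4096, num_experts=64, top_k=1)

print(f"8 experts : total={total_8:,} active={active_8:,}")
print(f"64 experts: total={total_64:,} active={active_64:,}")
```

Stored parameters grow eightfold between the two configurations while the parameters touched per token stay the same, which is why the experts must be sharded across GPUs even though per-token compute does not grow.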
Theoretical Basis
The parameter count of an MoE layer scales linearly with the number of experts E, while compute per token stays proportional to the k experts selected by the gate times the expert size, independent of E. For Top-1 gating, compute is O(d * d_expert) per token regardless of the total number of experts. The auxiliary load balancing loss l_aux is computed as:
l_aux = num_experts * sum(fraction_of_tokens_per_expert * mean_gate_probability_per_expert)
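This is minimized when tokens are uniformly distributed across experts. When both the token fractions and the mean gate probabilities are uniform (each equal to 1 / num_experts), l_aux equals 1; routing that concentrates on a few experts pushes it higher. The All-to-All communication cost is O(S * d / P) per GPU for S tokens routed across the group (sequence length times batch size), model dimension d, and P GPUs, since each GPU sends and receives roughly S / P token vectors of size d.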
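The l_aux formula above can be checked numerically. This is a minimal sketch for the Top-1 case, assuming the gate probabilities and the chosen expert per token are already available; the function name `load_balancing_loss` is hypothetical, and a real implementation would operate on tensors so that gradients flow through the mean gate probabilities.

```python
from typing import List

def load_balancing_loss(gate_probs: List[List[float]], chosen: List[int]) -> float:
    """l_aux = num_experts * sum_e(fraction_of_tokens[e] * mean_gate_probability[e]).

    gate_probs[t][e] : softmax probability of token t for expert e
    chosen[t]        : index of the expert selected for token t (Top-1)
    """
    num_tokens = len(gate_probs)
    num_experts = len(gate_probs[0])
    # Fraction of tokens dispatched to each expert.
    fraction = [0.0] * num_experts
    for expert in chosen:
        fraction[expert] += 1.0 / num_tokens
    # Mean gate probability assigned to each expert across all tokens.
    mean_prob = [
        sum(p[e] for p in gate_probs) / num_tokens for e in range(num_experts)
    ]
    return num_experts * sum(f * m for f, m in zip(fraction, mean_prob))

# Balanced routing: two experts, tokens split evenly, uniform probabilities.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1])

# Collapsed routing: every token strongly prefers and selects expert 0.
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0])

print(balanced, collapsed)
```

The balanced case evaluates to 1 and the collapsed case to roughly 1.8, so adding l_aux to the training loss penalizes the gate for concentrating tokens on one expert.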