Principle:LLMBook zh LLMBook zh github io Mixture of Experts Routing
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Architecture |
| Last Updated | 2026-02-08 04:29 GMT |
Overview
Sparse architecture pattern that routes each token to a subset of expert networks via a gating function, scaling model capacity without proportional compute increase.
Description
Mixture of Experts (MoE) is a conditional computation technique that replaces a single feed-forward network with multiple parallel expert networks and a routing (gating) mechanism. For each input token, the gating network selects the top-k experts (typically k=1 or k=2) and computes a weighted combination of their outputs. This allows the model to have a very large number of parameters (capacity) while only activating a fraction of them for each token (efficiency). MoE is the architecture behind models like Mixtral, Switch Transformer, and GShard.
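To make the capacity-versus-compute trade-off concrete, the sketch below counts feed-forward parameters for one MoE layer. It is an illustration under stated assumptions, not a description of any particular model: each expert is taken to be a two-matrix FFN without biases, the router is a single linear projection, and the function name `moe_ffn_param_counts` and the dimensions used are hypothetical.

```python
def moe_ffn_param_counts(d_model, d_ff, num_experts, k):
    """Return (total, active-per-token) parameter counts for one MoE FFN layer.

    Assumptions (hypothetical, for illustration only):
    - each expert is a two-matrix FFN: d_model -> d_ff -> d_model, no biases
    - the router is one linear projection: d_model -> num_experts
    """
    expert_params = 2 * d_model * d_ff            # up-projection + down-projection
    router_params = d_model * num_experts         # gating projection
    total = num_experts * expert_params + router_params
    active = k * expert_params + router_params    # only k experts run per token
    return total, active


# Made-up dimensions: 8 experts, top-2 routing.
total, active = moe_ffn_param_counts(d_model=1024, d_ff=4096, num_experts=8, k=2)
print(total)           # 67117056
print(active)          # 16785408
print(active / total)  # roughly 0.25: about a quarter of the layer is used per token
```

With these numbers the layer stores about four times as many parameters as it uses for any single token, which is the "capacity without proportional compute" property described above.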
Usage
Use this principle when studying scalable Transformer architectures that achieve high parameter counts without proportional increases in computation. MoE layers typically replace the dense feed-forward (MLP) layers in Transformer blocks. The routing decision is made per-token, so different tokens in the same batch may be processed by different experts.
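The per-token nature of routing can be seen in a small sketch that groups the tokens of a batch by the experts they are sent to. The gate logits below are made-up values, and `top_k_indices` and `group_tokens_by_expert` are hypothetical helper names introduced only for this example.

```python
def top_k_indices(logits, k):
    """Indices of the k largest logits, highest first (ties keep index order)."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def group_tokens_by_expert(gate_logits, k):
    """Map each expert index to the list of token indices routed to it.

    gate_logits: one list of per-expert logits for every token in the batch.
    """
    num_experts = len(gate_logits[0])
    assignment = {expert: [] for expert in range(num_experts)}
    for token_idx, logits in enumerate(gate_logits):
        for expert in top_k_indices(logits, k):
            assignment[expert].append(token_idx)
    return assignment


# Four tokens, three experts (illustrative logits).
gate_logits = [
    [2.0, 0.1, -1.0],   # token 0
    [0.0, 3.0, 1.0],    # token 1
    [1.5, -0.5, 0.2],   # token 2
    [-1.0, 0.3, 2.2],   # token 3
]

print(group_tokens_by_expert(gate_logits, k=1))  # {0: [0, 2], 1: [1], 2: [3]}
print(group_tokens_by_expert(gate_logits, k=2))  # {0: [0, 2], 1: [0, 1, 3], 2: [1, 2, 3]}
```

Tokens in the same batch end up in different groups, and the groups are generally uneven in size.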
Theoretical Basis
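The MoE layer computes:

$$y = \sum_{i=1}^{N} g_i(x) \, E_i(x)$$

Where:
- $E_i$ is the $i$-th expert network (typically a standard FFN)
- The routing weights are: $g(x) = \mathrm{Softmax}(\mathrm{TopK}(W_g x, k))$
- $W_g$ is the gating network (a linear projection)
- TopK selects the $k$ experts with highest gate logits; all other experts receive zero weight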
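As a small worked example of the routing weights, suppose there are four experts, $k = 2$, and the gate logits for one token are $h = (2.0, 1.0, 0.5, -1.0)$. The numbers are made up for illustration, and the symbols $h$ (gate logits), $g_i$ (routing weights), and $E_i$ (experts) are notation assumed for this example. TopK keeps experts 1 and 2, and the softmax is taken over their two logits only:

$$
\begin{aligned}
g_1 &= \frac{e^{2.0}}{e^{2.0} + e^{1.0}} = \frac{1}{1 + e^{-1}} \approx 0.731 \\
g_2 &= \frac{e^{1.0}}{e^{2.0} + e^{1.0}} \approx 0.269 \\
g_3 &= g_4 = 0 \\
y &= g_1 E_1(x) + g_2 E_2(x) \approx 0.731\, E_1(x) + 0.269\, E_2(x)
\end{aligned}
$$

Experts 3 and 4 are never evaluated for this token, which is where the compute saving comes from.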
Pseudo-code Logic:
# Abstract algorithm description (NOT real implementation)
gate_logits = gate(input) # linear projection
weights, selected = topk(gate_logits, k) # select top-k experts
weights = softmax(weights) # normalize weights
output = sum(w_i * expert_i(input) for i in selected)
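
The abstract steps above can be turned into a small runnable sketch for a single token. This is a minimal pure-Python illustration under stated assumptions, not a reference implementation: the gate is a plain matrix-vector product, the experts are toy functions that only scale their input, and the names `moe_forward`, `matvec`, and `make_scaling_expert` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def moe_forward(x, gate_weights, experts, k):
    """Route one token x through the top-k experts and mix their outputs."""
    gate_logits = matvec(gate_weights, x)                 # linear projection
    selected = sorted(range(len(experts)),
                      key=lambda i: gate_logits[i],
                      reverse=True)[:k]                   # top-k expert indices
    weights = softmax([gate_logits[i] for i in selected])  # normalize over top-k only
    output = [0.0] * len(x)
    for w, i in zip(weights, selected):
        expert_out = experts[i](x)                        # only selected experts run
        output = [o + w * e for o, e in zip(output, expert_out)]
    return output, selected, weights


def make_scaling_expert(scale):
    """Toy stand-in for an FFN expert (assumption: real experts are MLPs)."""
    return lambda x: [scale * v for v in x]


# Three toy experts and a hand-picked gate matrix (one row per expert).
experts = [make_scaling_expert(s) for s in (1.0, 10.0, 100.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [1.0, 2.0]  # gate logits become [1.0, 2.0, -3.0]

output, selected, weights = moe_forward(x, gate_weights, experts, k=2)
print(selected)  # [1, 0]
print(weights)   # approximately [0.731, 0.269]
print(output)    # approximately [7.58, 15.16]
```

Production implementations batch this computation across tokens and usually add details such as load-balancing losses and expert capacity limits, which are outside the scope of this sketch.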