Principle:Unslothai Unsloth MoE Expert Routing
| Knowledge Sources | |
|---|---|
| Domains | MoE, Model_Architecture, Token_Routing |
| Last Updated | 2026-02-07 08:40 GMT |
Overview
Mechanism for dynamically dispatching input tokens to a subset of specialized expert networks within a Mixture-of-Experts layer.
Description
MoE Expert Routing determines which expert sub-networks process each input token. A learned router (gate) network produces logits over the expert set, from which top-k experts are selected per token. Routing weights are computed via softmax or sigmoid activation, optionally renormalized after selection. The routing process produces three key outputs: selected expert IDs, routing weights for output aggregation, and gather/scatter indices for efficient token permutation between token-order and expert-order representations.
Different model architectures implement routing differently: Qwen3 uses softmax with renormalization, while Llama4 uses sigmoid activation without renormalization. Both approaches can be optimized with fused Triton grouped GEMM kernels that eliminate separate permutation steps.
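The two variants described above can be sketched as follows. This is a minimal illustration assuming PyTorch, not the code of either model: the function names `route_softmax_renorm` and `route_sigmoid` are hypothetical, the logits are made up, and real implementations may apply the activation and the top-k selection in a different order.

```python
import torch


def route_softmax_renorm(logits: torch.Tensor, k: int):
    """Softmax over all experts, top-k selection, then renormalization."""
    probs = torch.softmax(logits.float(), dim=-1)
    topk_weights, topk_ids = torch.topk(probs, k, dim=-1)
    # Rescale so the k selected weights of each token sum to 1.
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids


def route_sigmoid(logits: torch.Tensor, k: int):
    """Independent sigmoid score per expert, top-k selection, no renormalization."""
    scores = torch.sigmoid(logits.float())
    topk_weights, topk_ids = torch.topk(scores, k, dim=-1)
    return topk_weights, topk_ids


# Illustrative router logits: 2 tokens, 4 experts (values chosen without ties).
logits = torch.tensor([[2.0, 1.0, 0.0, -1.0],
                       [0.0, 3.0, 0.5, -0.5]])

w_soft, ids_soft = route_softmax_renorm(logits, k=2)
w_sig, ids_sig = route_sigmoid(logits, k=2)
```

Both activations are monotonic in the logits, so the two variants select the same experts here; they differ only in the weights used to combine the expert outputs. The renormalized softmax weights sum to 1 per token, while the sigmoid weights generally do not.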
Usage
Apply this principle when implementing or optimizing MoE transformer layers. The routing strategy directly impacts model quality (expert utilization, load balancing) and computational efficiency (token permutation cost, kernel fusion opportunities).
Theoretical Basis
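For a router gate with weight matrix $W_g$ over $E$ experts and a token representation $x$:
- Gating: $g = \sigma(x W_g) \in \mathbb{R}^{E}$, where $\sigma$ is softmax or sigmoid
- Selection: $\mathcal{T} = \operatorname{TopK}(g, k)$ selects $k$ experts per token
- Routing weights: $w_i = g_i / \sum_{j \in \mathcal{T}} g_j$ for $i \in \mathcal{T}$ (with renormalization)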
- Token dispatch: Group tokens by assigned expert using argsort
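As a worked example of the gating, selection, and renormalization steps (the logits are made up for illustration), take a single token with router logits $z = (2, 1, 0, -1)$ over $E = 4$ experts, softmax gating, and $k = 2$:

$$
\operatorname{softmax}(z) \approx (0.644,\ 0.237,\ 0.087,\ 0.032)
$$

Top-2 selection keeps the first two experts, whose probabilities sum to about $0.881$. Renormalizing gives the routing weights

$$
\left(\frac{0.644}{0.881},\ \frac{0.237}{0.881}\right) \approx (0.731,\ 0.269)
$$

which is the same as taking a softmax over the two selected logits alone, since $e^{2} / (e^{2} + e^{1}) \approx 0.731$.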
Pseudo-code Logic:
# Abstract routing algorithm
logits = hidden_states @ gate_weight  # [batch*seq, num_experts]
weights = softmax(logits, dim=-1)  # or sigmoid(logits)
topk_weights, topk_ids = torch.topk(weights, k, dim=-1)  # each [batch*seq, k]
if renormalize:
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
# Group tokens by expert
counts, gather_idx = get_routing_indices(topk_ids, num_experts)
permuted_tokens = permute(hidden_states, gather_idx, k)  # token order -> expert order
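The grouping step can be made concrete with a small runnable sketch. The names `get_routing_indices` and `permute` come from the pseudo-code above, but the bodies below are assumptions written for illustration, not the library's implementation. `unpermute` is a hypothetical helper added to show the reverse mapping.

```python
import torch


def get_routing_indices(topk_ids: torch.Tensor, num_experts: int):
    """Return tokens-per-expert counts and indices that sort the flattened
    (token, slot) assignments into expert order."""
    flat_ids = topk_ids.reshape(-1)  # [num_tokens * k]
    counts = torch.bincount(flat_ids, minlength=num_experts)
    # A stable sort keeps tokens in their original order within each expert.
    gather_idx = torch.argsort(flat_ids, stable=True)
    return counts, gather_idx


def permute(x: torch.Tensor, gather_idx: torch.Tensor, k: int):
    """Token order -> expert order. Flat slot i belongs to token i // k."""
    return x[gather_idx // k]


def unpermute(y: torch.Tensor, gather_idx: torch.Tensor):
    """Expert order -> flat (token, slot) order (hypothetical helper)."""
    out = torch.empty_like(y)
    out[gather_idx] = y
    return out


# Toy input: 3 tokens, hidden size 2, 3 experts, k = 2 (values are illustrative).
hidden_states = torch.arange(6, dtype=torch.float32).reshape(3, 2)
topk_ids = torch.tensor([[2, 0], [1, 2], [0, 2]])

counts, gather_idx = get_routing_indices(topk_ids, num_experts=3)
permuted_tokens = permute(hidden_states, gather_idx, k=2)
restored = unpermute(permuted_tokens, gather_idx)
```

In this sketch, `counts` gives the size of each expert's contiguous block in `permuted_tokens`, which is the layout a grouped GEMM consumes. After the experts run, the outputs are scattered back with the same indices and combined per token using the routing weights.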