Principle:Unslothai Unsloth MoE Expert Routing

Knowledge Sources	Switch Transformers, Mixtral of Experts, Unsloth
Domains	MoE, Model_Architecture, Token_Routing
Last Updated	2026-02-07 08:40 GMT

Overview

Mechanism for dynamically dispatching input tokens to a subset of specialized expert networks within a Mixture-of-Experts layer.

Description

MoE Expert Routing determines which expert sub-networks process each input token. A learned router (gate) network produces logits over the expert set, from which top-k experts are selected per token. Routing weights are computed via softmax or sigmoid activation, optionally renormalized after selection. The routing process produces three key outputs: selected expert IDs, routing weights for output aggregation, and gather/scatter indices for efficient token permutation between token-order and expert-order representations.

Different model architectures implement routing differently: Qwen3 uses softmax with renormalization, while Llama4 uses sigmoid activation without renormalization. Both approaches can be optimized with fused Triton grouped GEMM kernels that eliminate separate permutation steps.
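The two weighting styles can be compared on a single token. The following is a minimal, dependency-free sketch rather than the code of either model or of Unsloth: `route_token`, its parameters, the example logits, and the choice of k in each call are illustrative assumptions. In a tensor library the same steps would map onto a softmax or sigmoid followed by a top-k selection.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route_token(logits, k, activation="softmax", renormalize=True):
    """Hypothetical helper: return (expert_ids, weights) for one token."""
    if activation == "softmax":
        scores = softmax(logits)
    elif activation == "sigmoid":
        scores = [sigmoid(x) for x in logits]
    else:
        raise ValueError(f"unknown activation: {activation}")
    # Top-k by score, highest first; ties go to the lower expert id.
    ids = sorted(range(len(scores)), key=lambda e: (-scores[e], e))[:k]
    weights = [scores[e] for e in ids]
    if renormalize:
        # Rescale the selected weights so they sum to 1.
        total = sum(weights)
        weights = [w / total for w in weights]
    return ids, weights

logits = [2.0, 0.5, 1.0, -1.0]  # router logits for one token, 4 experts

# Softmax scores, top-2, renormalized: the weights sum to 1.
softmax_ids, softmax_w = route_token(logits, k=2, activation="softmax", renormalize=True)

# Sigmoid scores, top-1, not renormalized: the weight is the raw gate value.
sigmoid_ids, sigmoid_w = route_token(logits, k=1, activation="sigmoid", renormalize=False)

print(softmax_ids, softmax_w)
print(sigmoid_ids, sigmoid_w)
```

With renormalization the selected weights always sum to 1, so the layer output is a convex combination of expert outputs. Without it, each sigmoid weight acts as an independent gate in (0, 1).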

Usage

Apply this principle when implementing or optimizing MoE transformer layers. The routing strategy directly impacts model quality (expert utilization, load balancing) and computational efficiency (token permutation cost, kernel fusion opportunities).

Theoretical Basis

For a token hidden state $h \in \mathbb{R}^{d}$ and a router gate $G \in \mathbb{R}^{d \times E}$ with $E$ experts:

Gating: $g = \sigma(h \cdot G)$, where $\sigma$ is softmax or sigmoid
Selection: $\operatorname{topk}(g, k)$ selects $k$ experts per token
Routing weights: $w_{i} = \frac{g_{i}}{\sum_{j \in \operatorname{topk}(g, k)} g_{j}}$ (with renormalization); $w_{i} = g_{i}$ otherwise
Token dispatch: Group tokens by assigned expert using argsort
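
Worked example (illustrative numbers, not taken from any model): take $E = 4$, $k = 2$, and softmax gate values $g = (0.5, 0.3, 0.15, 0.05)$. Selection keeps experts 1 and 2, so renormalization gives

$w_{1} = \frac{0.5}{0.5 + 0.3} = 0.625, \quad w_{2} = \frac{0.3}{0.5 + 0.3} = 0.375$

The weights sum to 1. Writing $f_{i}$ for the $i$-th expert network (notation introduced here for illustration), the aggregated output for this token would be $y = 0.625\, f_{1}(h) + 0.375\, f_{2}(h)$. Without renormalization the weights would remain $0.5$ and $0.3$.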

Pseudo-code Logic:

# Abstract routing algorithm
logits = hidden_states @ gate_weight  # [batch*seq, num_experts]
weights = softmax(logits, dim=-1)  # or sigmoid(logits)
topk_weights, topk_ids = torch.topk(weights, k, dim=-1)  # each [batch*seq, k]
if renormalize:
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
# Group tokens by expert
counts, gather_idx = get_routing_indices(topk_ids, num_experts)
permuted_tokens = permute(hidden_states, gather_idx, k)  # expert-order copy of the tokens
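
The grouping step can be made concrete with a small, dependency-free sketch. The functions below reuse the names from the pseudo-code, but they are hypothetical stand-ins written for illustration, not the Unsloth implementations. They use plain Python lists and a stable sort in place of a tensor argsort, and `unpermute` is an assumed name for the inverse scatter.

```python
def get_routing_indices(topk_ids, num_experts):
    """Illustrative stand-in: per-expert token counts and gather indices.

    topk_ids is a list with one row of k expert ids per token.
    gather_idx lists flat (token, slot) positions sorted by expert id.
    """
    flat = [e for row in topk_ids for e in row]  # length num_tokens * k
    counts = [0] * num_experts
    for e in flat:
        counts[e] += 1
    # Stable argsort by expert id keeps token order within each expert.
    gather_idx = sorted(range(len(flat)), key=lambda i: flat[i])
    return counts, gather_idx

def permute(tokens, gather_idx, k):
    # Flat position i belongs to token i // k, so this copies each token
    # once per selected expert, laid out in expert order.
    return [tokens[i // k] for i in gather_idx]

def unpermute(expert_out, gather_idx):
    # Scatter expert-order rows back to flat token order (inverse of the gather).
    out = [None] * len(gather_idx)
    for pos, i in enumerate(gather_idx):
        out[i] = expert_out[pos]
    return out

topk_ids = [[2, 0], [1, 2], [0, 1]]  # 3 tokens, k = 2, 3 experts
tokens = ["t0", "t1", "t2"]          # placeholders for hidden-state rows

counts, gather_idx = get_routing_indices(topk_ids, num_experts=3)
permuted = permute(tokens, gather_idx, k=2)
restored = unpermute(permuted, gather_idx)

print(counts)      # [2, 2, 2]
print(gather_idx)  # [1, 4, 2, 5, 0, 3]
print(permuted)    # ['t0', 't2', 't1', 't2', 't0', 't1']
print(restored)    # ['t0', 't0', 't1', 't1', 't2', 't2']
```

In `permuted`, the first `counts[0]` rows belong to expert 0, the next `counts[1]` rows to expert 1, and so on, which is the contiguous layout a grouped GEMM consumes. After the experts run, `unpermute` restores flat token order so that each token's k outputs can be combined with its routing weights.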
