Principle:Unslothai Unsloth MoE Expert Routing

From Leeroopedia


Knowledge Sources
Domains MoE, Model_Architecture, Token_Routing
Last Updated 2026-02-07 08:40 GMT

Overview

Mechanism for dynamically dispatching input tokens to a subset of specialized expert networks within a Mixture-of-Experts layer.

Description

MoE Expert Routing determines which expert sub-networks process each input token. A learned router (gate) network produces logits over the expert set, from which top-k experts are selected per token. Routing weights are computed via softmax or sigmoid activation, optionally renormalized after selection. The routing process produces three key outputs: selected expert IDs, routing weights for output aggregation, and gather/scatter indices for efficient token permutation between token-order and expert-order representations.

Different model architectures implement routing differently: Qwen3 uses softmax with renormalization, while Llama4 uses sigmoid activation without renormalization. Both approaches can be optimized with fused Triton grouped GEMM kernels that eliminate separate permutation steps.
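The difference between the two weighting schemes can be seen on a single token. The following is a minimal sketch in plain Python (standard library only, so it runs without PyTorch); the function names `route_softmax_renorm` and `route_sigmoid` are hypothetical and are not taken from Unsloth or from either model's code. It follows the gate-then-select order described above and only illustrates how the resulting weights differ.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route_softmax_renorm(logits, k):
    """Softmax over all experts, keep the top-k, rescale kept weights to sum to 1."""
    probs = softmax(logits)
    topk_ids = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    kept = [probs[e] for e in topk_ids]
    total = sum(kept)
    return topk_ids, [w / total for w in kept]

def route_sigmoid(logits, k):
    """Independent sigmoid per expert, keep the top-k, no renormalization."""
    gates = [sigmoid(x) for x in logits]
    topk_ids = sorted(range(len(gates)), key=lambda e: gates[e], reverse=True)[:k]
    return topk_ids, [gates[e] for e in topk_ids]

# One token, four experts (illustrative logits).
logits = [2.0, 0.5, -1.0, 1.0]
ids_a, w_a = route_softmax_renorm(logits, k=2)
ids_b, w_b = route_sigmoid(logits, k=2)
```

Both variants pick the same experts here, because softmax and sigmoid are each monotonic in the logits. Only the weights differ: the renormalized softmax weights sum to 1, while the sigmoid weights are independent per expert and need not sum to 1.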

Usage

Apply this principle when implementing or optimizing MoE transformer layers. The routing strategy directly impacts model quality (expert utilization, load balancing) and computational efficiency (token permutation cost, kernel fusion opportunities).
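Expert utilization can be inspected directly from the selected expert IDs. The sketch below is an illustration under assumptions: the helper `expert_load` and the max-over-mean imbalance ratio are choices made for this example, not a metric defined by the source.

```python
from collections import Counter

def expert_load(topk_ids, num_experts):
    """Return each expert's share of token assignments and a max/mean imbalance ratio."""
    counts = Counter(e for row in topk_ids for e in row)
    total = sum(counts.values())
    shares = [counts.get(e, 0) / total for e in range(num_experts)]
    # The mean share is 1 / num_experts, so max / mean = max * num_experts.
    # A value of 1.0 means perfectly uniform load.
    imbalance = max(shares) * num_experts
    return shares, imbalance

# Four tokens, k = 2, four experts: expert 0 is selected by every token.
topk_ids = [[0, 1], [0, 2], [0, 1], [0, 3]]
shares, imbalance = expert_load(topk_ids, num_experts=4)
```

In this toy batch expert 0 receives half of all assignments, twice the uniform share, which is the kind of skew that load balancing aims to reduce.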

Theoretical Basis

For a router gate $G \in \mathbb{R}^{d \times E}$ with $E$ experts:

  1. Gating: $g = \sigma(hG)$, where $\sigma$ is softmax or sigmoid
  2. Selection: $\mathrm{topk}(g, k)$ selects $k$ experts per token
  3. Routing weights: $w_i = g_i / \sum_{j \in \mathrm{topk}} g_j$ (with renormalization)
  4. Token dispatch: Group tokens by assigned expert using argsort
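For the softmax case, the renormalization in step 3 has a simple closed form. The short derivation below writes $z = hG$ for the router logits and $S$ for the selected expert set; both symbols are notation introduced here for illustration.

```latex
\begin{aligned}
g_i &= \frac{e^{z_i}}{\sum_{m=1}^{E} e^{z_m}}, \qquad S = \mathrm{topk}(g, k), \\
w_i &= \frac{g_i}{\sum_{j \in S} g_j}
     = \frac{e^{z_i} / \sum_{m=1}^{E} e^{z_m}}{\sum_{j \in S} e^{z_j} / \sum_{m=1}^{E} e^{z_m}}
     = \frac{e^{z_i}}{\sum_{j \in S} e^{z_j}}, \qquad i \in S.
\end{aligned}
```

The full-softmax denominator cancels, so renormalized top-k softmax weights equal a softmax taken over only the $k$ selected logits. No such cancellation applies to sigmoid gating, where each $g_i$ depends only on its own logit.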

Pseudo-code Logic:

# Abstract routing algorithm (get_routing_indices and permute are helper routines)
logits = hidden_states @ gate_weight  # [batch*seq, num_experts]
weights = torch.softmax(logits, dim=-1)  # or torch.sigmoid(logits)
topk_weights, topk_ids = torch.topk(weights, k, dim=-1)  # each [batch*seq, k]
if renormalize:
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
# Group tokens by expert
counts, gather_idx = get_routing_indices(topk_ids, num_experts)
permuted_tokens = permute(hidden_states, gather_idx, k)
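The grouping step can be made concrete with a dependency-free sketch. The versions of `get_routing_indices` and `permute` below are hypothetical stand-ins written for this example in plain Python, not Unsloth's implementations, and `unpermute` is an added helper. Tokens are scalars and each "expert" simply scales its input, which is enough to follow the round trip from token order to expert order and back.

```python
def get_routing_indices(topk_ids, num_experts):
    """Return (tokens per expert, gather_idx) for per-token lists of k expert IDs."""
    flat = [e for row in topk_ids for e in row]  # length num_tokens * k
    counts = [0] * num_experts
    for e in flat:
        counts[e] += 1
    # Stable argsort by expert ID: flat (token, slot) positions in expert order.
    gather_idx = sorted(range(len(flat)), key=lambda i: flat[i])
    return counts, gather_idx

def permute(tokens, gather_idx, k):
    # Flat position p belongs to token p // k.
    return [tokens[p // k] for p in gather_idx]

def unpermute(expert_out, gather_idx):
    # Scatter expert-order results back to flat (token, slot) order.
    out = [None] * len(gather_idx)
    for dst, src in enumerate(gather_idx):
        out[src] = expert_out[dst]
    return out

k, num_experts = 2, 3
tokens = [10.0, 20.0, 30.0]
topk_ids = [[2, 0], [1, 2], [0, 1]]
topk_weights = [[0.75, 0.25], [0.5, 0.5], [0.25, 0.75]]

counts, gather_idx = get_routing_indices(topk_ids, num_experts)
permuted = permute(tokens, gather_idx, k)

# Toy experts: expert e multiplies its contiguous slice by (e + 1).
expert_out, start = [], 0
for e, n in enumerate(counts):
    expert_out += [x * (e + 1) for x in permuted[start:start + n]]
    start += n

# Restore token order, then combine the k expert outputs per token.
restored = unpermute(expert_out, gather_idx)
outputs = [
    sum(topk_weights[t][s] * restored[t * k + s] for s in range(k))
    for t in range(len(tokens))
]
```

After the permutation, each expert's tokens occupy one contiguous slice whose length is given by `counts`, which is the layout a grouped GEMM consumes. A fused kernel would fold the gather and scatter into its load and store steps rather than materializing `permuted` and `restored`.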
