Principle:Unslothai Unsloth MoE Kernel Autotuning

From Leeroopedia


Knowledge Sources
Domains: MoE, Kernel_Optimization, GPU_Computing
Last Updated: 2026-02-07 08:40 GMT

Overview

Technique for automatically selecting optimal Triton kernel configurations for Mixture-of-Experts grouped GEMM operations through empirical benchmarking and persistent caching.

Description

MoE Kernel Autotuning addresses the challenge that optimal GPU kernel configurations (tile sizes, warp counts, pipeline stages, TMA usage) vary significantly across different model architectures and GPU hardware. Rather than using fixed configurations, the system generates a pruned search space of candidate configurations, benchmarks each on the target hardware with representative data, selects the fastest, and caches the result for reuse across sessions.

Usage

Apply this principle when deploying MoE models on new GPU hardware or with new model architectures, where the default kernel configurations may be suboptimal. Autotuning runs once, at the start of the first training run, and the results are persisted for reuse in later sessions.
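The run-once-then-reuse behavior can be illustrated with a small load-or-tune helper. This is a minimal sketch using only the Python standard library, not the project's actual cache code: the names `cache_key` and `load_or_tune`, the JSON file layout, and the example model parameters are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

def cache_key(model_params, device_capability):
    """Build a stable string key from model parameters and device capability."""
    parts = [f"{name}={model_params[name]}" for name in sorted(model_params)]
    parts.append("sm" + "".join(str(x) for x in device_capability))
    return "|".join(parts)

def load_or_tune(cache_path, key, tune_fn):
    """Return the cached config for `key`; call `tune_fn` only on a cache miss."""
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if key not in cache:
        cache[key] = tune_fn()  # the expensive benchmarking step
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[key]

# Hypothetical usage: a stand-in tuner that records how often it is called.
tune_calls = []

def fake_tune():
    tune_calls.append(1)
    return {"BLOCK_M": 64, "BLOCK_N": 128, "BLOCK_K": 32,
            "num_warps": 4, "num_stages": 3}

key = cache_key({"num_experts": 8, "hidden_size": 4096}, (9, 0))
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "moe_autotune.json"
    first = load_or_tune(path, key, fake_tune)   # miss: runs the tuner
    second = load_or_tune(path, key, fake_tune)  # hit: read back from disk
```

Including the device capability in the key means a cache file carried over to a different GPU generation misses and triggers a fresh tuning run instead of reusing a config tuned for other hardware.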

Theoretical Basis

The autotuning process follows a generate-prune-benchmark-cache pipeline:

  1. Configuration generation: Combinatorial expansion of block sizes, warps, stages
  2. Constraint pruning: Remove configs that exceed the device's shared memory or that combine incompatible TMA/permutation settings
  3. Empirical benchmarking: Time each remaining config on representative dummy data
  4. Persistent caching: Store winning configs keyed by model parameters and device capability

Pseudo-code Logic:

# Abstract autotuning algorithm
configs = generate_all_configs(block_sizes, warps, stages)
configs = prune_invalid(configs, device_smem, problem_size)
best = benchmark(configs, dummy_data)  # Triton handles this
cache_to_disk(cache_key(model_params, device), best)
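The generation and benchmarking steps of the pseudo-code can be made concrete in plain Python. The sketch below is illustrative only: as the comment above notes, Triton's own autotuner performs the timing in practice. The config dictionary layout, the `measure` hook, and the stand-in cost model are assumptions introduced here so the example runs without a GPU.

```python
import itertools
import time

CONFIG_KEYS = ("BLOCK_M", "BLOCK_N", "BLOCK_K", "num_warps", "num_stages")

def generate_all_configs(block_ms, block_ns, block_ks, warps, stages):
    """Combinatorial expansion: one dict per point in the search grid."""
    grid = itertools.product(block_ms, block_ns, block_ks, warps, stages)
    return [dict(zip(CONFIG_KEYS, combo)) for combo in grid]

def time_once(run, config):
    """Wall-clock a single call of run(config)."""
    start = time.perf_counter()
    run(config)
    return time.perf_counter() - start

def benchmark(configs, run, measure=time_once, reps=5):
    """Return the config whose best-of-`reps` measurement is smallest."""
    def best_time(config):
        return min(measure(run, config) for _ in range(reps))
    return min(configs, key=best_time)

def fake_cost(run, config):
    """Deterministic stand-in for a timing measurement (hypothetical)."""
    return abs(config["BLOCK_M"] - 64) + config["num_stages"]

# Hypothetical search grid: 3 x 1 x 1 x 1 x 2 = 6 candidates.
configs = generate_all_configs([32, 64, 128], [64], [32], [4], [2, 3])
# Placeholder for prune_invalid; a real predicate would check SMEM and tokens.
configs = [c for c in configs if c["BLOCK_M"] <= 128]
best = benchmark(configs, run=None, measure=fake_cost)
```

Taking the minimum over several repetitions is one common way to reduce timing noise; a real GPU benchmark would also need warm-up runs and device synchronization around each measurement.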

Key constraints:

  • Shared memory: $\text{SMEM} = \text{stages} \times K \times (M + N) \times \text{dtype\_size}$
  • Block size vs. tokens: $M_{\text{block}} \leq 2 \times \text{tokens\_per\_expert}$
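The two constraints above translate directly into a pruning predicate. The sketch below assumes that M, N, and K in the shared-memory formula are the tile (block) dimensions of a candidate config and that dtype_size is in bytes. The function names, the config dictionary layout, and the 96 KiB limit are hypothetical values chosen for illustration, not figures from any specific GPU.

```python
def smem_bytes(block_m, block_n, block_k, num_stages, dtype_size):
    """Estimated shared memory: stages * K * (M + N) * dtype_size."""
    return num_stages * block_k * (block_m + block_n) * dtype_size

def is_valid(config, smem_limit, tokens_per_expert, dtype_size=2):
    """Keep a config only if it fits in SMEM and its M tile is not oversized."""
    fits = smem_bytes(config["BLOCK_M"], config["BLOCK_N"], config["BLOCK_K"],
                      config["num_stages"], dtype_size) <= smem_limit
    not_oversized = config["BLOCK_M"] <= 2 * tokens_per_expert
    return fits and not_oversized

# Worked example with 2-byte elements and a hypothetical 96 KiB limit.
SMEM_LIMIT = 96 * 1024
deep = {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64, "num_stages": 4}
shallow = {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64, "num_stages": 2}

deep_bytes = smem_bytes(128, 128, 64, 4, 2)      # 4 * 64 * 256 * 2 = 131072
deep_ok = is_valid(deep, SMEM_LIMIT, tokens_per_expert=256)        # too much SMEM
shallow_ok = is_valid(shallow, SMEM_LIMIT, tokens_per_expert=256)  # 65536 bytes fits
sparse_ok = is_valid(shallow, SMEM_LIMIT, tokens_per_expert=32)    # 128 > 2 * 32
```

Halving the pipeline depth from four stages to two halves the estimated footprint to 65,536 bytes, which fits the assumed limit. The same tile is still rejected when each expert receives only 32 tokens, because the M tile of 128 exceeds 2 × 32.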
