
Implementation:Sgl project Sglang Benchmark MoE ROCm

From Leeroopedia


Knowledge Sources
Domains: Performance Tuning, AMD ROCm, MoE
Last Updated: 2026-02-10 00:00 GMT

Overview

A benchmarking script that finds optimal Triton kernel configurations for Mixture of Experts (MoE) layers on AMD ROCm GPUs through exhaustive grid search and pruning.

Description

benchmark_moe_rocm.py automates the process of tuning fused MoE kernel configurations for AMD GPUs. It loads model configurations from HuggingFace (e.g., DeepSeek-V3, Grok-1) to determine MoE dimensions such as hidden_size, intermediate_size, num_experts, and num_experts_per_tok.

The script generates a comprehensive grid of Triton tuning parameters including:

  • BLOCK_SIZE_M/N/K: Tile dimensions (16-256 for M/N, 32-256 for K)
  • num_warps: 1, 2, 4, or 8
  • num_stages: Pipeline stages (fixed at 2)
  • waves_per_eu: Wave occupancy (0, 1, 2, 4, 8)
  • matrix_instr_nonkdim: MFMA instruction size (16)
  • kpack: Packing factor (1 or 2)
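The parameter grid above can be enumerated as a Cartesian product. The sketch below uses the value ranges listed here; the exact lists in benchmark_moe_rocm.py may differ slightly.

```python
# Enumerate the Triton tuning grid described above (illustrative values).
import itertools

block_m = [16, 32, 64, 128, 256]
block_n = [16, 32, 64, 128, 256]
block_k = [32, 64, 128, 256]
num_warps = [1, 2, 4, 8]
num_stages = [2]           # fixed at 2
waves_per_eu = [0, 1, 2, 4, 8]
matrix_instr_nonkdim = [16]  # MFMA instruction size
kpack = [1, 2]

configs = [
    {
        "BLOCK_SIZE_M": m, "BLOCK_SIZE_N": n, "BLOCK_SIZE_K": k,
        "num_warps": w, "num_stages": s, "waves_per_eu": wpe,
        "matrix_instr_nonkdim": mfma, "kpack": kp,
    }
    for m, n, k, w, s, wpe, mfma, kp in itertools.product(
        block_m, block_n, block_k, num_warps, num_stages,
        waves_per_eu, matrix_instr_nonkdim, kpack
    )
]
print(len(configs))  # 4000 candidate configurations before pruning
```

With these value lists the raw grid already contains 4,000 candidates, which is why the pruning phase described next matters.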

A pruning phase eliminates invalid or suboptimal configurations based on constraints like shared memory limits (64KB LDS), minimum thread utilization, and GEMM size heuristics. The surviving configs are benchmarked via CUDA event timing with warmup runs, and the best configuration per batch size is written to a JSON file for runtime use by the fused_moe kernel.
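A minimal sketch of one such pruning rule, assuming fp16 operands (2 bytes per element) and the 64 KB LDS budget mentioned above. The real prune_configs applies additional heuristics (thread utilization, GEMM-size checks); the function name here is hypothetical.

```python
# Illustrative LDS-based pruning: drop configs whose A and B tiles
# cannot both fit in a 64 KB shared-memory budget.
def prune_by_lds(configs, lds_limit=64 * 1024, elem_bytes=2):
    kept = []
    for cfg in configs:
        a_tile = cfg["BLOCK_SIZE_M"] * cfg["BLOCK_SIZE_K"]  # A tile elements
        b_tile = cfg["BLOCK_SIZE_K"] * cfg["BLOCK_SIZE_N"]  # B tile elements
        if (a_tile + b_tile) * elem_bytes <= lds_limit:
            kept.append(cfg)
    return kept
```

For example, a 256x256x256 tiling needs 256 KB of LDS for its two fp16 tiles and is discarded, while a 16x16x32 tiling easily fits.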

Usage

Run the script from the command line, specifying a model and one or more batch sizes. The output JSON file is consumed at runtime by SGLang's fused MoE Triton kernel to select optimal tile configurations.

Code Reference

Source Location

Signature

def main(model, tp_size, dtype: str, batches)

def prune_configs(M, N, K, configs)

def run_grid(bs, model, method, tp_size, dtype: str)

def run_timing(
    num_calls: int, bs: int, d_model: int,
    num_total_experts: int, top_k: int, tp_size: int,
    model_intermediate_size: int, method, config,
    dtype: str, hidden_states_dtype
) -> float
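run_timing follows the standard CUDA-event timing pattern (which PyTorch exposes unchanged on ROCm via its HIP backend): warmup iterations, then timed calls bracketed by start/end events. A minimal sketch, where the helper name and the callable are stand-ins, not the script's API:

```python
# Sketch of CUDA-event timing with warmup; requires a CUDA/ROCm-enabled
# PyTorch build. `fn` stands in for the fused_moe kernel invocation.
import torch

def time_kernel(fn, num_warmup=10, num_calls=100):
    for _ in range(num_warmup):
        fn()                      # warm caches and autotuner state
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(num_calls):
        fn()
    end.record()
    torch.cuda.synchronize()      # wait for the end event to complete
    return start.elapsed_time(end) / num_calls  # mean milliseconds per call
```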

Import

from sglang.srt.layers.moe.fused_moe_triton.fused_moe import (
    fused_moe,
    get_config_file_name,
)

I/O Contract

Inputs

Name Type Required Description
--model string No HuggingFace model ID (default: "hpcai-tech/grok-1")
--dtype string No Data type: float8, float16, or bfloat16 (default: auto)
--tp-size int No Tensor parallelism size (default: 2)
-b / --batches string Yes Comma-separated batch sizes to benchmark

Outputs

Name Type Description
Config JSON file JSON file Best Triton kernel configuration per batch size, written to disk
Console output text Progress bars, timing results, and best configuration per batch
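For orientation, the config JSON maps each benchmarked batch size to its winning kernel parameters. The shape below is illustrative; the exact key set written by the script may differ:

```json
{
  "1":  { "BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32, "BLOCK_SIZE_K": 64,
          "num_warps": 2, "num_stages": 2, "waves_per_eu": 0,
          "matrix_instr_nonkdim": 16, "kpack": 2 },
  "64": { "BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
          "num_warps": 4, "num_stages": 2, "waves_per_eu": 0,
          "matrix_instr_nonkdim": 16, "kpack": 2 }
}
```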

Usage Examples

Benchmark MoE for DeepSeek-V3

python benchmark_moe_rocm.py \
    --model deepseek-ai/DeepSeek-V3 \
    --tp-size 8 \
    --dtype bfloat16 \
    -b 1,2,4,8,16,32,64,128,256

Benchmark with FP8 Quantization

python benchmark_moe_rocm.py \
    --model hpcai-tech/grok-1 \
    --tp-size 2 \
    --dtype float8 \
    -b 1,4,16,64
