Implementation: Sglang Benchmark MoE ROCm
| Knowledge Sources | Details |
|---|---|
| Domains | Performance Tuning, AMD ROCm, MoE |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A benchmarking script that finds optimal Triton kernel configurations for Mixture of Experts (MoE) layers on AMD ROCm GPUs through exhaustive grid search and pruning.
Description
benchmark_moe_rocm.py automates the process of tuning fused MoE kernel configurations for AMD GPUs. It loads model configurations from HuggingFace (e.g., DeepSeek-V3, Grok-1) to determine MoE dimensions such as hidden_size, intermediate_size, num_experts, and num_experts_per_tok.
The script generates a comprehensive grid of Triton tuning parameters including:
- BLOCK_SIZE_M/N/K: Tile dimensions (16-256 for M/N, 32-256 for K)
- num_warps: 1, 2, 4, or 8
- num_stages: Pipeline stages (fixed at 2)
- waves_per_eu: Wave occupancy (0, 1, 2, 4, 8)
- matrix_instr_nonkdim: MFMA instruction size (16)
- kpack: Packing factor (1 or 2)
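The Cartesian product of the ranges above can be sketched as follows. This is a hypothetical illustration of the grid, not the script's actual enumeration code; the script may apply additional filters while generating candidates.

```python
# Sketch of the tuning grid described above (assumed, not copied from the
# script). Parameter names follow Triton kernel-config conventions.
import itertools

block_m = [16, 32, 64, 128, 256]
block_n = [16, 32, 64, 128, 256]
block_k = [32, 64, 128, 256]
num_warps = [1, 2, 4, 8]
num_stages = [2]             # fixed at 2 per the list above
waves_per_eu = [0, 1, 2, 4, 8]
matrix_instr_nonkdim = [16]  # MFMA instruction size
kpack = [1, 2]

configs = [
    {
        "BLOCK_SIZE_M": m, "BLOCK_SIZE_N": n, "BLOCK_SIZE_K": k,
        "num_warps": w, "num_stages": s, "waves_per_eu": wpe,
        "matrix_instr_nonkdim": mfma, "kpack": kp,
    }
    for m, n, k, w, s, wpe, mfma, kp in itertools.product(
        block_m, block_n, block_k, num_warps, num_stages,
        waves_per_eu, matrix_instr_nonkdim, kpack,
    )
]
# 5 * 5 * 4 * 4 * 1 * 5 * 1 * 2 = 4000 candidates before pruning
```

With these ranges the raw grid holds 4000 candidate configurations, which is why the pruning phase below matters.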
A pruning phase eliminates invalid or suboptimal configurations based on constraints like shared memory limits (64KB LDS), minimum thread utilization, and GEMM size heuristics. The surviving configs are benchmarked via CUDA event timing with warmup runs, and the best configuration per batch size is written to a JSON file for runtime use by the fused_moe kernel.
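The pruning constraints can be illustrated in isolation. The check below is a hedged sketch: the exact heuristics inside prune_configs may differ, but the 64KB LDS budget and the GEMM-size test follow the description above.

```python
# Assumed sketch of the pruning constraints; not the script's prune_configs.
LDS_LIMIT_BYTES = 64 * 1024  # 64KB local data share per compute unit

def keep_config(cfg, M, N, K, elem_bytes=2):
    # LDS estimate: an A tile (M x K) plus a B tile (K x N) must fit,
    # once per pipeline stage (elem_bytes=2 assumes fp16/bf16 tiles).
    lds = (cfg["BLOCK_SIZE_M"] * cfg["BLOCK_SIZE_K"]
           + cfg["BLOCK_SIZE_K"] * cfg["BLOCK_SIZE_N"]) \
          * elem_bytes * cfg["num_stages"]
    if lds > LDS_LIMIT_BYTES:
        return False
    # GEMM size heuristic: a tile larger than the problem wastes threads.
    if cfg["BLOCK_SIZE_M"] > M or cfg["BLOCK_SIZE_N"] > N:
        return False
    return True

def prune(configs, M, N, K):
    return [c for c in configs if keep_config(c, M, N, K)]
```

For small batch sizes the M-dimension check alone removes most large-tile configs, so the surviving set per batch size is far smaller than the full grid.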
Usage
Run from the command line targeting a specific model and one or more batch sizes. The output JSON file is consumed at runtime by SGLang's fused MoE Triton kernel to select optimal tile configurations.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: 3rdparty/amd/tuning/benchmark_moe_rocm.py
- Lines: 1-381
Signature
def main(model, tp_size, dtype: str, batches)
def prune_configs(M, N, K, configs)
def run_grid(bs, model, method, tp_size, dtype: str)
def run_timing(
num_calls: int, bs: int, d_model: int,
num_total_experts: int, top_k: int, tp_size: int,
model_intermediate_size: int, method, config,
dtype: str, hidden_states_dtype
) -> float
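The warmup-then-measure pattern behind run_timing can be sketched as below. The real script times the fused_moe call with CUDA events (torch.cuda.Event); time.perf_counter stands in here so the sketch runs without a GPU, and the harness shape is an assumption, not the script's code.

```python
# Assumed sketch of a warmup-then-measure timing harness.
import time

def time_kernel(fn, num_calls=10, num_warmup=3):
    # Warmup runs exclude one-time costs (JIT compilation, cache fills)
    # from the measurement.
    for _ in range(num_warmup):
        fn()
    start = time.perf_counter()
    for _ in range(num_calls):
        fn()
    end = time.perf_counter()
    # Mean latency per call in seconds; run_timing returns a float
    # with the same meaning.
    return (end - start) / num_calls

latency = time_kernel(lambda: sum(range(1000)))
```

On GPU, CUDA events are preferred over wall-clock timing because kernel launches are asynchronous; events record completion on the device stream.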
Import
from sglang.srt.layers.moe.fused_moe_triton.fused_moe import (
fused_moe,
get_config_file_name,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model | string | No | HuggingFace model ID (default: "hpcai-tech/grok-1") |
| --dtype | string | No | Data type: float8, float16, or bfloat16 (default: auto) |
| --tp-size | int | No | Tensor parallelism size (default: 2) |
| -b / --batches | string | Yes | Comma-separated batch sizes to benchmark |
Outputs
| Name | Type | Description |
|---|---|---|
| Config JSON file | JSON file | Best Triton kernel configuration per batch size, written to disk |
| Console output | text | Progress bars, timing results, and best configuration per batch |
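The config JSON maps each benchmarked batch size (as a string key) to its winning kernel configuration. The fragment below is illustrative only: the values are made up, and the exact file name is produced by get_config_file_name.

```json
{
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 32,
    "BLOCK_SIZE_K": 64,
    "num_warps": 2,
    "num_stages": 2,
    "waves_per_eu": 0,
    "matrix_instr_nonkdim": 16,
    "kpack": 2
  }
}
```

At runtime the fused_moe kernel looks up the entry whose batch size is closest to the current input, so benchmarking a denser set of batch sizes yields better coverage.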
Usage Examples
Benchmark MoE for DeepSeek-V3
python benchmark_moe_rocm.py \
--model deepseek-ai/DeepSeek-V3 \
--tp-size 8 \
--dtype bfloat16 \
-b 1,2,4,8,16,32,64,128,256
Benchmark with FP8 Quantization
python benchmark_moe_rocm.py \
--model hpcai-tech/grok-1 \
--tp-size 2 \
--dtype float8 \
-b 1,4,16,64