
Implementation:Sgl project Sglang Benchmark MoE ROCm

From Leeroopedia


Knowledge Sources
Domains: Performance Tuning, AMD ROCm, MoE
Last Updated: 2026-02-10 00:00 GMT

Overview

A benchmarking script that finds optimal Triton kernel configurations for Mixture of Experts (MoE) layers on AMD ROCm GPUs through exhaustive grid search and pruning.

Description

benchmark_moe_rocm.py automates the process of tuning fused MoE kernel configurations for AMD GPUs. It loads model configurations from HuggingFace (e.g., DeepSeek-V3, Grok-1) to determine MoE dimensions such as hidden_size, intermediate_size, num_experts, and num_experts_per_tok.

The script generates a comprehensive grid of Triton tuning parameters including:

  • BLOCK_SIZE_M/N/K: Tile dimensions (16-256 for M/N, 32-256 for K)
  • num_warps: 1, 2, 4, or 8
  • num_stages: Pipeline stages (fixed at 2)
  • waves_per_eu: Wave occupancy (0, 1, 2, 4, 8)
  • matrix_instr_nonkdim: MFMA instruction size (16)
  • kpack: Packing factor (1 or 2)
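The parameter grid above can be enumerated as a Cartesian product. The sketch below uses the value ranges listed here; the exact lists in benchmark_moe_rocm.py may differ slightly.

```python
# Enumerate the Triton tuning grid described above (illustrative values).
import itertools

block_m = [16, 32, 64, 128, 256]
block_n = [16, 32, 64, 128, 256]
block_k = [32, 64, 128, 256]
num_warps = [1, 2, 4, 8]
num_stages = [2]           # fixed at 2
waves_per_eu = [0, 1, 2, 4, 8]
matrix_instr_nonkdim = [16]  # MFMA instruction size
kpack = [1, 2]

configs = [
    {
        "BLOCK_SIZE_M": m, "BLOCK_SIZE_N": n, "BLOCK_SIZE_K": k,
        "num_warps": w, "num_stages": s, "waves_per_eu": wpe,
        "matrix_instr_nonkdim": mfma, "kpack": kp,
    }
    for m, n, k, w, s, wpe, mfma, kp in itertools.product(
        block_m, block_n, block_k, num_warps, num_stages,
        waves_per_eu, matrix_instr_nonkdim, kpack
    )
]
print(len(configs))  # 4000 candidate configurations before pruning
```

With these value lists the raw grid already contains 4,000 candidates, which is why the pruning phase described next matters.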

A pruning phase eliminates invalid or suboptimal configurations based on constraints like shared memory limits (64KB LDS), minimum thread utilization, and GEMM size heuristics. The surviving configs are benchmarked via CUDA event timing with warmup runs, and the best configuration per batch size is written to a JSON file for runtime use by the fused_moe kernel.
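A minimal sketch of one such pruning rule, assuming fp16 operands (2 bytes per element) and the 64 KB LDS budget mentioned above. The real prune_configs applies additional heuristics (thread utilization, GEMM-size checks); the function name here is hypothetical.

```python
# Illustrative LDS-based pruning: drop configs whose A and B tiles
# cannot both fit in a 64 KB shared-memory budget.
def prune_by_lds(configs, lds_limit=64 * 1024, elem_bytes=2):
    kept = []
    for cfg in configs:
        a_tile = cfg["BLOCK_SIZE_M"] * cfg["BLOCK_SIZE_K"]  # A tile elements
        b_tile = cfg["BLOCK_SIZE_K"] * cfg["BLOCK_SIZE_N"]  # B tile elements
        if (a_tile + b_tile) * elem_bytes <= lds_limit:
            kept.append(cfg)
    return kept
```

For example, a 256x256x256 tiling needs 256 KB of LDS for its two fp16 tiles and is discarded, while a 16x16x32 tiling easily fits.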

Usage

Run the script from the command line, specifying a model and one or more batch sizes. The output JSON file is consumed at runtime by SGLang's fused MoE Triton kernel to select optimal tile configurations.

Code Reference

Source Location

Signature

def main(model, tp_size, dtype: str, batches)

def prune_configs(M, N, K, configs)

def run_grid(bs, model, method, tp_size, dtype: str)

def run_timing(
    num_calls: int, bs: int, d_model: int,
    num_total_experts: int, top_k: int, tp_size: int,
    model_intermediate_size: int, method, config,
    dtype: str, hidden_states_dtype
) -> float
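run_timing follows the standard CUDA-event timing pattern (which PyTorch exposes unchanged on ROCm via its HIP backend): warmup iterations, then timed calls bracketed by start/end events. A minimal sketch, where the helper name and the callable are stand-ins, not the script's API:

```python
# Sketch of CUDA-event timing with warmup; requires a CUDA/ROCm-enabled
# PyTorch build. `fn` stands in for the fused_moe kernel invocation.
import torch

def time_kernel(fn, num_warmup=10, num_calls=100):
    for _ in range(num_warmup):
        fn()                      # warm caches and autotuner state
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(num_calls):
        fn()
    end.record()
    torch.cuda.synchronize()      # wait for the end event to complete
    return start.elapsed_time(end) / num_calls  # mean milliseconds per call
```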

Import

from sglang.srt.layers.moe.fused_moe_triton.fused_moe import (
    fused_moe,
    get_config_file_name,
)

I/O Contract

Inputs

Name Type Required Description
--model string No HuggingFace model ID (default: "hpcai-tech/grok-1")
--dtype string No Data type: float8, float16, or bfloat16 (default: auto)
--tp-size int No Tensor parallelism size (default: 2)
-b / --batches string Yes Comma-separated batch sizes to benchmark

Outputs

Name Type Description
Config JSON file JSON file Best Triton kernel configuration per batch size, written to disk
Console output text Progress bars, timing results, and best configuration per batch
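For orientation, the config JSON maps each benchmarked batch size to its winning kernel parameters. The shape below is illustrative; the exact key set written by the script may differ:

```json
{
  "1":  { "BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32, "BLOCK_SIZE_K": 64,
          "num_warps": 2, "num_stages": 2, "waves_per_eu": 0,
          "matrix_instr_nonkdim": 16, "kpack": 2 },
  "64": { "BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
          "num_warps": 4, "num_stages": 2, "waves_per_eu": 0,
          "matrix_instr_nonkdim": 16, "kpack": 2 }
}
```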

Usage Examples

Benchmark MoE for DeepSeek-V3

python benchmark_moe_rocm.py \
    --model deepseek-ai/DeepSeek-V3 \
    --tp-size 8 \
    --dtype bfloat16 \
    -b 1,2,4,8,16,32,64,128,256

Benchmark with FP8 Quantization

python benchmark_moe_rocm.py \
    --model hpcai-tech/grok-1 \
    --tp-size 2 \
    --dtype float8 \
    -b 1,4,16,64
