# Heuristic: mlc-ai/mlc-llm BLAS Dispatch Decision
| Knowledge Sources | |
|---|---|
| Domains | Optimization, GPU_Acceleration |
| Last Updated | 2026-02-09 19:00 GMT |
## Overview

cuBLAS/hipBLAS GEMM dispatch is automatically restricted to unquantized and FP8 models, and single-batch decode functions are excluded from BLAS dispatch because TIR-generated GEMV kernels give higher decode throughput.
## Description

The BLAS dispatch compiler pass selectively routes matrix multiplication operations to the cuBLAS (CUDA) or hipBLAS (ROCm) hardware-accelerated libraries. Two filtering rules apply:

1. cuBLAS is only activated for models using no quantization (`q0f16`, `q0bf16`, `q0f32`) or FP8 quantization (`e4m3`, `e5m2`). INT4/INT3 quantized models use specialized dequantize-fused kernels that are faster than cuBLAS for those patterns.
2. The dispatch intentionally excludes single-batch decode functions, routing only batch decode and all other operations to cuBLAS. Single-batch decode is a GEMV (matrix-vector) operation for which TIR-generated kernels outperform cuBLAS.
## Usage

Apply this heuristic when choosing between quantization modes and when diagnosing why cuBLAS is not used for a specific configuration. If you see no cuBLAS dispatch despite enabling it, check the quantization mode first.
## The Insight (Rule of Thumb)
- Action: Enable `cublas_gemm=1` only when using unquantized (FP16/BF16/FP32) or FP8 quantized models.
- Value: cuBLAS is auto-disabled for INT4/INT3/AWQ quantized models.
- Trade-off: cuBLAS provides optimized GEMM for large batch sizes but adds library overhead. For single-batch decode (GEMV), TIR kernels are faster.
- ROCm: hipBLAS is used instead of cuBLAS with the same dispatch logic.
## Reasoning
For quantized models (INT4, INT3), the computation pattern is dequantize-then-matmul, which is fused into a single kernel by TVM's `FuseDequantizeMatmulEwise` pass. This fused kernel avoids the intermediate materialization of dequantized weights, saving memory bandwidth. cuBLAS cannot express this fused pattern, so using it would require separate dequantization and matmul steps, which is slower.
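The bandwidth cost of the unfused path can be illustrated with a small numpy sketch. This is illustrative only: the layer shape, group size, and scale layout below are simplified assumptions, not MLC's actual weight packing.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, group = 4096, 4096, 32  # hypothetical layer shape and quantization group size

# Hypothetical group-quantized weight: int4 values stored one per int8,
# plus one fp16 scale per group of 32 elements along the reduction axis.
qweight = rng.integers(-8, 8, size=(n, k), dtype=np.int8)
scales = rng.random((n, k // group)).astype(np.float16)
x = rng.random((1, k)).astype(np.float16)

# Unfused path (what cuBLAS would require): materialize the full
# dequantized fp16 weight in memory, then run a standard GEMM over it.
w_dequant = qweight.astype(np.float16) * np.repeat(scales, group, axis=1)
y = x.astype(np.float32) @ w_dequant.astype(np.float32).T

# The materialized intermediate is ~1.9x the bytes of the quantized
# storage here (with real int4 packing of 2 values per byte, ~3.6x).
# A fused kernel dequantizes tiles on the fly and never writes w_dequant.
ratio = w_dequant.nbytes / (qweight.nbytes + scales.nbytes)
print(round(ratio, 2))
```

The fused kernel computes the same product while keeping each dequantized tile in registers or shared memory, which is exactly the saving the `FuseDequantizeMatmulEwise` pass targets.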
For single-batch decode, the operation is a GEMV (matrix-vector multiply) where the batch dimension M=1. cuBLAS GEMM is optimized for large M values; for M=1, TIR-generated GEMV kernels (optimized by Dlight's GEMV schedule) are more efficient.
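The two regimes can be sketched with toy shapes (dimensions below are illustrative assumptions, not model values):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, out_features = 8, 16  # toy dimensions

w = rng.random((out_features, hidden), dtype=np.float32)
x_decode = rng.random((1, hidden), dtype=np.float32)   # single-batch decode: M = 1
x_batch = rng.random((32, hidden), dtype=np.float32)   # batch decode: M = 32

# M = 1 is effectively a GEMV: each output element is one dot product, and
# the kernel is memory-bound on streaming w exactly once.
y_decode = x_decode @ w.T   # shape (1, out_features)

# M = 32 is a true GEMM: each loaded tile of w is reused across 32 rows of
# x, which is the data-reuse regime cuBLAS tiling is optimized for.
y_batch = x_batch @ w.T     # shape (32, out_features)
```

With no reuse across rows at M=1, the hand-tuned tiling in cuBLAS buys little, which is why Dlight's GEMV schedule wins there.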
```python
# From blas_dispatch.py:36-40 - exclude single-batch decode from cuBLAS
model_names = [
    gv.name_hint
    for gv, func in mod.functions.items()
    if isinstance(func, relax.Function)
]
# Exclude single-batch decode: keep a function if its name contains
# "batch" or does not contain "decode".
model_names = [name for name in model_names if "batch" in name or "decode" not in name]
```
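The effect of that filter can be checked on hypothetical Relax function names (the names below are illustrative stand-ins; actual names come from the model's IRModule):

```python
# Hypothetical function names, mirroring mlc-llm's naming convention.
model_names = ["prefill", "decode", "batch_prefill", "batch_decode", "embed"]

# Same filter as blas_dispatch.py: drop only the single-batch decode.
kept = [name for name in model_names if "batch" in name or "decode" not in name]

print(kept)  # everything except the single-batch "decode"
```

Note the filter keeps `batch_decode` (contains "batch") and keeps non-decode functions like `prefill`; only the bare single-batch `decode` is excluded.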
```python
# From compiler_flags.py:103-113 - cuBLAS only for unquantized/FP8.
# (Presumably a nested helper, so `self` refers to the enclosing
# optimization-flags object.)
def _cublas_gemm(target, quantization) -> bool:
    if target.kind.name not in ["cuda", "rocm"]:
        return False
    if not (
        quantization.name in ["q0f16", "q0bf16", "q0f32"]
        or "e4m3" in quantization.name
        or "e5m2" in quantization.name
    ):
        return False
    return self.cublas_gemm
```
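The quantization gate can be exercised standalone with a hedged restatement. `is_blas_eligible` is an illustrative helper, not part of mlc-llm, and the sample quantization names are assumptions:

```python
def is_blas_eligible(target_kind: str, quantization_name: str) -> bool:
    """Illustrative restatement of the _cublas_gemm gate above."""
    if target_kind not in ["cuda", "rocm"]:
        return False
    return (
        quantization_name in ["q0f16", "q0bf16", "q0f32"]
        or "e4m3" in quantization_name
        or "e5m2" in quantization_name
    )

# Unquantized and FP8 modes pass; INT4 group quantization and
# non-CUDA/ROCm targets do not.
print(is_blas_eligible("cuda", "q0f16"))      # unquantized fp16
print(is_blas_eligible("cuda", "q4f16_1"))    # int4 group quantization
print(is_blas_eligible("vulkan", "q0f16"))    # unsupported target
```

This matches the rule of thumb above: enabling `cublas_gemm` has no effect unless both the target and the quantization mode pass this gate.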