
Heuristic: mlc-ai/mlc-llm BLAS Dispatch Decision

From Leeroopedia

Knowledge Sources
Domains Optimization, GPU_Acceleration
Last Updated 2026-02-09 19:00 GMT

Overview

cuBLAS/hipBLAS GEMM dispatch is automatically restricted to unquantized and FP8 models, and excludes single-batch decode operations to maximize decode throughput.

Description

The BLAS dispatch compiler pass selectively routes matrix multiplication operations to cuBLAS (CUDA) or hipBLAS (ROCm) hardware-accelerated libraries. Two important filtering rules apply: (1) cuBLAS is only activated for models using no quantization (`q0f16`, `q0bf16`, `q0f32`) or FP8 quantization (`e4m3`, `e5m2`), because INT4/INT3 quantized models use specialized dequantize-fused kernels that are faster than cuBLAS for those patterns; (2) the BLAS dispatch intentionally excludes single-batch decode functions, routing only batch decode and all other operations to cuBLAS. This is because single-batch decode is a GEMV (matrix-vector) operation where TIR-generated kernels outperform cuBLAS.

Usage

Apply this heuristic when choosing between quantization modes, or when diagnosing why cuBLAS is not used for a specific configuration. If you see no cuBLAS dispatch despite enabling it, check the quantization mode first.
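The eligibility check above can be sketched as a standalone predicate. This is an illustrative re-implementation of the filtering rule described in this page, not the actual mlc-llm API; the function name `cublas_eligible` and the sample quantization-mode strings are assumptions for the example.

```python
# Sketch of the quantization/target filter (standalone, hypothetical helper):
# cuBLAS GEMM dispatch only applies on CUDA/ROCm targets with an
# unquantized (q0*) or FP8 (e4m3/e5m2) quantization mode.
UNQUANTIZED = {"q0f16", "q0bf16", "q0f32"}

def cublas_eligible(target_kind: str, quantization_name: str) -> bool:
    if target_kind not in ("cuda", "rocm"):
        return False
    return (quantization_name in UNQUANTIZED
            or "e4m3" in quantization_name
            or "e5m2" in quantization_name)

print(cublas_eligible("cuda", "q0f16"))    # True: unquantized FP16
print(cublas_eligible("cuda", "q4f16_1"))  # False: INT4 uses fused kernels
```

A quick sanity check like this tells you immediately whether a given model configuration could ever reach the cuBLAS path, before digging into compiler flags.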

The Insight (Rule of Thumb)

  • Action: Enable `cublas_gemm=1` only when using unquantized (FP16/BF16/FP32) or FP8 quantized models.
  • Value: cuBLAS is auto-disabled for INT4/INT3/AWQ quantized models.
  • Trade-off: cuBLAS provides optimized GEMM for large batch sizes but adds library overhead. For single-batch decode (GEMV), TIR kernels are faster.
  • ROCm: hipBLAS is used instead of cuBLAS with the same dispatch logic.

Reasoning

For quantized models (INT4, INT3), the computation pattern is dequantize-then-matmul, which is fused into a single kernel by TVM's `FuseDequantizeMatmulEwise` pass. This fused kernel avoids the intermediate materialization of dequantized weights, saving memory bandwidth. cuBLAS cannot express this fused pattern, so using it would require separate dequantization and matmul steps, which is slower.
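The memory-bandwidth argument can be illustrated numerically. The sketch below is a conceptual NumPy model of INT4-style dequantize-then-matmul, not mlc-llm or TVM code; the shapes, scale, and zero-point are arbitrary example values. The unfused path materializes the full dequantized weight matrix; a fused kernel instead dequantizes each tile only as the matmul consumes it, so `w_dequant` never exists in memory.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8)).astype(np.float32)    # activations
q = rng.integers(0, 16, size=(8, 8)).astype(np.int8)  # 4-bit weight codes
scale, zero = 0.1, 8.0                                # per-tensor quant params

# Unfused: dequantize the whole weight matrix first (extra memory round-trip).
w_dequant = (q.astype(np.float32) - zero) * scale
y_unfused = x @ w_dequant

# "Fused" (conceptually): dequantize each column only when it is consumed,
# so no full dequantized matrix is ever written out.
y_fused = np.stack(
    [x @ ((q[:, j].astype(np.float32) - zero) * scale) for j in range(8)],
    axis=-1,
)

assert np.allclose(y_unfused, y_fused)  # same result, less memory traffic
```

Both paths compute identical results; the fused form wins because weight bytes are read once at 4 bits each instead of being expanded to FP16/FP32 and re-read. cuBLAS can only consume the already-dequantized matrix, forcing the unfused path.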

For single-batch decode, the operation is a GEMV (matrix-vector multiply) where the batch dimension M=1. cuBLAS GEMM is optimized for large M values; for M=1, TIR-generated GEMV kernels (optimized by Dlight's GEMV schedule) are more efficient.
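The shape argument is simple to see concretely. In this sketch (illustrative NumPy, with arbitrary example dimensions), a single-batch decode step multiplies a `(1, K)` activation by a `(K, N)` weight, which is numerically just a matrix-vector product:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32)).astype(np.float32)  # (K, N) weight
x = rng.standard_normal((1, 64)).astype(np.float32)   # M = 1: one decode token

y_as_gemm = x @ W      # shape (1, 32): what a GEMM dispatch would compute
y_as_gemv = x[0] @ W   # shape (32,): the same numbers as a GEMV

assert np.allclose(y_as_gemm[0], y_as_gemv)
```

With M=1 there is no row-tile reuse for cuBLAS GEMM kernels to exploit, which is why a dedicated GEMV schedule (Dlight's, in this codebase) is the better fit.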

# From blas_dispatch.py:36-40 - Exclude single batch decode from cuBLAS
model_names = [
    gv.name_hint for gv, func in mod.functions.items()
    if isinstance(func, relax.Function)
]
# exclude single batch decode
model_names = [name for name in model_names if "batch" in name or "decode" not in name]
# From compiler_flags.py:103-113 - cuBLAS only for unquantized/FP8
def _cublas_gemm(self, target, quantization) -> bool:
    if target.kind.name not in ["cuda", "rocm"]:
        return False
    if not (quantization.name in ["q0f16", "q0bf16", "q0f32"]
            or "e4m3" in quantization.name or "e5m2" in quantization.name):
        return False
    return self.cublas_gemm
