Heuristic: mlc-ai/mlc-llm Optimization Level Selection

From Leeroopedia



Knowledge Sources
Domains Optimization, Compilation
Last Updated 2026-02-09 19:00 GMT

Overview

Guide for selecting the appropriate compiler optimization level (O0-O3) based on target GPU architecture, quantization mode, and stability requirements.

Description

MLC-LLM provides four optimization presets (O0, O1, O2, O3) that progressively enable more aggressive compiler optimizations. Each level activates a different combination of FlashInfer attention kernels, cuBLAS GEMM dispatch, CUTLASS, FasterTransformer epilogue fusion, CUDA graphs, and IPC allreduce strategies. The choice of optimization level significantly affects inference throughput, memory usage, and compilation time. O2 is the default JIT level and provides the best stability-performance balance for CUDA GPUs.

Usage

Apply this heuristic when compiling models or launching the inference engine to decide which optimization flags to use. The key decision factors are: target GPU architecture (FlashInfer needs sm_80+), quantization mode (cuBLAS only for unquantized/FP8), and whether stability or maximum performance is the priority.

The Insight (Rule of Thumb)

  • O0 (No optimization): All optimizations disabled. Use for debugging, non-CUDA targets, or when encountering compilation issues.
  • O1 (Conservative): Enables cuBLAS GEMM, FasterTransformer epilogue fusion, and CUTLASS. No FlashInfer or CUDA graphs. Use for older CUDA GPUs (sm_70-sm_75) or when FlashInfer causes issues.
  • O2 (Default/Recommended): Enables FlashInfer, cuBLAS GEMM, CUTLASS, and CUDA graphs. No FasterTransformer fusion or IPC allreduce. Best balance of performance and stability for Ampere+ GPUs.
  • O3 (Extreme): All optimizations enabled including FasterTransformer fusion and AUTO IPC allreduce. May break on some configurations. Use only for maximum throughput when stability is verified.
  • Trade-off: Higher levels increase compilation time and binary size but improve inference throughput. O3 may cause correctness issues on edge cases.
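The rules of thumb above can be condensed into a small selection helper. This is an illustrative sketch only; the function name and signature are assumptions and not part of the mlc_llm API:

```python
def select_opt_level(sm_version: int, quantization: str,
                     prioritize_stability: bool = True) -> str:
    """Pick an optimization preset from the rules of thumb above.

    sm_version: CUDA compute capability as an integer, e.g. 80 for sm_80.
    quantization: quantization mode name, e.g. "q0f16" or "q4f16_1".
    NOTE: illustrative helper only; not part of mlc_llm.
    """
    if sm_version < 70:
        return "O0"  # debugging / very old or unsupported targets
    if sm_version < 80:
        return "O1"  # sm_70-sm_75: no FlashInfer, keep cuBLAS/CUTLASS
    # sm_80+ (Ampere or newer): FlashInfer and CUDA graphs are available.
    # O3 additionally enables FasterTransformer fusion and IPC allreduce,
    # but may break on some configurations, so only choose it when
    # stability has been verified for the workload.
    return "O2" if prioritize_stability else "O3"
```

For example, `select_opt_level(86, "q4f16_1")` returns `"O2"` for an RTX 30-series GPU, while `select_opt_level(75, "q0f16")` falls back to `"O1"` for a Turing card.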

Reasoning

The optimization flags control which compiler passes are applied during the TVM compilation pipeline. FlashInfer provides optimized attention kernels using GPU-native instructions on Ampere+ architectures, offering 2-3x speedup over TIR-based attention. cuBLAS dispatch offloads matrix multiplications to NVIDIA's hand-tuned BLAS library, improving batch decoding throughput. CUDA graphs reduce kernel launch overhead by capturing and replaying entire execution sequences. However, each optimization adds constraints: FlashInfer requires sm_80+, cuBLAS only helps for unquantized/FP8 models, and CUDA graphs may not capture all execution patterns.
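To illustrate the sm_80+ constraint, a FlashInfer feature gate might parse the target architecture string as below. This is a simplified sketch, not the actual MLC-LLM check, which also inspects the target kind and other flags:

```python
import re

def flashinfer_supported(arch: str) -> bool:
    """Return True when the CUDA arch string (e.g. "sm_80", "sm_86")
    denotes Ampere or newer. Simplified illustration of the sm_80+ gate;
    not the actual mlc_llm implementation.
    """
    match = re.fullmatch(r"sm_(\d+)", arch)
    if match is None:
        return False  # non-CUDA or unrecognized architecture string
    return int(match.group(1)) >= 80
```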

# Optimization presets from compiler_flags.py:198-227
OPT_FLAG_PRESET = {
    "O0": OptimizationFlags(flashinfer=False, cublas_gemm=False, cudagraph=False),
    "O1": OptimizationFlags(flashinfer=False, cublas_gemm=True, faster_transformer=True,
                            cudagraph=False, cutlass=True),
    "O2": OptimizationFlags(flashinfer=True, cublas_gemm=True, faster_transformer=False,
                            cudagraph=True, cutlass=True,
                            ipc_allreduce_strategy=IPCAllReduceStrategyType.NONE),
    "O3": OptimizationFlags(flashinfer=True, cublas_gemm=True, faster_transformer=True,
                            cudagraph=True, cutlass=True,
                            ipc_allreduce_strategy=IPCAllReduceStrategyType.AUTO),
}

cuBLAS dispatch is automatically disabled for quantized models, as shown in `compiler_flags.py:103-113`:

def _cublas_gemm(self, target, quantization) -> bool:
    # cuBLAS only applies to CUDA/ROCm targets with unquantized (q0*)
    # or FP8 (e4m3/e5m2) weights; otherwise the flag is forced off.
    if target.kind.name not in ["cuda", "rocm"]:
        return False
    if not (quantization.name in ["q0f16", "q0bf16", "q0f32"]
            or "e4m3" in quantization.name or "e5m2" in quantization.name):
        return False
    return self.cublas_gemm
