Environment: mlc-ai/mlc-llm CUDA GPU Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, GPU_Acceleration, Deep_Learning |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Linux-based NVIDIA GPU environment with CUDA toolkit, cuBLAS, FlashInfer, and optional Triton for high-performance LLM inference compilation and serving.
Description
This environment provides GPU-accelerated inference for MLC-LLM on NVIDIA hardware. It includes the full CUDA toolkit for kernel compilation, cuBLAS/hipBLAS for GEMM dispatch, FlashInfer for optimized attention kernels and GPU sampling, and Triton for FP8 quantized matmul kernels. The environment supports multi-GPU tensor parallelism via IPC allreduce and CUDA graph capture for reduced kernel launch overhead. FlashInfer requires CUDA compute capability >= 80 (Ampere or newer).
Usage
Use this environment for any model compilation targeting CUDA GPUs, REST API serving with GPU acceleration, or Python engine inference. It is the mandatory prerequisite for running the compilation pipeline with optimization level O2 or O3, which enable FlashInfer, cuBLAS GEMM dispatch, CUDA graphs, and CUTLASS kernels.
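As a sketch of the typical workflow, the commands below compile a model library at O2 and serve it. Model paths and the output name are illustrative, and the flag spellings are assumptions based on the current `mlc_llm` CLI; verify with `mlc_llm compile --help` on your installation.

```shell
# Compile a pre-converted MLC model for CUDA at optimization level O2
# (enables FlashInfer, cuBLAS dispatch, and CUDA graphs on supported GPUs)
mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC \
    --device cuda --opt O2 -o ./dist/libs/llama3-cuda.so

# Serve the compiled library over the REST API
mlc_llm serve ./dist/Llama-3-8B-Instruct-q4f16_1-MLC \
    --model-lib ./dist/libs/llama3-cuda.so
```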
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | Windows supported via WSL2 or native DLL |
| Hardware | NVIDIA GPU with CUDA support | FlashInfer requires compute capability >= 80 (A100, RTX 3090+) |
| VRAM | Minimum 4GB | 7B models need ~4-8GB (quantized), 70B models need multi-GPU |
| Disk | 10GB+ SSD | For compiled model libraries and cached JIT artifacts |
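The VRAM figures above follow from simple arithmetic on parameter count and bits per weight. The helper below is a back-of-envelope estimate for the weights alone (KV cache and temporary buffers come on top), not a measurement taken from mlc_llm itself.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB: params * bits / 8 bytes, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# A 7B model at ~4 bits/weight fits in roughly 3.3 GiB of weights,
# consistent with the ~4-8 GB guidance above once runtime buffers are added;
# a 70B model at 4 bits needs ~33 GiB and hence multi-GPU or a very large card.
seven_b = estimate_weight_gib(7e9, 4)
seventy_b = estimate_weight_gib(70e9, 4)
```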
Dependencies
System Packages
- `cuda-toolkit` (CUDA 12.x recommended)
- `cudnn` (for cuDNN acceleration)
- `cmake` < 4.0
- `git`
- `bzip2`
Python Packages
- `torch` (PyTorch, used for weight conversion)
- `apache-tvm-ffi` (TVM FFI bindings, core runtime)
- `flashinfer-python` (Linux only, for FlashInfer attention kernels)
- `ml_dtypes` >= 0.5.1
- `transformers` (for model config and tokenizers)
- `safetensors` (for weight loading)
- `sentencepiece` (for tokenizer support)
- `tiktoken` (for tokenizer support)
Environment Variables
No credentials are required. The following environment variables may be set to configure behavior:
- `MLC_LLM_HOME`: Override the default cache directory for compiled model libraries (default: `~/.cache/mlc_llm`).
- `MLC_JIT_POLICY`: Controls JIT compilation behavior. Values: `ON` (default), `OFF`, `REDO`, `READONLY`.
- `MLC_MULTI_ARCH`: Comma-separated CUDA architectures for multi-arch fatbin compilation. Example: `70,72,75,80,86,87,89,90a`.
- `MLC_TEMP_DIR`: Override temporary directory for compilation artifacts.
- `SKIP_LOADING_MLCLLM_SO`: Set to `1` to skip loading the MLC-LLM shared library.
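These variables are read by the library at startup, so they must be exported before mlc_llm is imported. A minimal sketch, with illustrative values:

```python
import os

# Set configuration before the first `import mlc_llm` in the process;
# the paths and values here are examples, not defaults.
os.environ["MLC_LLM_HOME"] = "/data/mlc-cache"   # custom compiled-library cache
os.environ["MLC_JIT_POLICY"] = "READONLY"        # reuse cached artifacts only
os.environ["MLC_MULTI_ARCH"] = "80,86,89"        # fatbin for several GPU archs
```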
Quick Install
```bash
# Install core Python dependencies
pip install mlc-llm

# Or install individual packages
pip install apache-tvm-ffi torch transformers safetensors sentencepiece tiktoken
pip install flashinfer-python   # Linux only, requires CUDA >= sm_80
pip install "ml_dtypes>=0.5.1"  # quoted so the shell does not treat >= as redirection
```
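After installing, a quick sanity check confirms the modules are importable. Note that some module names differ from the pip package names (e.g. `flashinfer-python` installs `flashinfer`; the importable name for `apache-tvm-ffi` is assumed here to be `tvm_ffi`):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Module names, not pip package names (see note above).
required = ["torch", "transformers", "safetensors", "sentencepiece",
            "tiktoken", "ml_dtypes", "flashinfer", "tvm_ffi"]
# print(missing_packages(required)) -> [] on a complete install
```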
Code Evidence
FlashInfer requires CUDA arch >= 80, from `compiler_flags.py:87-101`:
```python
def _flashinfer(target) -> bool:
    from mlc_llm.support.auto_target import detect_cuda_arch_list

    if not self.flashinfer:
        return False
    if target.kind.name != "cuda":
        return False
    arch_list = detect_cuda_arch_list(target)
    for arch in arch_list:
        if arch < 80:
            logger.warning("flashinfer is not supported on CUDA arch < 80")
            return False
    return True
```
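The gate above depends on engine state (`self.flashinfer`). Stripped of that context, the same decision can be written as a pure function (a standalone sketch with hypothetical names, not mlc_llm API):

```python
def flashinfer_supported(enabled: bool, target_kind: str, arch_list) -> bool:
    """Mirror the gating logic: FlashInfer needs CUDA and every detected
    arch must be >= sm_80 (Ampere or newer)."""
    if not enabled or target_kind != "cuda":
        return False
    return all(arch >= 80 for arch in arch_list)
```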
cuBLAS dispatch requires CUDA or ROCm target, from `blas_dispatch.py:20-32`:
```python
def __init__(self, target: tvm.target.Target) -> None:
    if target.kind.name == "cuda":
        self.has_blas = tvm.get_global_func("relax.ext.cublas", True)
        if not self.has_blas:
            raise Exception("cuBLAS is not enabled.")
        self.patterns = get_patterns_with_prefix("cublas")
    elif target.kind.name == "rocm":
        self.has_blas = tvm.get_global_func("relax.ext.hipblas", True)
        if not self.has_blas:
            raise Exception("hipBLAS is not enabled.")
    else:
        raise Exception(f"Unsupported target {target.kind.name} for BLAS dispatch.")
```
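The branching above amounts to a small table: each supported target kind maps to a TVM runtime extension and a pattern prefix. A hypothetical helper capturing that mapping:

```python
# Which TVM runtime extension and pattern prefix each target kind uses
# (data taken from the dispatch code above; helper name is illustrative).
BLAS_BACKENDS = {
    "cuda": ("relax.ext.cublas", "cublas"),
    "rocm": ("relax.ext.hipblas", "hipblas"),
}

def blas_backend(target_kind: str):
    try:
        return BLAS_BACKENDS[target_kind]
    except KeyError:
        raise ValueError(f"Unsupported target {target_kind} for BLAS dispatch.")
```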
Triton kernel dispatch is CUDA-only, from `dispatch_triton_kernel.py:169-174`:
```python
def transform_module(self, mod: IRModule, _ctx: tvm.transform.PassContext) -> IRModule:
    if self.target.kind.name != "cuda":
        return mod
    return _Rewriter(mod, self.target).transform()
```
CUDA multi-arch detection from `auto_target.py:313-328`:
```python
def detect_cuda_arch_list(target: Target) -> List[int]:
    assert target.kind.name == "cuda"
    if MLC_MULTI_ARCH is not None:
        multi_arch = [convert_to_num(x) for x in MLC_MULTI_ARCH.split(",")]
    else:
        assert target.arch.startswith("sm_")
        multi_arch = [convert_to_num(target.arch[3:])]
    return list(set(multi_arch))
```
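The `convert_to_num` call is assumed here to drop trailing letter suffixes such as the `a` in `90a` (sm_90a). A standalone sketch of parsing an `MLC_MULTI_ARCH` value under that assumption:

```python
def parse_arch(token: str) -> int:
    """Parse one MLC_MULTI_ARCH entry, dropping a trailing letter suffix
    (assumed behavior of convert_to_num, e.g. '90a' -> 90)."""
    return int(token.rstrip("abcdefghijklmnopqrstuvwxyz"))

def parse_multi_arch(value: str):
    """Deduplicated arch list, mirroring detect_cuda_arch_list (sorted here
    for determinism; the original returns an unordered set-to-list)."""
    return sorted({parse_arch(x) for x in value.split(",")})
```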
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `cuBLAS is not enabled.` | TVM built without cuBLAS support | Rebuild TVM with `USE_CUBLAS=ON` in cmake config |
| `flashinfer is not supported on CUDA arch < 80` | GPU too old for FlashInfer | Use O0 or O1 optimization level, or upgrade GPU to Ampere+ |
| `Insufficient GPU memory error` | Model weights + buffers exceed VRAM | Set larger `gpu_memory_utilization`, use quantization, or enable tensor parallelism |
| `Cannot find compilation output, compilation failed` | JIT compilation failed | Check CUDA toolkit installation and nvcc availability |
| `JIT is disabled by MLC_JIT_POLICY=OFF` | JIT compilation policy disallows recompilation | Set `MLC_JIT_POLICY=ON` or provide a precompiled model library |
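The two JIT-related errors in the table follow from the policy values listed earlier. The function below is one plausible reading of their semantics (assumed, not copied from mlc_llm source): `ON` compiles only when no cached library exists, `REDO` always recompiles, `OFF` refuses JIT entirely, and `READONLY` reuses cached artifacts but never compiles.

```python
def jit_action(policy: str, cached: bool) -> str:
    """Decide what JIT does for a given policy and cache state.
    Assumed semantics; verify against your mlc_llm version."""
    if policy == "OFF":
        return "error"                               # JIT disabled: supply a precompiled lib
    if policy == "REDO":
        return "compile"                             # force recompilation
    if policy == "READONLY":
        return "use-cache" if cached else "error"    # reuse only, never compile
    return "use-cache" if cached else "compile"      # default "ON"
```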
Compatibility Notes
- FlashInfer: Only available on Linux (`sys_platform == 'linux'`). Requires NVIDIA GPU with compute capability >= 80 (Ampere: A100, RTX 3090 or newer).
- cuBLAS GEMM: Only enabled for unquantized models (`q0f16`, `q0bf16`, `q0f32`) or FP8 quantized models (`e4m3`, `e5m2`). Not used for INT4/INT3 quantization.
- CUDA Graphs: Only enabled at O2+ optimization level. Provides kernel launch overhead reduction.
- Multi-arch fatbin: Set `MLC_MULTI_ARCH=80,86,89,90a` to generate code for multiple GPU architectures in a single binary.
- ROCm (AMD): Supported via hipBLAS dispatch. Uses `thrust`, `rocblas`, `miopen`, `hipblas` libraries.
- Windows: Compilation may return nonzero exit code even on success; the system checks for output file existence instead.
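The cuBLAS quantization gate noted above can be sketched as a simple predicate (the set membership is taken from the note; the helper name and FP8 substring check are assumptions):

```python
# Quantization schemes eligible for cuBLAS GEMM dispatch
UNQUANTIZED = {"q0f16", "q0bf16", "q0f32"}

def cublas_gemm_eligible(quantization: str) -> bool:
    """cuBLAS GEMM applies to unquantized or FP8 (e4m3/e5m2) schemes only;
    INT4/INT3 schemes such as q4f16_1 use other kernels."""
    return (quantization in UNQUANTIZED
            or "e4m3" in quantization
            or "e5m2" in quantization)
```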