Environment: mlc-ai/mlc-llm CUDA GPU Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, GPU_Acceleration, Deep_Learning |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Linux-based NVIDIA GPU environment with CUDA toolkit, cuBLAS, FlashInfer, and optional Triton for high-performance LLM inference compilation and serving.
Description
This environment provides GPU-accelerated inference for MLC-LLM on NVIDIA hardware. It includes the full CUDA toolkit for kernel compilation, cuBLAS/hipBLAS for GEMM dispatch, FlashInfer for optimized attention kernels and GPU sampling, and Triton for FP8 quantized matmul kernels. The environment supports multi-GPU tensor parallelism via IPC allreduce and CUDA graph capture for reduced kernel launch overhead. FlashInfer requires CUDA compute capability >= 80 (Ampere or newer).
Usage
Use this environment for any model compilation targeting CUDA GPUs, REST API serving with GPU acceleration, or Python engine inference. It is the mandatory prerequisite for running the compilation pipeline with optimization level O2 or O3, which enable FlashInfer, cuBLAS GEMM dispatch, CUDA graphs, and CUTLASS kernels.
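As a sketch of the typical workflow, the commands below compile a model library at O2 and serve it. Model paths and the output name are illustrative, and the flag spellings are assumptions based on the current `mlc_llm` CLI; verify with `mlc_llm compile --help` on your installation.

```shell
# Compile a pre-converted MLC model for CUDA at optimization level O2
# (enables FlashInfer, cuBLAS dispatch, and CUDA graphs on supported GPUs)
mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC \
    --device cuda --opt O2 -o ./dist/libs/llama3-cuda.so

# Serve the compiled library over the REST API
mlc_llm serve ./dist/Llama-3-8B-Instruct-q4f16_1-MLC \
    --model-lib ./dist/libs/llama3-cuda.so
```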
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | Windows supported via WSL2 or native DLL |
| Hardware | NVIDIA GPU with CUDA support | FlashInfer requires compute capability >= 80 (A100, RTX 3090+) |
| VRAM | Minimum 4GB | 7B models need ~4-8GB (quantized), 70B models need multi-GPU |
| Disk | 10GB+ SSD | For compiled model libraries and cached JIT artifacts |
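The VRAM figures above follow from simple arithmetic on parameter count and bits per weight. The helper below is a back-of-envelope estimate for the weights alone (KV cache and temporary buffers come on top), not a measurement taken from mlc_llm itself.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB: params * bits / 8 bytes, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# A 7B model at ~4 bits/weight fits in roughly 3.3 GiB of weights,
# consistent with the ~4-8 GB guidance above once runtime buffers are added;
# a 70B model at 4 bits needs ~33 GiB and hence multi-GPU or a very large card.
seven_b = estimate_weight_gib(7e9, 4)
seventy_b = estimate_weight_gib(70e9, 4)
```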
Dependencies
System Packages
- `cuda-toolkit` (CUDA 12.x recommended)
- `cudnn` (for cuDNN acceleration)
- `cmake` < 4.0
- `git`
- `bzip2`
Python Packages
- `torch` (PyTorch, used for weight conversion)
- `apache-tvm-ffi` (TVM FFI bindings, core runtime)
- `flashinfer-python` (Linux only, for FlashInfer attention kernels)
- `ml_dtypes` >= 0.5.1
- `transformers` (for model config and tokenizers)
- `safetensors` (for weight loading)
- `sentencepiece` (for tokenizer support)
- `tiktoken` (for tokenizer support)
Environment Variables
No credentials are required. The following environment variables may be set to configure behavior:
- `MLC_LLM_HOME`: Override the default cache directory for compiled model libraries (default: `~/.cache/mlc_llm`).
- `MLC_JIT_POLICY`: Controls JIT compilation behavior. Values: `ON` (default), `OFF`, `REDO`, `READONLY`.
- `MLC_MULTI_ARCH`: Comma-separated CUDA architectures for multi-arch fatbin compilation. Example: `70,72,75,80,86,87,89,90a`.
- `MLC_TEMP_DIR`: Override temporary directory for compilation artifacts.
- `SKIP_LOADING_MLCLLM_SO`: Set to `1` to skip loading the MLC-LLM shared library.
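These variables are read by the library at startup, so they must be exported before mlc_llm is imported. A minimal sketch, with illustrative values:

```python
import os

# Set configuration before the first `import mlc_llm` in the process;
# the paths and values here are examples, not defaults.
os.environ["MLC_LLM_HOME"] = "/data/mlc-cache"   # custom compiled-library cache
os.environ["MLC_JIT_POLICY"] = "READONLY"        # reuse cached artifacts only
os.environ["MLC_MULTI_ARCH"] = "80,86,89"        # fatbin for several GPU archs
```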
Quick Install
```bash
# Install core Python dependencies
pip install mlc-llm

# Or install individual packages
pip install apache-tvm-ffi torch transformers safetensors sentencepiece tiktoken
pip install flashinfer-python   # Linux only, requires CUDA >= sm_80
pip install "ml_dtypes>=0.5.1"  # quoted so the shell does not treat >= as redirection
```
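After installing, a quick sanity check confirms the modules are importable. Note that some module names differ from the pip package names (e.g. `flashinfer-python` installs `flashinfer`; the importable name for `apache-tvm-ffi` is assumed here to be `tvm_ffi`):

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Module names, not pip package names (see note above).
required = ["torch", "transformers", "safetensors", "sentencepiece",
            "tiktoken", "ml_dtypes", "flashinfer", "tvm_ffi"]
# print(missing_packages(required)) -> [] on a complete install
```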
Code Evidence
FlashInfer requires CUDA arch >= 80, from `compiler_flags.py:87-101`:
```python
def _flashinfer(target) -> bool:
    from mlc_llm.support.auto_target import detect_cuda_arch_list

    if not self.flashinfer:
        return False
    if target.kind.name != "cuda":
        return False
    arch_list = detect_cuda_arch_list(target)
    for arch in arch_list:
        if arch < 80:
            logger.warning("flashinfer is not supported on CUDA arch < 80")
            return False
    return True
```
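The gate above depends on engine state (`self.flashinfer`). Stripped of that context, the same decision can be written as a pure function (a standalone sketch with hypothetical names, not mlc_llm API):

```python
def flashinfer_supported(enabled: bool, target_kind: str, arch_list) -> bool:
    """Mirror the gating logic: FlashInfer needs CUDA and every detected
    arch must be >= sm_80 (Ampere or newer)."""
    if not enabled or target_kind != "cuda":
        return False
    return all(arch >= 80 for arch in arch_list)
```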
cuBLAS dispatch requires CUDA or ROCm target, from `blas_dispatch.py:20-32`:
```python
def __init__(self, target: tvm.target.Target) -> None:
    if target.kind.name == "cuda":
        self.has_blas = tvm.get_global_func("relax.ext.cublas", True)
        if not self.has_blas:
            raise Exception("cuBLAS is not enabled.")
        self.patterns = get_patterns_with_prefix("cublas")
    elif target.kind.name == "rocm":
        self.has_blas = tvm.get_global_func("relax.ext.hipblas", True)
        if not self.has_blas:
            raise Exception("hipBLAS is not enabled.")
    else:
        raise Exception(f"Unsupported target {target.kind.name} for BLAS dispatch.")
```
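The branching above amounts to a small table: each supported target kind maps to a TVM runtime extension and a pattern prefix. A hypothetical helper capturing that mapping:

```python
# Which TVM runtime extension and pattern prefix each target kind uses
# (data taken from the dispatch code above; helper name is illustrative).
BLAS_BACKENDS = {
    "cuda": ("relax.ext.cublas", "cublas"),
    "rocm": ("relax.ext.hipblas", "hipblas"),
}

def blas_backend(target_kind: str):
    try:
        return BLAS_BACKENDS[target_kind]
    except KeyError:
        raise ValueError(f"Unsupported target {target_kind} for BLAS dispatch.")
```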
Triton kernel dispatch is CUDA-only, from `dispatch_triton_kernel.py:169-174`:
```python
def transform_module(self, mod: IRModule, _ctx: tvm.transform.PassContext) -> IRModule:
    if self.target.kind.name != "cuda":
        return mod
    return _Rewriter(mod, self.target).transform()
```
CUDA multi-arch detection from `auto_target.py:313-328`:
```python
def detect_cuda_arch_list(target: Target) -> List[int]:
    assert target.kind.name == "cuda"
    if MLC_MULTI_ARCH is not None:
        multi_arch = [convert_to_num(x) for x in MLC_MULTI_ARCH.split(",")]
    else:
        assert target.arch.startswith("sm_")
        multi_arch = [convert_to_num(target.arch[3:])]
    return list(set(multi_arch))
```
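The `convert_to_num` call is assumed here to drop trailing letter suffixes such as the `a` in `90a` (sm_90a). A standalone sketch of parsing an `MLC_MULTI_ARCH` value under that assumption:

```python
def parse_arch(token: str) -> int:
    """Parse one MLC_MULTI_ARCH entry, dropping a trailing letter suffix
    (assumed behavior of convert_to_num, e.g. '90a' -> 90)."""
    return int(token.rstrip("abcdefghijklmnopqrstuvwxyz"))

def parse_multi_arch(value: str):
    """Deduplicated arch list, mirroring detect_cuda_arch_list (sorted here
    for determinism; the original returns an unordered set-to-list)."""
    return sorted({parse_arch(x) for x in value.split(",")})
```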
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `cuBLAS is not enabled.` | TVM built without cuBLAS support | Rebuild TVM with `USE_CUBLAS=ON` in cmake config |
| `flashinfer is not supported on CUDA arch < 80` | GPU too old for FlashInfer | Use O0 or O1 optimization level, or upgrade GPU to Ampere+ |
| `Insufficient GPU memory error` | Model weights + buffers exceed VRAM | Set larger `gpu_memory_utilization`, use quantization, or enable tensor parallelism |
| `Cannot find compilation output, compilation failed` | JIT compilation failed | Check CUDA toolkit installation and nvcc availability |
| `JIT is disabled by MLC_JIT_POLICY=OFF` | JIT compilation policy disallows recompilation | Set `MLC_JIT_POLICY=ON` or provide a precompiled model library |
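The two JIT-related errors in the table follow from the policy values listed earlier. The function below is one plausible reading of their semantics (assumed, not copied from mlc_llm source): `ON` compiles only when no cached library exists, `REDO` always recompiles, `OFF` refuses JIT entirely, and `READONLY` reuses cached artifacts but never compiles.

```python
def jit_action(policy: str, cached: bool) -> str:
    """Decide what JIT does for a given policy and cache state.
    Assumed semantics; verify against your mlc_llm version."""
    if policy == "OFF":
        return "error"                               # JIT disabled: supply a precompiled lib
    if policy == "REDO":
        return "compile"                             # force recompilation
    if policy == "READONLY":
        return "use-cache" if cached else "error"    # reuse only, never compile
    return "use-cache" if cached else "compile"      # default "ON"
```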
Compatibility Notes
- FlashInfer: Only available on Linux (`sys_platform == 'linux'`). Requires NVIDIA GPU with compute capability >= 80 (Ampere: A100, RTX 3090 or newer).
- cuBLAS GEMM: Only enabled for unquantized models (`q0f16`, `q0bf16`, `q0f32`) or FP8 quantized models (`e4m3`, `e5m2`). Not used for INT4/INT3 quantization.
- CUDA Graphs: Only enabled at O2+ optimization level. Provides kernel launch overhead reduction.
- Multi-arch fatbin: Set `MLC_MULTI_ARCH=80,86,89,90a` to generate code for multiple GPU architectures in a single binary.
- ROCm (AMD): Supported via hipBLAS dispatch. Uses `thrust`, `rocblas`, `miopen`, `hipblas` libraries.
- Windows: Compilation may return nonzero exit code even on success; the system checks for output file existence instead.
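The cuBLAS quantization gate noted above can be sketched as a simple predicate (the set membership is taken from the note; the helper name and FP8 substring check are assumptions):

```python
# Quantization schemes eligible for cuBLAS GEMM dispatch
UNQUANTIZED = {"q0f16", "q0bf16", "q0f32"}

def cublas_gemm_eligible(quantization: str) -> bool:
    """cuBLAS GEMM applies to unquantized or FP8 (e4m3/e5m2) schemes only;
    INT4/INT3 schemes such as q4f16_1 use other kernels."""
    return (quantization in UNQUANTIZED
            or "e4m3" in quantization
            or "e5m2" in quantization)
```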