Environment:Mit han lab Llm awq CUDA Build Environment

From Leeroopedia
Knowledge Sources
Domains: Infrastructure, CUDA
Last Updated: 2026-02-15 01:00 GMT

Overview

A CUDA 11.0+ build environment with the nvcc compiler, used to compile the AWQ inference engine and FasterTransformer attention CUDA extensions.

Description

This environment provides the CUDA compilation toolchain required to build the custom CUDA kernels that power AWQ quantized inference. The project includes two separate CUDA extension packages: awq_inference_engine (quantized GEMM/GEMV, LayerNorm, RoPE, W8A8 kernels) and ft_attention (FasterTransformer masked multi-head attention). Both require nvcc from the CUDA toolkit, and the version of the locally installed toolkit (the "bare-metal" CUDA version) must match the CUDA version used to compile the installed PyTorch binary; otherwise the build aborts with a RuntimeError. The extensions target Volta (sm_70), Ampere (sm_80/sm_86), and, when the CUDA version supports it (11.8+), Hopper (sm_90).

Usage

Use this environment when building AWQ from source or when the pre-built CUDA kernels are not available for your platform. This is a mandatory prerequisite for the awq_inference_engine import in `qmodule.py` and for any TinyChat deployment that uses fused attention, fused normalization, or quantized linear layers.

System Requirements

| Category | Requirement | Notes |
| --- | --- | --- |
| OS | Linux (Ubuntu recommended) | Windows not officially supported |
| Hardware | NVIDIA GPU (Volta or newer) | Compute capability >= 7.0 (V100, A100, H100, RTX 20xx+) |
| CUDA Toolkit | >= 11.0 (11.8+ recommended) | ft_attention raises RuntimeError below CUDA 11.0 |
| nvcc | Must be discoverable via CUDA_HOME | Docker: only 'devel' tagged images include nvcc |
| C++ Compiler | C++17 support required | GCC 7+ or equivalent |

Dependencies

System Packages

  • `cuda-toolkit` >= 11.0 (11.8+ recommended for Hopper sm_90 support)
  • `nvcc` (included in cuda-toolkit)
  • `gcc` / `g++` with C++17 support
  • `openmp` (for CPU parallelism in host code)

Python Packages

  • `torch` == 2.3.0 (the CUDA version the PyTorch binary was built with must match the installed CUDA toolkit)
  • `setuptools` >= 61.0
  • `packaging` (for Version parsing in build scripts)

Credentials

The following environment variables are used during the build process:

  • `CUDA_HOME`: Path to CUDA toolkit installation (required; nvcc must be at `$CUDA_HOME/bin/nvcc`)
  • `TORCH_CUDA_ARCH_LIST`: Optional override for target compute capabilities (e.g., `"7.0;8.0;9.0"`)
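
The sketch below illustrates how these two variables could be interpreted before a build. The helper names `resolve_nvcc` and `arch_list_to_gencode` are hypothetical and not part of the AWQ build scripts. The parser handles only plain `major.minor` entries; PyTorch's own handling of `TORCH_CUDA_ARCH_LIST` accepts additional forms (such as a `+PTX` suffix) that this simplified version ignores.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def resolve_nvcc(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return $CUDA_HOME/bin/nvcc if that file exists, else None (hypothetical helper)."""
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return None
    nvcc = Path(cuda_home) / "bin" / "nvcc"
    return nvcc if nvcc.is_file() else None


def arch_list_to_gencode(arch_list: str) -> List[str]:
    """Convert a list such as "7.0;8.0" into nvcc -gencode flags.

    Simplified assumption: every entry is a plain "major.minor" pair.
    """
    flags: List[str] = []
    # Accept both ';' and whitespace as separators.
    for entry in arch_list.replace(" ", ";").split(";"):
        entry = entry.strip()
        if not entry:
            continue
        major, minor = entry.split(".")
        sm = f"{int(major)}{int(minor)}"
        flags += ["-gencode", f"arch=compute_{sm},code=sm_{sm}"]
    return flags


if __name__ == "__main__":
    print("nvcc:", resolve_nvcc())
    print(arch_list_to_gencode(os.environ.get("TORCH_CUDA_ARCH_LIST", "7.0;8.0")))
```

Running the script prints the resolved nvcc path (or `None` when `CUDA_HOME` is unset or incomplete), which gives a quick indication of whether the build would reach the nvcc availability check.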

Quick Install

# Build AWQ inference engine (from repository root)
cd awq/kernels
python setup.py install

# Build FasterTransformer attention kernel (continuing from awq/kernels)
cd csrc/attention
python setup.py install
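
After installing, it can be useful to confirm that Python can locate the compiled extension. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and `awq_inference_engine` is the module name this page refers to. Note that `importlib.util.find_spec` only locates a top-level module without loading it, so a successful result does not prove that the shared library links correctly against the local CUDA runtime.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["awq_inference_engine"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed extensions are importable.")
```

If the module is reported as not installed, rebuilding from `awq/kernels` is the first thing to try.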

Code Evidence

CUDA minimum version check from `awq/kernels/csrc/attention/setup.py:112-113`:

if bare_metal_version < Version("11.0"):
    raise RuntimeError("ft_attention is only supported on CUDA 11 and above")

CUDA/PyTorch version matching validation from `awq/kernels/csrc/attention/setup.py:34-49`:

def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
    raw_output, bare_metal_version = get_cuda_bare_metal_version(cuda_dir)
    torch_binary_version = parse(torch.version.cuda)
    if bare_metal_version != torch_binary_version:
        raise RuntimeError(
            "Cuda extensions are being compiled with a version of Cuda that does "
            "not match the version used to compile Pytorch binaries.  "
            "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda)
        )
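
To make the comparison above concrete, here is a self-contained sketch that avoids depending on `torch` or `packaging`. It assumes the toolkit version is read from the `release X.Y` token that `nvcc --version` prints, and that the PyTorch side is a string such as the value of `torch.version.cuda`. The function names are hypothetical, and version tuples stand in for the `Version` objects used by the real script.

```python
import re
from typing import Optional, Tuple

CudaVersion = Tuple[int, int]


def parse_nvcc_release(nvcc_output: str) -> Optional[CudaVersion]:
    """Extract (major, minor) from text such as 'release 11.8, V11.8.89'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(version: str) -> CudaVersion:
    """Turn a string like '11.8' into (11, 8)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_match(nvcc_output: str, torch_cuda: str) -> bool:
    """True when the toolkit and the PyTorch build agree on major.minor."""
    toolkit = parse_nvcc_release(nvcc_output)
    return toolkit is not None and toolkit == parse_torch_cuda(torch_cuda)


if __name__ == "__main__":
    sample = "Cuda compilation tools, release 11.8, V11.8.89"
    print(versions_match(sample, "11.8"))
    print(versions_match(sample, "12.1"))
```

Running such a check before `python setup.py install` surfaces a mismatch early, without starting the build.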

nvcc availability check from `awq/kernels/csrc/attention/setup.py:52-59`:

def raise_if_cuda_home_none(global_option: str) -> None:
    if CUDA_HOME is not None:
        return
    raise RuntimeError(
        f"{global_option} was requested, but nvcc was not found.  "
        "Are you sure your environment has nvcc available?"
    )

Architecture-specific code generation from `awq/kernels/csrc/attention/setup.py:114-120`:

cc_flag.append("-gencode")
cc_flag.append("arch=compute_70,code=sm_70")
cc_flag.append("-gencode")
cc_flag.append("arch=compute_80,code=sm_80")
if bare_metal_version >= Version("11.8"):
    cc_flag.append("-gencode")
    cc_flag.append("arch=compute_90,code=sm_90")
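
The same selection logic can be written as a small pure function, which makes it easy to see which flags a given toolkit version produces. This is an illustrative restatement of the two excerpts above, not the project's code; `gencode_flags` is a hypothetical name, and a `(major, minor)` tuple stands in for the `Version` object.

```python
from typing import List, Tuple


def gencode_flags(cuda_version: Tuple[int, int]) -> List[str]:
    """Return nvcc -gencode flags for a (major, minor) CUDA toolkit version."""
    if cuda_version < (11, 0):
        raise RuntimeError("ft_attention is only supported on CUDA 11 and above")
    archs = ["70", "80"]          # Volta and Ampere are always targeted
    if cuda_version >= (11, 8):   # Hopper needs CUDA 11.8 or newer
        archs.append("90")
    flags: List[str] = []
    for sm in archs:
        flags += ["-gencode", f"arch=compute_{sm},code=sm_{sm}"]
    return flags


if __name__ == "__main__":
    for version in [(11, 7), (11, 8), (12, 1)]:
        print(version, gencode_flags(version))
```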

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| `RuntimeError: ft_attention is only supported on CUDA 11 and above` | CUDA toolkit version < 11.0 | Install CUDA 11.0+ toolkit |
| `RuntimeError: <option> was requested, but nvcc was not found.` | CUDA_HOME not set or nvcc missing | Set `CUDA_HOME` to CUDA toolkit path; use Docker 'devel' images |
| `RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.` | Bare-metal CUDA != PyTorch CUDA | Reinstall PyTorch with matching CUDA version or update CUDA toolkit |
| `ImportError: awq_inference_engine` | CUDA extensions not compiled | Run `cd awq/kernels && python setup.py install` |

Compatibility Notes

  • CUDA 11.0-11.7: Supports Volta (sm_70) and Ampere (sm_80) only; no Hopper support
  • CUDA 11.8+: Adds Hopper (sm_90) support for H100 GPUs
  • CUDA 11.2+: Enables multi-threaded nvcc compilation (`--threads 4`)
  • Cross-compilation: Supported when no GPU is visible; auto-detects architecture list from CUDA version
  • Jetson (Edge): Cannot use PyTorch 2.3.0; must use NVIDIA prebuilt PyTorch >= 2.0.0 and set appropriate `TORCH_CUDA_ARCH_LIST`
  • BF16 Support: Enabled by default (`-DENABLE_BF16`); requires Ampere or newer GPU
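
The version-dependent notes above can be summarized in a short sketch. Both helpers are hypothetical and only encode what this page states: `--threads 4` from CUDA 11.2 onward, `-DENABLE_BF16` on by default, and BF16 needing Ampere (compute capability 8.0) or newer. How the real build scripts assemble their flag lists is not shown here and may differ.

```python
from typing import List, Tuple


def nvcc_extra_flags(cuda_version: Tuple[int, int], threads: int = 4) -> List[str]:
    """Extra nvcc flags implied by the compatibility notes (illustrative only)."""
    flags = ["-DENABLE_BF16"]            # BF16 is enabled by default
    if cuda_version >= (11, 2):          # multi-threaded nvcc needs CUDA 11.2+
        flags += ["--threads", str(threads)]
    return flags


def gpu_supports_bf16(compute_capability: Tuple[int, int]) -> bool:
    """BF16 requires Ampere (compute capability 8.0) or newer."""
    return compute_capability >= (8, 0)


if __name__ == "__main__":
    print(nvcc_extra_flags((11, 0)))
    print(nvcc_extra_flags((11, 8)))
    print(gpu_supports_bf16((7, 0)), gpu_supports_bf16((8, 6)))
```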
