Environment:Mit han lab Llm awq CUDA Build Environment

Knowledge Sources	llm-awq, NVIDIA CUDA Toolkit
Domains	Infrastructure, CUDA
Last Updated	2026-02-15 01:00 GMT

Overview

A CUDA 11.0+ build environment that provides the nvcc compiler needed to build the AWQ inference engine and FasterTransformer attention CUDA extensions.

Description

This environment provides the CUDA compilation toolchain required to build the custom CUDA kernels that power AWQ quantized inference. The project includes two separate CUDA extension packages: awq_inference_engine (quantized GEMM/GEMV, LayerNorm, RoPE, and W8A8 kernels) and ft_attention (FasterTransformer masked multi-head attention). Both require nvcc from the CUDA toolkit, and the locally installed ("bare-metal") CUDA version must match the CUDA version used to compile the installed PyTorch binary. The extensions target Volta (sm_70) and Ampere (sm_80/sm_86), plus Hopper (sm_90) when the toolkit is CUDA 11.8 or newer.

Usage

Use this environment when building AWQ from source or when the pre-built CUDA kernels are not available for your platform. This is a mandatory prerequisite for the awq_inference_engine import in `qmodule.py` and for any TinyChat deployment that uses fused attention, fused normalization, or quantized linear layers.

System Requirements

Category	Requirement	Notes
OS	Linux (Ubuntu recommended)	Windows not officially supported
Hardware	NVIDIA GPU (Volta or newer)	Compute capability >= 7.0 (V100, A100, H100, RTX 20xx+)
CUDA Toolkit	>= 11.0 (11.8+ recommended)	ft_attention raises RuntimeError below CUDA 11.0
nvcc	Must be discoverable via CUDA_HOME	Docker: only 'devel' tagged images include nvcc
C++ Compiler	C++17 support required	GCC 7+ or equivalent

Dependencies

System Packages

`cuda-toolkit` >= 11.0 (11.8+ recommended for Hopper sm_90 support)
`nvcc` (included in cuda-toolkit)
`gcc` / `g++` with C++17 support
`openmp` (for CPU parallelism in host code)

Python Packages

`torch` == 2.3.0 (the CUDA version it was built with must match the installed CUDA toolkit)
`setuptools` >= 61.0
`packaging` (for Version parsing in build scripts)

Credentials

The following environment variables are used during the build process:

`CUDA_HOME`: Path to CUDA toolkit installation (required; nvcc must be at `$CUDA_HOME/bin/nvcc`)
`TORCH_CUDA_ARCH_LIST`: Optional override for target compute capabilities (e.g., `"7.0;8.0;9.0"`)
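
As a rough illustration of how these two variables can be interpreted, the sketch below resolves the expected nvcc location and splits an architecture override into entries. The helper names `expected_nvcc_path` and `parse_arch_list` are hypothetical and not part of llm-awq. The code only inspects the mapping it is given and does not reproduce any fallback lookup the real build may perform. It also assumes the override is separated by semicolons or spaces.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def expected_nvcc_path(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return $CUDA_HOME/bin/nvcc, or None when CUDA_HOME is unset or empty."""
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return None
    return Path(cuda_home) / "bin" / "nvcc"


def parse_arch_list(env: Mapping[str, str] = os.environ) -> List[str]:
    """Split TORCH_CUDA_ARCH_LIST into entries; empty list when it is unset."""
    raw = env.get("TORCH_CUDA_ARCH_LIST", "")
    # Assumption: entries are separated by semicolons or spaces.
    return [entry for entry in raw.replace(" ", ";").split(";") if entry]
```

Calling `expected_nvcc_path()` with no argument reads the current process environment. Whether the returned path actually exists can then be checked with `Path.is_file()`.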

Quick Install

# Build AWQ inference engine (from repository root)
cd awq/kernels
python setup.py install

# Build FasterTransformer attention kernel (continuing from awq/kernels)
cd csrc/attention
python setup.py install

Code Evidence

CUDA minimum version check from `awq/kernels/csrc/attention/setup.py:112-113`:

if bare_metal_version < Version("11.0"):
    raise RuntimeError("ft_attention is only supported on CUDA 11 and above")

CUDA/PyTorch version matching validation from `awq/kernels/csrc/attention/setup.py:34-49`:

def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
    raw_output, bare_metal_version = get_cuda_bare_metal_version(cuda_dir)
    torch_binary_version = parse(torch.version.cuda)
    if bare_metal_version != torch_binary_version:
        raise RuntimeError(
            "Cuda extensions are being compiled with a version of Cuda that does "
            "not match the version used to compile Pytorch binaries.  "
            "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda)
        )
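
The comparison above depends on extracting the toolkit release from the output of `nvcc --version`. A minimal sketch of that step follows, assuming the output contains a `release X.Y` token. `parse_nvcc_release` and `versions_match` are hypothetical helpers, not the functions in `setup.py`, and they compare plain integer tuples rather than `packaging` versions.

```python
import re
from typing import Optional, Tuple

# Assumption: nvcc prints a line such as "... release 11.8, V11.8.89".
_RELEASE_RE = re.compile(r"release (\d+)\.(\d+)")


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) parsed from `nvcc --version` text, or None."""
    match = _RELEASE_RE.search(nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def versions_match(nvcc_output: str, torch_cuda: Optional[str]) -> bool:
    """True when the toolkit release equals the CUDA version PyTorch reports."""
    bare_metal = parse_nvcc_release(nvcc_output)
    if bare_metal is None or not torch_cuda:
        return False  # no nvcc release found, or a CPU-only PyTorch build
    parts = torch_cuda.split(".")
    if len(parts) < 2:
        return False
    return bare_metal == (int(parts[0]), int(parts[1]))
```

In a real environment the two arguments would typically come from `subprocess.check_output([nvcc_path, "--version"], text=True)` and `torch.version.cuda`.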

nvcc availability check from `awq/kernels/csrc/attention/setup.py:52-59`:

def raise_if_cuda_home_none(global_option: str) -> None:
    if CUDA_HOME is not None:
        return
    raise RuntimeError(
        f"{global_option} was requested, but nvcc was not found.  "
        "Are you sure your environment has nvcc available?"
    )

Architecture-specific code generation from `awq/kernels/csrc/attention/setup.py:114-120`:

cc_flag.append("-gencode")
cc_flag.append("arch=compute_70,code=sm_70")
cc_flag.append("-gencode")
cc_flag.append("arch=compute_80,code=sm_80")
if bare_metal_version >= Version("11.8"):
    cc_flag.append("-gencode")
    cc_flag.append("arch=compute_90,code=sm_90")
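
The same selection logic can be written as a standalone function, which makes it easier to see which architectures a given toolkit version produces. This is a sketch only: `gencode_flags` is a hypothetical name, and versions are passed as `(major, minor)` tuples instead of the `Version` objects used in `setup.py`.

```python
from typing import List, Tuple


def gencode_flags(cuda_version: Tuple[int, int]) -> List[str]:
    """Build nvcc -gencode flags for a (major, minor) CUDA toolkit version."""
    if cuda_version < (11, 0):
        raise RuntimeError("ft_attention is only supported on CUDA 11 and above")
    archs = ["70", "80"]  # Volta and Ampere are always targeted
    if cuda_version >= (11, 8):
        archs.append("90")  # Hopper is added from CUDA 11.8 onward
    flags: List[str] = []
    for arch in archs:
        flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    return flags
```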

Common Errors

Error Message	Cause	Solution
`RuntimeError: ft_attention is only supported on CUDA 11 and above`	CUDA toolkit version < 11.0	Install CUDA 11.0+ toolkit
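`RuntimeError: <option> was requested, but nvcc was not found.`	CUDA_HOME not set or nvcc missing	Set `CUDA_HOME` to the CUDA toolkit path; in Docker, use a 'devel' image
`RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.`	Bare-metal CUDA != PyTorch CUDA	Reinstall PyTorch built for the installed CUDA version, or install the CUDA toolkit version that PyTorch was built with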
`ImportError: awq_inference_engine`	CUDA extensions not compiled	Run `cd awq/kernels && python setup.py install`
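
Before launching a deployment, it can help to check whether the compiled extension is discoverable at all. The sketch below uses only the standard library. `missing_extensions` is a hypothetical helper, and the default module name is taken from the table above. It only tests whether Python can locate the module, not whether the shared library loads against the current CUDA runtime.

```python
import importlib.util
from typing import Iterable, List


def missing_extensions(names: Iterable[str] = ("awq_inference_engine",)) -> List[str]:
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

A non-empty result suggests the build step has not been run in the active Python environment.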

Compatibility Notes

CUDA 11.0-11.7: Supports Volta (sm_70) and Ampere (sm_80) only; no Hopper support
CUDA 11.8+: Adds Hopper (sm_90) support for H100 GPUs
CUDA 11.2+: Enables multi-threaded nvcc compilation (`--threads 4`)
Cross-compilation: Supported when no GPU is visible; auto-detects architecture list from CUDA version
Jetson (Edge): Cannot use PyTorch 2.3.0; must use NVIDIA prebuilt PyTorch >= 2.0.0 and set appropriate `TORCH_CUDA_ARCH_LIST`
BF16 Support: Enabled by default (`-DENABLE_BF16`); requires Ampere or newer GPU
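
The version-dependent options in these notes can be summarized in a small helper. This is an illustrative sketch under the assumptions stated in the list above, not the project's build script: `optional_nvcc_flags` is a hypothetical name, and the flag order is arbitrary.

```python
from typing import List, Tuple


def optional_nvcc_flags(cuda_version: Tuple[int, int], enable_bf16: bool = True) -> List[str]:
    """Collect optional nvcc flags for a (major, minor) CUDA toolkit version."""
    flags: List[str] = []
    if enable_bf16:
        flags.append("-DENABLE_BF16")  # BF16 kernels; needs Ampere or newer at run time
    if cuda_version >= (11, 2):
        flags += ["--threads", "4"]  # multi-threaded nvcc compilation
    return flags
```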
