Environment:Mit han lab Llm awq CUDA Build Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, CUDA |
| Last Updated | 2026-02-15 01:00 GMT |
Overview
A CUDA 11.0+ build environment with the nvcc compiler, used to compile the AWQ inference engine and FasterTransformer attention CUDA extensions.
Description
This environment provides the CUDA compilation toolchain required to build the custom CUDA kernels that power AWQ quantized inference. The project includes two separate CUDA extension packages: awq_inference_engine (quantized GEMM/GEMV, LayerNorm, RoPE, W8A8 kernels) and ft_attention (FasterTransformer masked multi-head attention). Both require nvcc from the CUDA toolkit, and the bare-metal CUDA version (the toolkit installed on the system) must match the CUDA version used to compile the installed PyTorch binary. The extensions target Volta (sm_70) and Ampere (sm_80/sm_86), plus Hopper (sm_90) when the toolkit is CUDA 11.8 or newer.
Usage
Use this environment when building AWQ from source or when the pre-built CUDA kernels are not available for your platform. This is a mandatory prerequisite for the awq_inference_engine import in `qmodule.py` and for any TinyChat deployment that uses fused attention, fused normalization, or quantized linear layers.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | Windows not officially supported |
| Hardware | NVIDIA GPU (Volta or newer) | Compute capability >= 7.0 (V100, A100, H100, RTX 20xx+) |
| CUDA Toolkit | >= 11.0 (11.8+ recommended) | ft_attention raises RuntimeError below CUDA 11.0 |
| nvcc | Must be discoverable via CUDA_HOME | Docker: only 'devel' tagged images include nvcc |
| C++ Compiler | C++17 support required | GCC 7+ or equivalent |
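The hardware row above can be checked from Python before starting a build. The following is a minimal sketch, not part of the repository: `MIN_CAPABILITY`, `meets_minimum_capability`, and `detect_capability` are hypothetical names, and the GPU query assumes a CUDA-enabled PyTorch is installed. Without PyTorch or a visible GPU it returns `None` instead of failing.

```python
from typing import Optional, Tuple

# Minimum compute capability from the requirements table (Volta, sm_70).
MIN_CAPABILITY: Tuple[int, int] = (7, 0)


def meets_minimum_capability(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) capability is Volta or newer."""
    return tuple(capability) >= MIN_CAPABILITY


def detect_capability() -> Optional[Tuple[int, int]]:
    """Query GPU 0 through PyTorch; return None if torch or a GPU is missing."""
    try:
        import torch  # assumption: a CUDA build of PyTorch is installed
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device visible; cannot check compute capability.")
    else:
        status = "ok" if meets_minimum_capability(cap) else "too old"
        print(f"sm_{cap[0]}{cap[1]}: {status}")
```

Keeping the comparison in a pure function means it can also be applied to a capability read from another source, for example a value noted from the GPU's data sheet.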
Dependencies
System Packages
- `cuda-toolkit` >= 11.0 (11.8+ recommended for Hopper sm_90 support)
- `nvcc` (included in cuda-toolkit)
- `gcc` / `g++` with C++17 support
- `openmp` (for CPU parallelism in host code)
Python Packages
- `torch` == 2.3.0 (the CUDA version it was built with must match the installed CUDA toolkit)
- `setuptools` >= 61.0
- `packaging` (for Version parsing in build scripts)
Credentials
The following environment variables are used during the build process:
- `CUDA_HOME`: Path to CUDA toolkit installation (required; nvcc must be at `$CUDA_HOME/bin/nvcc`)
- `TORCH_CUDA_ARCH_LIST`: Optional override for target compute capabilities (e.g., `"7.0;8.0;9.0"`)
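To make the two variables concrete, here is a small sketch of how a script might read them. It is an illustration under stated assumptions, not the project's build logic: `nvcc_path` and `parse_arch_list` are hypothetical helpers, and the parser only splits the list into entries without validating them.

```python
import os
from typing import List, Mapping, Optional


def nvcc_path(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return $CUDA_HOME/bin/nvcc if CUDA_HOME is set, else None."""
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return None
    return os.path.join(cuda_home, "bin", "nvcc")


def parse_arch_list(value: str) -> List[str]:
    """Split a TORCH_CUDA_ARCH_LIST value such as "7.0;8.0;9.0" into entries."""
    # This sketch accepts ';' (as in the example above) or whitespace.
    return [item for item in value.replace(";", " ").split() if item]


if __name__ == "__main__":
    path = nvcc_path()
    if path is None:
        print("CUDA_HOME is not set")
    else:
        found = os.path.isfile(path)
        print(f"{path}: {'found' if found else 'missing'}")
    print(parse_arch_list(os.environ.get("TORCH_CUDA_ARCH_LIST", "")))
```

Passing the environment as a mapping keeps the helper easy to exercise with a plain dictionary instead of the real process environment.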
Quick Install
# Build AWQ inference engine (from repository root)
cd awq/kernels
python setup.py install
# Build FasterTransformer attention kernel (continuing from awq/kernels)
cd csrc/attention
python setup.py install
Code Evidence
CUDA minimum version check from `awq/kernels/csrc/attention/setup.py:112-113`:
if bare_metal_version < Version("11.0"):
    raise RuntimeError("ft_attention is only supported on CUDA 11 and above")
CUDA/PyTorch version matching validation from `awq/kernels/csrc/attention/setup.py:34-49`:
def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
    raw_output, bare_metal_version = get_cuda_bare_metal_version(cuda_dir)
    torch_binary_version = parse(torch.version.cuda)
    if bare_metal_version != torch_binary_version:
        raise RuntimeError(
            "Cuda extensions are being compiled with a version of Cuda that does "
            "not match the version used to compile Pytorch binaries. "
            "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda)
        )
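The excerpt above calls `get_cuda_bare_metal_version`, whose body is not shown here. As a rough illustration of the comparison it feeds, the sketch below extracts the release number from `nvcc --version` text and compares it with a `torch.version.cuda` string. The helper names are hypothetical and the regex-based parsing is an assumption, not the repository's implementation. It also assumes a CUDA build of PyTorch, where `torch.version.cuda` is a string rather than `None`.

```python
import re
from typing import Tuple


def parse_nvcc_release(nvcc_output: str) -> Tuple[int, int]:
    """Extract (major, minor) from the 'release X.Y' token of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a 'release X.Y' token in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(version: str) -> Tuple[int, int]:
    """Turn a torch.version.cuda string such as '11.8' into (11, 8)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_match(nvcc_output: str, torch_cuda: str) -> bool:
    """Apply the same equality test as the excerpt, on (major, minor) tuples."""
    return parse_nvcc_release(nvcc_output) == parse_torch_cuda(torch_cuda)


# One line of typical `nvcc --version` output, used as sample input.
SAMPLE = "Cuda compilation tools, release 11.8, V11.8.89"
print(versions_match(SAMPLE, "11.8"))  # True
print(versions_match(SAMPLE, "12.1"))  # False
```

Running the same comparison before invoking `setup.py` surfaces a mismatch early, without waiting for the build script to raise.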
nvcc availability check from `awq/kernels/csrc/attention/setup.py:52-59`:
def raise_if_cuda_home_none(global_option: str) -> None:
    if CUDA_HOME is not None:
        return
    raise RuntimeError(
        f"{global_option} was requested, but nvcc was not found. "
        "Are you sure your environment has nvcc available?"
    )
Architecture-specific code generation from `awq/kernels/csrc/attention/setup.py:114-120`:
cc_flag.append("-gencode")
cc_flag.append("arch=compute_70,code=sm_70")
cc_flag.append("-gencode")
cc_flag.append("arch=compute_80,code=sm_80")
if bare_metal_version >= Version("11.8"):
    cc_flag.append("-gencode")
    cc_flag.append("arch=compute_90,code=sm_90")
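The version gate and the flag list from the excerpts above can be folded into one function for experimentation. This is a standalone sketch, not the project's code: `gencode_flags` is a hypothetical name, and it takes a plain `(major, minor)` tuple instead of the `Version` objects used in `setup.py`.

```python
from typing import List, Tuple


def gencode_flags(cuda_version: Tuple[int, int]) -> List[str]:
    """Return nvcc -gencode flags for a given CUDA toolkit (major, minor)."""
    if cuda_version < (11, 0):
        # Same condition and message as the minimum-version check.
        raise RuntimeError("ft_attention is only supported on CUDA 11 and above")
    archs = ["70", "80"]  # Volta and Ampere are always targeted
    if cuda_version >= (11, 8):
        archs.append("90")  # Hopper needs CUDA 11.8 or newer
    flags: List[str] = []
    for arch in archs:
        flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    return flags


if __name__ == "__main__":
    print(" ".join(gencode_flags((11, 8))))
```

Calling it with `(11, 7)` and `(11, 8)` shows the only difference between the two toolkits: the extra `sm_90` pair.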
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: ft_attention is only supported on CUDA 11 and above` | CUDA toolkit version < 11.0 | Install CUDA 11.0+ toolkit |
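| `RuntimeError: {global_option} was requested, but nvcc was not found.` | CUDA_HOME not set or nvcc missing | Set `CUDA_HOME` to the CUDA toolkit path; in Docker, use a 'devel' image |
| `RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.` | Bare-metal CUDA != PyTorch CUDA | Reinstall PyTorch built for the toolkit's CUDA version, or install the CUDA toolkit that matches `torch.version.cuda` |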
| `ImportError: awq_inference_engine` | CUDA extensions not compiled | Run `cd awq/kernels && python setup.py install` |
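The last row can be detected ahead of time without triggering the import itself. The sketch below uses the standard library's `importlib.util.find_spec`. `extension_installed` is a hypothetical helper, and `awq_inference_engine` is the module name mentioned earlier on this page.

```python
import importlib.util


def extension_installed(module_name: str = "awq_inference_engine") -> bool:
    """Return True if the named top-level module can be located for import."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if extension_installed():
        print("awq_inference_engine is installed.")
    else:
        print("awq_inference_engine is missing; build it from awq/kernels.")
```

This only confirms that the module can be found. An extension built against a different PyTorch or CUDA version may still fail when it is actually loaded.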
Compatibility Notes
- CUDA 11.0-11.7: Supports Volta (sm_70) and Ampere (sm_80) only; no Hopper support
- CUDA 11.8+: Adds Hopper (sm_90) support for H100 GPUs
- CUDA 11.2+: Enables multi-threaded nvcc compilation (`--threads 4`)
- Cross-compilation: Supported. When no GPU is visible at build time, the build selects the architecture list from the CUDA toolkit version
- Jetson (Edge): Cannot use PyTorch 2.3.0; must use NVIDIA prebuilt PyTorch >= 2.0.0 and set appropriate `TORCH_CUDA_ARCH_LIST`
- BF16 Support: Enabled by default (`-DENABLE_BF16`); requires Ampere or newer GPU
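The version-dependent compiler options in these notes can be summarized in a few lines. This is a sketch based only on the notes above: `nvcc_extra_flags` is a hypothetical name, and the real build scripts may assemble their flags differently.

```python
from typing import List, Tuple


def nvcc_extra_flags(cuda_version: Tuple[int, int], threads: int = 4) -> List[str]:
    """Return extra nvcc flags implied by the compatibility notes."""
    flags = ["-DENABLE_BF16"]  # BF16 is enabled by default
    if cuda_version >= (11, 2):
        # Multi-threaded nvcc compilation is enabled from CUDA 11.2 onward.
        flags += ["--threads", str(threads)]
    return flags


if __name__ == "__main__":
    for version in [(11, 0), (11, 2), (11, 8)]:
        print(version, nvcc_extra_flags(version))
```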