Environment: turboderp-org ExLlamaV2 Build Toolchain
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Build_System |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
C++/CUDA compilation toolchain required for building the ExLlamaV2 native extension, either at install time or via JIT compilation at first run.
Description
ExLlamaV2 includes a C++/CUDA native extension (`exllamav2_ext`) that implements performance-critical kernels for quantized matrix operations, attention, normalization, sampling, and tensor parallelism. The extension consists of 14 C++ source files and 25 CUDA kernel files.
The extension can be built in two ways:
- Pre-compiled wheel: Distributed via PyPI for common CUDA/PyTorch/OS combinations. No build tools needed at runtime.
- JIT compilation: If no pre-compiled wheel is available, the extension is compiled on first import using `torch.utils.cpp_extension.load()`. This requires a C++ compiler and CUDA toolkit.
Setting the `EXLLAMA_NOCOMPILE` environment variable skips precompilation during `pip install`, deferring the build to JIT compilation at first run.
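The gating logic can be sketched roughly as follows. This is a simplified illustration, not the project's actual `setup.py`; the helper name `should_precompile` is hypothetical:

```python
import os

def should_precompile() -> bool:
    """Decide whether install-time compilation of the native extension runs.

    Mirrors the documented behavior: if EXLLAMA_NOCOMPILE is set (to any
    value), precompilation is skipped and the extension is JIT-compiled
    on first import instead.
    """
    return "EXLLAMA_NOCOMPILE" not in os.environ

# With the variable unset, precompilation proceeds.
os.environ.pop("EXLLAMA_NOCOMPILE", None)
assert should_precompile()

# Setting it defers the build to JIT compilation at first run.
os.environ["EXLLAMA_NOCOMPILE"] = "1"
assert not should_precompile()
```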
Usage
This environment is required when installing from source or when no pre-built wheel matches your CUDA/PyTorch combination. If using a pre-built wheel, the build toolchain is not needed at runtime.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (GCC) or Windows (MSVC 2017-2022) | macOS not supported |
| C++ Compiler | GCC (Linux) or MSVC (Windows) | MSVC auto-detected across Community, Professional, Enterprise, BuildTools editions |
| CUDA Toolkit | Matching PyTorch CUDA version | 11.8, 12.1, 12.4, or 12.8 |
| Build System | `ninja` | Required for parallel JIT compilation |
Dependencies
System Packages
- `gcc` / `g++` (Linux) or MSVC 2017-2022 (Windows)
- CUDA Toolkit with `nvcc` compiler
- `ninja` build system
Python Packages
- `torch` >= 2.2.0 (provides `torch.utils.cpp_extension`)
- `setuptools`
- `wheel`
Environment Variables
The following environment variables control the build process:
- `EXLLAMA_NOCOMPILE`: Set to skip precompilation during pip install (defers to JIT)
- `EXLLAMA_VERBOSE`: Set to enable verbose compilation output
- `EXLLAMA_EXT_DEBUG`: Set to compile with debug flags (`-ftime-report`, `-DTORCH_USE_CUDA_DSA`)
- `TORCH_CUDA_ARCH_LIST`: Override auto-detected GPU compute capabilities for compilation
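How such toggles typically feed into extra compiler flags can be sketched as follows. This is illustrative only; `debug_flags_from_env` is a hypothetical helper, though the flag values are the ones listed above:

```python
def debug_flags_from_env(env: dict) -> list[str]:
    """Assemble extra compiler flags from the documented debug toggle.

    EXLLAMA_EXT_DEBUG adds the debug flags named above; EXLLAMA_VERBOSE
    only affects logging, so it contributes no flags here.
    """
    flags = []
    if env.get("EXLLAMA_EXT_DEBUG"):
        flags += ["-ftime-report", "-DTORCH_USE_CUDA_DSA"]
    return flags

assert debug_flags_from_env({}) == []
assert debug_flags_from_env({"EXLLAMA_EXT_DEBUG": "1"}) == [
    "-ftime-report", "-DTORCH_USE_CUDA_DSA"]
```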
Quick Install
```shell
# Install build dependencies
pip install ninja setuptools wheel

# Install from source (triggers compilation)
pip install exllamav2 --no-binary exllamav2

# Or skip precompilation (JIT on first run)
EXLLAMA_NOCOMPILE=1 pip install exllamav2
```
Code Evidence
Extension loading with JIT fallback from `exllamav2/ext.py:106-117`:
```python
try:
    import exllamav2_ext
except ModuleNotFoundError:
    build_jit = True
except ImportError as e:
    if "undefined symbol" in str(e):
        print("\"undefined symbol\" error here usually means you are attempting to load "
              "a prebuilt extension wheel that was compiled against a different version "
              "of PyTorch than the one you are using.")
    raise e
```
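The same try/except pattern can be exercised with the standard library alone. This is a sketch; `probe_extension` is a hypothetical helper, and real inspection of the `ImportError` would check for "undefined symbol" as above:

```python
import importlib

def probe_extension(name: str) -> str:
    """Classify the outcome of importing a native extension module.

    Returns "ok" when the module imports, "jit" when it is simply absent
    (the cue to fall back to JIT compilation), and "abi-mismatch" when it
    exists but fails to load with an undefined-symbol error.
    """
    try:
        importlib.import_module(name)
        return "ok"
    except ModuleNotFoundError:
        return "jit"
    except ImportError as e:
        if "undefined symbol" in str(e):
            return "abi-mismatch"
        raise

# A module that certainly does not exist triggers the JIT path.
assert probe_extension("exllamav2_ext_nonexistent_demo") == "jit"
# The stdlib itself imports cleanly.
assert probe_extension("json") == "ok"
```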
CUDA arch list auto-detection from `exllamav2/ext.py:19-48`:
```python
def maybe_set_arch_list_env():
    if os.environ.get('TORCH_CUDA_ARCH_LIST', None):
        return
    if not torch.version.cuda:
        return
    arch_list = []
    for i in range(torch.cuda.device_count()):
        capability = torch.cuda.get_device_capability(i)
        supported_sm = [int(arch.split('_')[1])
                        for arch in torch.cuda.get_arch_list() if 'sm_' in arch]
        ...
    os.environ["TORCH_CUDA_ARCH_LIST"] = ";".join(arch_list)
```
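The elided portion builds `TORCH_CUDA_ARCH_LIST` entries such as `"8.6"` from capability tuples. A torch-free sketch of that conversion, assuming the common convention that a capability newer than anything in the PyTorch build's `sm_` list is clamped down (the helper name `arch_entry` is hypothetical):

```python
def arch_entry(capability: tuple[int, int], supported_sm: list[int]) -> str:
    """Convert a (major, minor) compute capability into a
    TORCH_CUDA_ARCH_LIST entry, clamping to the newest architecture
    the installed PyTorch build was compiled for."""
    cap = capability[0] * 10 + capability[1]
    cap = min(cap, max(supported_sm))
    return f"{cap // 10}.{cap % 10}"

# An sm_86 GPU on a build that supports up to sm_90:
assert arch_entry((8, 6), [70, 75, 80, 86, 90]) == "8.6"
# A GPU newer than the toolkit knows about is clamped down:
assert arch_entry((12, 0), [70, 75, 80, 86, 90]) == "9.0"
```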
Windows MSVC detection from `exllamav2/ext.py:123-152`:
```python
def find_msvc():
    for year in ['2022', '2019', '2017']:
        for edition in ['Community', 'Professional', 'Enterprise', 'BuildTools']:
            for root_key in ['ProgramW6432', 'ProgramFiles(x86)']:
                ...
    return None
```
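The elided body joins those loop variables into candidate install paths and returns the first that exists on disk. A sketch of the path construction, assuming the standard `Microsoft Visual Studio\<year>\<edition>` layout (the helper name `msvc_candidates` is hypothetical):

```python
import os

def msvc_candidates(program_dirs: list[str]) -> list[str]:
    """Enumerate plausible Visual Studio install roots, newest year first."""
    paths = []
    for year in ["2022", "2019", "2017"]:
        for edition in ["Community", "Professional", "Enterprise", "BuildTools"]:
            for root in program_dirs:
                paths.append(os.path.join(
                    root, "Microsoft Visual Studio", year, edition))
    return paths

cands = msvc_candidates([r"C:\Program Files", r"C:\Program Files (x86)"])
# 3 years x 4 editions x 2 roots = 24 candidates, 2022 searched first.
assert len(cands) == 24
assert "2022" in cands[0]
```

A real implementation would then probe each candidate for `cl.exe` and fall back to `return None` when nothing is found, as the snippet above shows.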
Compilation flags from `exllamav2/ext.py:176-189`:
```python
# Linux
extra_cflags = ["-Ofast"]

# Windows
extra_cflags = ["/Ox"]

# NVCC (both platforms)
extra_cuda_cflags = ["-lineinfo", "-O3"]

# ROCm
if torch.version.hip:
    extra_cuda_cflags += ["-DHIPBLAS_USE_HIP_HALF"]
```
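The platform split above can be condensed into a small selector. This is illustrative; the function names are hypothetical, but the flag values are the ones the source sets:

```python
def host_cflags(platform: str) -> list[str]:
    """Pick the host C++ optimization flag per platform, as listed above."""
    return ["/Ox"] if platform.startswith("win") else ["-Ofast"]

def cuda_cflags(is_hip: bool) -> list[str]:
    """NVCC flags, with the extra HIP define appended for ROCm builds."""
    flags = ["-lineinfo", "-O3"]
    if is_hip:
        flags.append("-DHIPBLAS_USE_HIP_HALF")
    return flags

assert host_cflags("win32") == ["/Ox"]
assert host_cflags("linux") == ["-Ofast"]
assert cuda_cflags(True)[-1] == "-DHIPBLAS_USE_HIP_HALF"
```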
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `undefined symbol` on import | Prebuilt wheel compiled against different PyTorch version | Reinstall ExLlamaV2 matching your exact PyTorch version, or install from source |
| `ModuleNotFoundError: No module named 'exllamav2_ext'` + JIT build failure | Missing C++ compiler or CUDA toolkit | Install GCC (Linux) or MSVC (Windows), ensure CUDA toolkit is in PATH |
| `ninja: command not found` | Ninja build system not installed | `pip install ninja` |
| NVCC compilation error for unsupported architecture | GPU compute capability not in `TORCH_CUDA_ARCH_LIST` | Set `TORCH_CUDA_ARCH_LIST` manually or update CUDA toolkit |
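The triage in the table can be expressed as a substring lookup. This is a toy helper (`suggest_fix` is invented for illustration), not part of ExLlamaV2:

```python
def suggest_fix(error_text: str) -> str:
    """Map a build/import error message to the remedy from the table above."""
    remedies = [
        ("undefined symbol",
         "Reinstall ExLlamaV2 matching your exact PyTorch version, "
         "or install from source."),
        ("No module named 'exllamav2_ext'",
         "Install GCC (Linux) or MSVC (Windows) and ensure the CUDA "
         "toolkit is in PATH."),
        ("ninja: command not found", "pip install ninja"),
    ]
    for needle, fix in remedies:
        if needle in error_text:
            return fix
    return "Unknown error; set EXLLAMA_VERBOSE for more detail."

assert suggest_fix("ninja: command not found") == "pip install ninja"
assert "PyTorch" in suggest_fix("ImportError: undefined symbol: _ZN3c10...")
```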
Compatibility Notes
- Pre-built wheels: Available for CUDA 11.8/12.1/12.4/12.8 on Linux and Windows with Python 3.10-3.13 and PyTorch 2.2-2.9. Using a pre-built wheel eliminates the need for a build toolchain.
- ROCm builds: `-DHIPBLAS_USE_HIP_HALF` is added automatically. `TORCH_CUDA_ARCH_LIST` auto-detection is skipped for ROCm, since `torch.version.cuda` is unset on HIP builds.
- Windows: MSVC auto-detection searches Visual Studio 2017-2022 installations. If `cl.exe` is not in PATH, the build system injects the compiler path automatically.
- JIT compilation: Triggered automatically on first import if no prebuilt module is found. Compiles 39 source files (14 C++ + 25 CUDA).