Environment:Turboderp org Exllamav2 CUDA GPU Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, GPU_Computing |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Linux or Windows environment with an NVIDIA CUDA or AMD ROCm GPU, Python 3.8+, and PyTorch >= 2.2.0 built with CUDA/ROCm support.
Description
This environment defines the core runtime requirements for all ExLlamaV2 operations: inference, quantization, and model conversion. It requires a GPU-accelerated build of PyTorch (CUDA for NVIDIA or ROCm/HIP for AMD). At import time, the library sets CUDA_MODULE_LOADING=LAZY to reduce startup overhead and globally limits PyTorch to a single CPU thread, since thread coordination costs more than it saves on the many small CPU-side operations in the inference loop.
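The two import-time adjustments can be sketched as follows. `torch.set_num_threads(1)` is the call PyTorch exposes for the thread limit; it is guarded here so the sketch also runs where a GPU build of PyTorch is not installed:

```python
import os

# Lazy CUDA module loading: kernels are loaded on first use rather than all
# at once at import, which shortens startup time.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

# Pin PyTorch to one CPU thread: the inference loop issues many small
# CPU-side ops, where threading overhead outweighs any parallel speedup.
try:
    import torch
    torch.set_num_threads(1)
except ImportError:
    pass  # sketch still runs where PyTorch is absent
```

Note that the environment variable must be set before the first CUDA context is created, which is why the library does this at import time.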
Pre-built wheels are available for CUDA 11.8, 12.1, 12.4, and 12.8 across PyTorch 2.2 through 2.9. ROCm wheels support versions 5.6, 6.0, and 6.1.
Usage
Use this environment for all ExLlamaV2 operations. Every workflow (Text Generation, Interactive Chat, Model Conversion, LoRA Inference, Vision Inference, Bulk Inference) requires a CUDA or ROCm GPU with PyTorch installed.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) or Windows 10+ | CI builds on ubuntu-22.04 and windows-2022; ROCm is Linux-only |
| Hardware | NVIDIA GPU (Pascal SM 6.0+) or AMD GPU (ROCm) | Pre-built wheels support SM 6.0 through SM 12.0 (Blackwell); Flash Attention requires Ampere SM 8.0+ |
| VRAM | Depends on model size | 7B models ~8 GB; 70B models ~24-48 GB with quantization |
| Python | >= 3.8 | CI targets 3.10, 3.11, 3.12, 3.13 |
| CLI Tools | `nvidia-smi` or `rocm-smi` | Required for auto-split and tensor-parallel GPU memory detection |
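A rough rule of thumb behind the VRAM figures above: weight memory is parameter count times bits per weight divided by 8, before KV cache and activation overhead. The helper below is an illustration, not part of ExLlamaV2:

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM estimate in GiB (excludes KV cache and overhead)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# 7B model at 8 bits per weight: ~6.5 GiB for weights alone, consistent
# with the ~8 GB figure once cache and activations are added.
print(round(estimate_weight_vram_gb(7, 8.0), 1))
# 70B model at 4 bits per weight: ~32.6 GiB, inside the 24-48 GB range.
print(round(estimate_weight_vram_gb(70, 4.0), 1))
```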
Dependencies
System Packages
- `nvidia-smi` (NVIDIA) or `rocm-smi` (AMD) for GPU memory queries
- CUDA Toolkit 11.8 / 12.1 / 12.4 / 12.8 (matching PyTorch build)
- `ninja` (required for JIT compilation of the C++/CUDA extension)
Python Packages
- `torch` >= 2.2.0 (with CUDA or ROCm support)
- `safetensors` >= 0.3.2
- `numpy` ~= 1.26.4
- `pandas`
- `fastparquet`
- `pygments`
- `websockets`
- `regex`
- `tokenizers`
- `rich`
- `pillow` >= 9.1.0
Credentials
No API credentials are required for core ExLlamaV2 functionality.
The following environment variables control runtime behavior:
- `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible to PyTorch
- `CUDA_MODULE_LOADING`: Automatically set to `LAZY` by ExLlamaV2 at import time
- `TORCH_CUDA_ARCH_LIST`: Auto-detected for JIT compilation; can be set manually to override
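These variables take effect only if set before `torch` (and therefore `exllamav2`) is imported. A minimal sketch; the specific values are assumptions for illustration:

```python
import os

# Restrict PyTorch to GPUs 0 and 1; visible devices are re-indexed, so
# cuda:0 then refers to the first device in this list.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Optionally pin the JIT compilation target (here Ampere, SM 8.0) rather
# than relying on auto-detection.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"

# Only now: import torch / import exllamav2
```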
Quick Install
# Install PyTorch with CUDA support (example for CUDA 12.4)
pip install "torch>=2.2.0" --index-url https://download.pytorch.org/whl/cu124
# Install ExLlamaV2 and dependencies
pip install exllamav2
# Or install the dependencies manually for a source build
pip install "torch>=2.2.0" "safetensors>=0.3.2" ninja numpy pandas fastparquet pygments websockets regex tokenizers rich "pillow>=9.1.0"
Code Evidence
Python version check from `exllamav2/model.py:4-8`:
min_version = (3, 8)
if sys.version_info < min_version:
    print(f" ## Warning: this project requires Python {min_version[0]}.{min_version[1]} or higher.")
CUDA/ROCm requirement check from `exllamav2/model.py:22-25`:
if not (torch.version.cuda or torch.version.hip):
    print("")
    print(f" ## Warning: The installed version of PyTorch is {torch.__version__} and does not support CUDA or ROCm.")
    print("")
CUDA lazy module loading from `exllamav2/model.py:11`:
os.environ["CUDA_MODULE_LOADING"] = "LAZY"
GPU memory detection assertion from `exllamav2/util.py:305-331`:
def get_all_gpu_memory():
    ...
    assert gpu_memory, \
        "Unable to read available VRAM from either nvidia-smi or rocm-smi"
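For context, `nvidia-smi --query-gpu=memory.total,memory.free --format=csv,noheader,nounits` prints one CSV line per GPU in MiB, which can be parsed as below. This parsing function is illustrative, not ExLlamaV2's actual code:

```python
def parse_nvidia_smi_csv(output: str) -> list[dict]:
    """Parse 'memory.total, memory.free' CSV lines (MiB) into per-GPU dicts."""
    gpus = []
    for line in output.strip().splitlines():
        total, free = (int(x) for x in line.split(","))
        gpus.append({"total_mib": total, "free_mib": free})
    return gpus

# On a live system the text would come from:
#   subprocess.check_output(["nvidia-smi",
#       "--query-gpu=memory.total,memory.free",
#       "--format=csv,noheader,nounits"], text=True)
sample = "24576, 20480\n11264, 9216\n"
print(parse_nvidia_smi_csv(sample))
```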
GPU split validation from `exllamav2/model_init.py:171-173`:
if len(split) > torch.cuda.device_count():
    print(f" ## Error: Too many entries in gpu_split. {torch.cuda.device_count()} CUDA devices are available.")
    sys.exit()
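The `--gpu_split` argument is a comma-separated list of per-GPU VRAM allotments in GB. A hypothetical validator mirroring the check above, with the device count passed in so the sketch runs without a GPU:

```python
def parse_gpu_split(spec: str, device_count: int) -> list[float]:
    """Parse a gpu_split string like '10,24' and validate the entry count."""
    split = [float(x) for x in spec.split(",")]
    if len(split) > device_count:
        raise ValueError(
            f"Too many entries in gpu_split. {device_count} CUDA devices are available."
        )
    return split

# Two entries for a two-GPU system: 10 GB on cuda:0, 24 GB on cuda:1.
print(parse_gpu_split("10,24", device_count=2))
```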
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Warning: The installed version of PyTorch does not support CUDA or ROCm` | PyTorch installed without GPU support | Reinstall PyTorch with CUDA: `pip install torch --index-url https://download.pytorch.org/whl/cu124` |
| `Unable to read available VRAM from either nvidia-smi or rocm-smi` | GPU management tools not installed or not in PATH | Install NVIDIA drivers (includes nvidia-smi) or ROCm toolkit (includes rocm-smi) |
| `Too many entries in gpu_split` | More GPUs specified in split than physically available | Check `torch.cuda.device_count()` and adjust `--gpu_split` parameter |
| `Insufficient VRAM for model and cache` | Model and cache exceed total GPU memory | Use smaller model, quantized cache (Q4), or add more GPUs via auto-split |
| `CUDA out of memory` / `HIP out of memory` | GPU memory exhausted during loading or inference | Reduce `max_seq_len`, use quantized cache, or distribute across more GPUs |
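To see why a shorter `max_seq_len` or a quantized cache relieves out-of-memory errors: KV cache size scales linearly with both sequence length and bytes per element. The dimensions below are Llama-2-7B-like assumptions chosen for illustration:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_el: float) -> float:
    """KV cache size in GiB: K and V tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el / 2**30

# 32 layers, 32 KV heads, head_dim 128, 4096-token context:
fp16 = kv_cache_gib(32, 32, 128, 4096, 2)    # FP16 cache
q4 = kv_cache_gib(32, 32, 128, 4096, 0.5)    # ~Q4 quantized cache
print(round(fp16, 1), round(q4, 1))
```

Quantizing the cache from FP16 to ~4 bits cuts its footprint by roughly 4x, which is often the difference between fitting a long context and an OOM.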
Compatibility Notes
- ROCm (AMD): Supported via HIP. Pre-built wheels available for ROCm 5.6, 6.0, 6.1 (Linux only). The `-DHIPBLAS_USE_HIP_HALF` flag is added automatically for ROCm builds. SDPA is noted as "unreliable on ROCm" in the source code.
- Windows: Supported with MSVC compiler. The build system auto-detects MSVC 2017-2022 across Community, Professional, Enterprise, and BuildTools editions.
- Multi-GPU: Peer-to-peer GPU copy is tested at runtime. If direct GPU-to-GPU copy fails (known issue on some driver versions), data is routed through CPU RAM automatically.
- Pre-Ampere GPUs (SM < 8.0): Supported for inference but cannot use Flash Attention. Use xformers or Torch SDPA as attention backend instead.
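A minimal sketch of choosing an attention backend from the compute capability reported by `torch.cuda.get_device_capability()`; the selection logic is illustrative, not ExLlamaV2's actual dispatch:

```python
def pick_attention_backend(capability: tuple[int, int]) -> str:
    """Flash Attention requires Ampere (SM 8.0) or newer; fall back otherwise."""
    return "flash_attn" if capability >= (8, 0) else "torch_sdpa"

# On a live system: capability = torch.cuda.get_device_capability(0)
print(pick_attention_backend((8, 6)))   # Ampere (e.g. RTX 30xx)
print(pick_attention_backend((7, 5)))   # Turing: pre-Ampere fallback
```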
Related Pages
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2Config
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2Cache
- Implementation:Turboderp_org_Exllamav2_Load_Autosplit
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2DynamicGenerator_Init
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2DynamicGenerator_Generate
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2StreamingGenerator
- Implementation:Turboderp_org_Exllamav2_Measure_Quant
- Implementation:Turboderp_org_Exllamav2_Quant_Layers
- Implementation:Turboderp_org_Exllamav2_Compile_Model
- Implementation:Turboderp_org_Exllamav2_Model_Init