Environment:Turboderp org Exllamav2 CUDA GPU Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, GPU_Computing |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Linux or Windows environment with an NVIDIA CUDA or AMD ROCm GPU, Python 3.8+, and PyTorch >= 2.2.0 built with CUDA/ROCm support.
Description
This environment defines the core runtime requirements for all ExLlamaV2 operations: inference, quantization, and model conversion. It requires a GPU-accelerated build of PyTorch (CUDA for NVIDIA or ROCm/HIP for AMD). At import time, the library sets CUDA_MODULE_LOADING=LAZY to reduce startup overhead and globally limits PyTorch to a single CPU thread, since thread coordination costs more than it saves on the many small CPU-side operations in the inference loop.
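The two import-time adjustments can be sketched as follows. `torch.set_num_threads(1)` is the call PyTorch exposes for the thread limit; it is guarded here so the sketch also runs where a GPU build of PyTorch is not installed:

```python
import os

# Lazy CUDA module loading: kernels are loaded on first use rather than all
# at once at import, which shortens startup time.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

# Pin PyTorch to one CPU thread: the inference loop issues many small
# CPU-side ops, where threading overhead outweighs any parallel speedup.
try:
    import torch
    torch.set_num_threads(1)
except ImportError:
    pass  # sketch still runs where PyTorch is absent
```

Note that the environment variable must be set before the first CUDA context is created, which is why the library does this at import time.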
Pre-built wheels are available for CUDA 11.8, 12.1, 12.4, and 12.8 across PyTorch 2.2 through 2.9. ROCm wheels support versions 5.6, 6.0, and 6.1.
Usage
Use this environment for all ExLlamaV2 operations. Every workflow (Text Generation, Interactive Chat, Model Conversion, LoRA Inference, Vision Inference, Bulk Inference) requires a CUDA or ROCm GPU with PyTorch installed.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) or Windows 10+ | CI builds on ubuntu-22.04 and windows-2022; ROCm is Linux-only |
| Hardware | NVIDIA GPU (Pascal SM 6.0+) or AMD GPU (ROCm) | Pre-built wheels support SM 6.0 through SM 12.0 (Blackwell); Flash Attention requires Ampere SM 8.0+ |
| VRAM | Depends on model size | 7B models ~8 GB; 70B models ~24-48 GB with quantization |
| Python | >= 3.8 | CI targets 3.10, 3.11, 3.12, 3.13 |
| CLI Tools | `nvidia-smi` or `rocm-smi` | Required for auto-split and tensor-parallel GPU memory detection |
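A rough rule of thumb behind the VRAM figures above: weight memory is parameter count times bits per weight divided by 8, before KV cache and activation overhead. The helper below is an illustration, not part of ExLlamaV2:

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM estimate in GiB (excludes KV cache and overhead)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# 7B model at 8 bits per weight: ~6.5 GiB for weights alone, consistent
# with the ~8 GB figure once cache and activations are added.
print(round(estimate_weight_vram_gb(7, 8.0), 1))
# 70B model at 4 bits per weight: ~32.6 GiB, inside the 24-48 GB range.
print(round(estimate_weight_vram_gb(70, 4.0), 1))
```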
Dependencies
System Packages
- `nvidia-smi` (NVIDIA) or `rocm-smi` (AMD) for GPU memory queries
- CUDA Toolkit 11.8 / 12.1 / 12.4 / 12.8 (matching PyTorch build)
- `ninja` (required for JIT compilation of the C++/CUDA extension)
Python Packages
- `torch` >= 2.2.0 (with CUDA or ROCm support)
- `safetensors` >= 0.3.2
- `numpy` ~= 1.26.4
- `pandas`
- `fastparquet`
- `pygments`
- `websockets`
- `regex`
- `tokenizers`
- `rich`
- `pillow` >= 9.1.0
Credentials
No API credentials are required for core ExLlamaV2 functionality.
The following environment variables control runtime behavior:
- `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible to PyTorch
- `CUDA_MODULE_LOADING`: Automatically set to `LAZY` by ExLlamaV2 at import time
- `TORCH_CUDA_ARCH_LIST`: Auto-detected for JIT compilation; can be set manually to override
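These variables take effect only if set before `torch` (and therefore `exllamav2`) is imported. A minimal sketch; the specific values are assumptions for illustration:

```python
import os

# Restrict PyTorch to GPUs 0 and 1; visible devices are re-indexed, so
# cuda:0 then refers to the first device in this list.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Optionally pin the JIT compilation target (here Ampere, SM 8.0) rather
# than relying on auto-detection.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"

# Only now: import torch / import exllamav2
```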
Quick Install
# Install PyTorch with CUDA support (example for CUDA 12.4)
pip install "torch>=2.2.0" --index-url https://download.pytorch.org/whl/cu124
# Install ExLlamaV2 and dependencies
pip install exllamav2
# Or install the dependencies manually for a source build
pip install "torch>=2.2.0" "safetensors>=0.3.2" ninja numpy pandas fastparquet pygments websockets regex tokenizers rich "pillow>=9.1.0"
Code Evidence
Python version check from `exllamav2/model.py:4-8`:
min_version = (3, 8)
if sys.version_info < min_version:
    print(f" ## Warning: this project requires Python {min_version[0]}.{min_version[1]} or higher.")
CUDA/ROCm requirement check from `exllamav2/model.py:22-25`:
if not (torch.version.cuda or torch.version.hip):
    print("")
    print(f" ## Warning: The installed version of PyTorch is {torch.__version__} and does not support CUDA or ROCm.")
    print("")
CUDA lazy module loading from `exllamav2/model.py:11`:
os.environ["CUDA_MODULE_LOADING"] = "LAZY"
GPU memory detection assertion from `exllamav2/util.py:305-331`:
def get_all_gpu_memory():
    ...
    assert gpu_memory, \
        "Unable to read available VRAM from either nvidia-smi or rocm-smi"
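For context, `nvidia-smi --query-gpu=memory.total,memory.free --format=csv,noheader,nounits` prints one CSV line per GPU in MiB, which can be parsed as below. This parsing function is illustrative, not ExLlamaV2's actual code:

```python
def parse_nvidia_smi_csv(output: str) -> list[dict]:
    """Parse 'memory.total, memory.free' CSV lines (MiB) into per-GPU dicts."""
    gpus = []
    for line in output.strip().splitlines():
        total, free = (int(x) for x in line.split(","))
        gpus.append({"total_mib": total, "free_mib": free})
    return gpus

# On a live system the text would come from:
#   subprocess.check_output(["nvidia-smi",
#       "--query-gpu=memory.total,memory.free",
#       "--format=csv,noheader,nounits"], text=True)
sample = "24576, 20480\n11264, 9216\n"
print(parse_nvidia_smi_csv(sample))
```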
GPU split validation from `exllamav2/model_init.py:171-173`:
if len(split) > torch.cuda.device_count():
    print(f" ## Error: Too many entries in gpu_split. {torch.cuda.device_count()} CUDA devices are available.")
    sys.exit()
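The `--gpu_split` argument is a comma-separated list of per-GPU VRAM allotments in GB. A hypothetical validator mirroring the check above, with the device count passed in so the sketch runs without a GPU:

```python
def parse_gpu_split(spec: str, device_count: int) -> list[float]:
    """Parse a gpu_split string like '10,24' and validate the entry count."""
    split = [float(x) for x in spec.split(",")]
    if len(split) > device_count:
        raise ValueError(
            f"Too many entries in gpu_split. {device_count} CUDA devices are available."
        )
    return split

# Two entries for a two-GPU system: 10 GB on cuda:0, 24 GB on cuda:1.
print(parse_gpu_split("10,24", device_count=2))
```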
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Warning: The installed version of PyTorch does not support CUDA or ROCm` | PyTorch installed without GPU support | Reinstall PyTorch with CUDA: `pip install torch --index-url https://download.pytorch.org/whl/cu124` |
| `Unable to read available VRAM from either nvidia-smi or rocm-smi` | GPU management tools not installed or not in PATH | Install NVIDIA drivers (includes nvidia-smi) or ROCm toolkit (includes rocm-smi) |
| `Too many entries in gpu_split` | More GPUs specified in split than physically available | Check `torch.cuda.device_count()` and adjust `--gpu_split` parameter |
| `Insufficient VRAM for model and cache` | Model and cache exceed total GPU memory | Use smaller model, quantized cache (Q4), or add more GPUs via auto-split |
| `CUDA out of memory` / `HIP out of memory` | GPU memory exhausted during loading or inference | Reduce `max_seq_len`, use quantized cache, or distribute across more GPUs |
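To see why a shorter `max_seq_len` or a quantized cache relieves out-of-memory errors: KV cache size scales linearly with both sequence length and bytes per element. The dimensions below are Llama-2-7B-like assumptions chosen for illustration:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_el: float) -> float:
    """KV cache size in GiB: K and V tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el / 2**30

# 32 layers, 32 KV heads, head_dim 128, 4096-token context:
fp16 = kv_cache_gib(32, 32, 128, 4096, 2)    # FP16 cache
q4 = kv_cache_gib(32, 32, 128, 4096, 0.5)    # ~Q4 quantized cache
print(round(fp16, 1), round(q4, 1))
```

Quantizing the cache from FP16 to ~4 bits cuts its footprint by roughly 4x, which is often the difference between fitting a long context and an OOM.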
Compatibility Notes
- ROCm (AMD): Supported via HIP. Pre-built wheels available for ROCm 5.6, 6.0, 6.1 (Linux only). The `-DHIPBLAS_USE_HIP_HALF` flag is added automatically for ROCm builds. SDPA is noted as "unreliable on ROCm" in the source code.
- Windows: Supported with MSVC compiler. The build system auto-detects MSVC 2017-2022 across Community, Professional, Enterprise, and BuildTools editions.
- Multi-GPU: Peer-to-peer GPU copy is tested at runtime. If direct GPU-to-GPU copy fails (known issue on some driver versions), data is routed through CPU RAM automatically.
- Pre-Ampere GPUs (SM < 8.0): Supported for inference but cannot use Flash Attention. Use xformers or Torch SDPA as attention backend instead.
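A minimal sketch of choosing an attention backend from the compute capability reported by `torch.cuda.get_device_capability()`; the selection logic is illustrative, not ExLlamaV2's actual dispatch:

```python
def pick_attention_backend(capability: tuple[int, int]) -> str:
    """Flash Attention requires Ampere (SM 8.0) or newer; fall back otherwise."""
    return "flash_attn" if capability >= (8, 0) else "torch_sdpa"

# On a live system: capability = torch.cuda.get_device_capability(0)
print(pick_attention_backend((8, 6)))   # Ampere (e.g. RTX 30xx)
print(pick_attention_backend((7, 5)))   # Turing: pre-Ampere fallback
```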
Related Pages
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2Config
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2Cache
- Implementation:Turboderp_org_Exllamav2_Load_Autosplit
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2DynamicGenerator_Init
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2DynamicGenerator_Generate
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2StreamingGenerator
- Implementation:Turboderp_org_Exllamav2_Measure_Quant
- Implementation:Turboderp_org_Exllamav2_Quant_Layers
- Implementation:Turboderp_org_Exllamav2_Compile_Model
- Implementation:Turboderp_org_Exllamav2_Model_Init