Environment:Predibase Lorax CUDA GPU Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-08 02:30 GMT |
Overview
NVIDIA CUDA GPU runtime environment with compute capability 7.5+ (Turing or newer), CUDA 12.4, and PyTorch 2.4+ for LoRAX inference serving.
Description
This environment provides the GPU acceleration context required to run the LoRAX inference server. LoRAX detects the GPU platform at startup via `torch.version.cuda`, `torch.version.hip` (ROCm), or `intel_extension_for_pytorch` (XPU) and selects the appropriate backend. The primary target is NVIDIA CUDA GPUs with specific compute capability tiers gating different features:
- SM 7.5 (Turing): Minimum for Flash Attention V1 support (e.g., T4, RTX 2080)
- SM 8.0+ (Ampere): Required for Flash Attention V2, ExLLaMA kernels, Punica SGMV, EETQ, AWQ (e.g., A100, A10G)
- SM 8.9+ (Ada Lovelace): Required for FP8 quantization (e.g., RTX 4090, L4)
- SM 9.0 (Hopper): Full support including FP8 and all kernel optimizations (e.g., H100)
AMD ROCm is supported for MI210/MI250/MI300 GPUs. Intel XPU is experimentally supported.
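For illustration, the capability gating above can be collapsed into a single lookup helper. The function name and feature labels are hypothetical (LoRAX performs these checks piecemeal at import time, not through one helper):

```python
def supported_features(major: int, minor: int) -> set[str]:
    """Map a CUDA compute capability (major, minor) to the feature
    tiers described above. Illustrative sketch only."""
    cap = (major, minor)
    features = set()
    if cap >= (7, 5):  # Turing
        features.add("flash-attention-v1")
    if cap >= (8, 0):  # Ampere
        features.update({"flash-attention-v2", "exllama", "punica-sgmv", "eetq", "awq"})
    if cap >= (8, 9):  # Ada Lovelace / Hopper
        features.add("fp8")
    return features

# A T4 (SM 7.5) gets Flash Attention V1 only; an H100 (SM 9.0) gets every tier.
```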
Usage
This environment is a mandatory prerequisite for running any LoRAX model serving workflow. All Flash Attention, paged attention (vLLM), custom CUDA kernels, and LoRA kernel operations require GPU acceleration. Without a compatible GPU, the server falls back to CPU mode with severely degraded performance and missing features.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Docker base image: `nvidia/cuda:12.4.0-base-ubuntu22.04` |
| Hardware | NVIDIA GPU with SM 7.5+ | Minimum: T4 (16GB); Recommended: A100 (40/80GB) or H100 (80GB) |
| VRAM | 16GB minimum | Model-dependent; 7B models need ~16GB, 70B models need 80GB+ or multi-GPU |
| CUDA Version | 12.4 | Docker build uses `nvidia/cuda:12.4.0-devel-ubuntu22.04` |
| CUDA Driver | 550+ | Compatible with CUDA 12.4 toolkit |
| Disk | 50GB+ SSD | Model weights cached under `HUGGINGFACE_HUB_CACHE` |
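The VRAM figures in the table follow from fp16 weights taking roughly 2 bytes per parameter, plus KV-cache and runtime overhead. The helper below and its 15% overhead factor are illustrative assumptions, not LoRAX code:

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: int = 2,
                     overhead: float = 1.15) -> float:
    """Rough fp16 serving footprint: weights plus ~15% headroom for
    KV cache, CUDA context, and activations. Illustrative only."""
    weights_gb = params_billions * bytes_per_param  # ~2 GB per billion params
    return round(weights_gb * overhead, 1)

# A 7B model lands near the 16 GB minimum; a 70B model exceeds a single 80 GB GPU.
```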
Dependencies
System Packages
- `nvidia-cuda-runtime-cu12` = 12.1.105
- `nvidia-cudnn-cu12` = 9.1.0.70
- `nvidia-nccl-cu12` = 2.20.5
- `ninja-build` (for kernel compilation)
- `cmake` >= 3.30.0 (for vLLM kernel build)
Python Packages
- `torch` >= 2.4.0 (pinned to 2.6.0 in requirements.txt)
- `triton` = 3.0.0 (Linux x86_64 only)
- `flash-attn` (V1 or V2 CUDA bindings)
- `flashinfer` = 0.1.6 (cu124, optional backend)
- `vllm` (custom ops for paged attention)
CUDA Kernel Packages (built from source in Docker)
- `custom_kernels` (SM 8.0, compute_80)
- `exllama_kernels` (SM 8.0+, for GPTQ V1)
- `exllamav2_kernels` (SM 8.0+, for GPTQ V2)
- `punica_kernels` (SM 8.0+, for LoRA SGMV/BGMV)
- `EETQ` (SM 8.0+, for 8-bit quantization)
- `vllm_flash_attn` (SM 7.0-9.0+)
Credentials
No GPU-specific credentials required. See Environment:Predibase_Lorax_Model_Source_Credentials for model access tokens.
Quick Install
```shell
# Recommended: use the official Docker image
docker pull ghcr.io/predibase/lorax:latest

# Manual install (requires CUDA 12.4 toolkit pre-installed):
pip install torch==2.6.0 triton==3.0.0
pip install flash-attn --no-build-isolation
pip install flashinfer==0.1.6 -i https://flashinfer.ai/whl/cu124/torch2.4/

# Build custom kernels (requires ninja, cmake):
cd server && make install install-flash-attention-v2-cuda
```
Code Evidence
System detection from `server/lorax_server/utils/import_utils.py:26-44`:
```python
SYSTEM = None
if torch.version.hip is not None:
    SYSTEM = "rocm"
elif torch.version.cuda is not None and torch.cuda.is_available():
    SYSTEM = "cuda"
elif is_xpu_available():
    SYSTEM = "xpu"
else:
    SYSTEM = "cpu"
```
GPU capability validation from `server/lorax_server/utils/flash_attn.py:57-93`:
```python
if SYSTEM in {"cuda", "rocm"}:
    if not torch.cuda.is_available():
        raise ImportError("CUDA is not available")
    major, minor = torch.cuda.get_device_capability()
    is_sm75 = major == 7 and minor == 5
    is_sm8x = major == 8 and minor >= 0
    is_sm90 = major == 9 and minor == 0

    # Flash Attention V2 requires SM 8.0+
    if SYSTEM == "cuda" and not (is_sm8x or is_sm90):
        raise ImportError(
            f"GPU with CUDA capability {major} {minor} is not supported for Flash Attention V2"
        )
```
FP8 support detection from `server/lorax_server/utils/torch_utils.py:17-22`:
```python
def is_fp8_supported():
    return (
        torch.cuda.is_available()
        and (torch.cuda.get_device_capability()[0] >= 9)
        or (
            torch.cuda.get_device_capability()[0] == 8
            and torch.cuda.get_device_capability()[1] >= 9
        )
    )
```
Memory fraction control from `server/lorax_server/utils/dist.py:8-14`:
```python
RANK = int(os.getenv("RANK", "0"))
WORLD_SIZE = int(os.getenv("WORLD_SIZE", "1"))
MEMORY_FRACTION = float(os.getenv("CUDA_MEMORY_FRACTION", "1.0"))
MEMORY_WIGGLE_ROOM = float(os.getenv("MEMORY_WIGGLE_ROOM", "0.9"))
```
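These two values bound the per-process memory budget: the fraction caps total usage, and the wiggle room (default 90%) leaves headroom against fragmentation. The helper below sketches how a usable-byte budget could be derived from them; the function and formula are assumptions based on the variable names, not confirmed LoRAX behavior:

```python
import os

def usable_memory_bytes(total_bytes: int) -> int:
    """Apply CUDA_MEMORY_FRACTION, then keep MEMORY_WIGGLE_ROOM as
    headroom. Illustrative sketch only; not LoRAX's actual accounting."""
    fraction = float(os.getenv("CUDA_MEMORY_FRACTION", "1.0"))
    wiggle = float(os.getenv("MEMORY_WIGGLE_ROOM", "0.9"))
    return int(total_bytes * fraction * wiggle)

# On a 40 GB A100 with defaults, ~36 GB would be treated as usable.
```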
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: CUDA is not available` | No NVIDIA GPU detected or drivers not installed | Install NVIDIA drivers and CUDA toolkit 12.4+ |
| `GPU with CUDA capability X Y is not supported for Flash Attention V2` | GPU compute capability < 8.0 (pre-Ampere) | Use GPU with SM 8.0+ (A100, A10G, etc.) or fall back to Flash Attention V1 |
| `Flash Attention is not installed` | Missing flash_attn CUDA bindings | Install via `make install-flash-attention-v2-cuda` or use official Docker image |
| `Could not import vllm paged attention` | vLLM custom ops not built | Rebuild with matching CUDA toolkit or use official Docker image |
| `AssertionError: Each process is one gpu` | `WORLD_SIZE` exceeds available GPU count | Set `WORLD_SIZE` <= number of available GPUs |
Compatibility Notes
- NVIDIA GPUs: Full support. SM 7.5 minimum (Flash Attn V1), SM 8.0 recommended (Flash Attn V2 + all quantization methods).
- AMD GPUs (ROCm): Supported for MI210/MI250 (gfx90a) and MI300 (gfx942). Flash Attention V2 uses Composable Kernel (CK) or Triton backend selectable via `ROCM_USE_FLASH_ATTN_V2_TRITON`.
- Intel XPU: Experimental. Requires `intel_extension_for_pytorch`. Does not support window attention.
- CPU: Fallback mode with no GPU acceleration. Missing Flash Attention, paged attention, and all CUDA kernels.
- Multi-GPU: Tensor parallelism supported via NCCL. Set `WORLD_SIZE` and `RANK` environment variables.
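The one-process-per-GPU convention behind the `AssertionError` in Common Errors can be sketched as a rank-to-device mapping. The helper below is hypothetical; LoRAX's launcher handles this assignment internally:

```python
def assign_device(rank: int, world_size: int, gpu_count: int) -> str:
    """Map a tensor-parallel rank to a CUDA device string, enforcing
    the one-process-per-GPU invariant. Illustrative sketch only."""
    assert world_size <= gpu_count, "Each process is one gpu"
    assert 0 <= rank < world_size
    return f"cuda:{rank}"

# RANK=1, WORLD_SIZE=2 on a 2-GPU node maps to "cuda:1".
```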