Environment:Predibase Lorax CUDA GPU Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-08 02:30 GMT |
Overview
NVIDIA CUDA GPU runtime environment with compute capability 7.5+ (Turing or newer), CUDA 12.4, and PyTorch 2.4+ for LoRAX inference serving.
Description
This environment provides the GPU acceleration context required to run the LoRAX inference server. LoRAX detects the GPU platform at startup via `torch.version.cuda`, `torch.version.hip` (ROCm), or `intel_extension_for_pytorch` (XPU) and selects the appropriate backend. The primary target is NVIDIA CUDA GPUs with specific compute capability tiers gating different features:
- SM 7.5 (Turing): Minimum for Flash Attention V1 support (e.g., T4, RTX 2080)
- SM 8.0+ (Ampere): Required for Flash Attention V2, ExLLaMA kernels, Punica SGMV, EETQ, AWQ (e.g., A100, A10G)
- SM 8.9+ (Ada Lovelace): Required for FP8 quantization (e.g., RTX 4090, L4)
- SM 9.0 (Hopper): Full support including FP8 and all kernel optimizations (e.g., H100)
AMD ROCm is supported for MI210/MI250/MI300 GPUs. Intel XPU is experimentally supported.
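For illustration, the capability gating above can be collapsed into a single lookup helper. The function name and feature labels are hypothetical (LoRAX performs these checks piecemeal at import time, not through one helper):

```python
def supported_features(major: int, minor: int) -> set[str]:
    """Map a CUDA compute capability (major, minor) to the feature
    tiers described above. Illustrative sketch only."""
    cap = (major, minor)
    features = set()
    if cap >= (7, 5):  # Turing
        features.add("flash-attention-v1")
    if cap >= (8, 0):  # Ampere
        features.update({"flash-attention-v2", "exllama", "punica-sgmv", "eetq", "awq"})
    if cap >= (8, 9):  # Ada Lovelace / Hopper
        features.add("fp8")
    return features

# A T4 (SM 7.5) gets Flash Attention V1 only; an H100 (SM 9.0) gets every tier.
```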
Usage
This environment is a mandatory prerequisite for running any LoRAX model serving workflow. All Flash Attention, paged attention (vLLM), custom CUDA kernels, and LoRA kernel operations require GPU acceleration. Without a compatible GPU, the server falls back to CPU mode with severely degraded performance and missing features.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Docker base image: `nvidia/cuda:12.4.0-base-ubuntu22.04` |
| Hardware | NVIDIA GPU with SM 7.5+ | Minimum: T4 (16GB); Recommended: A100 (40/80GB) or H100 (80GB) |
| VRAM | 16GB minimum | Model-dependent; 7B models need ~16GB, 70B models need 80GB+ or multi-GPU |
| CUDA Version | 12.4 | Docker build uses `nvidia/cuda:12.4.0-devel-ubuntu22.04` |
| CUDA Driver | 550+ | Compatible with CUDA 12.4 toolkit |
| Disk | 50GB+ SSD | Model weights cached under `HUGGINGFACE_HUB_CACHE` |
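The VRAM figures in the table follow from fp16 weights taking roughly 2 bytes per parameter, plus KV-cache and runtime overhead. The helper below and its 15% overhead factor are illustrative assumptions, not LoRAX code:

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: int = 2,
                     overhead: float = 1.15) -> float:
    """Rough fp16 serving footprint: weights plus ~15% headroom for
    KV cache, CUDA context, and activations. Illustrative only."""
    weights_gb = params_billions * bytes_per_param  # ~2 GB per billion params
    return round(weights_gb * overhead, 1)

# A 7B model lands near the 16 GB minimum; a 70B model exceeds a single 80 GB GPU.
```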
Dependencies
System Packages
- `nvidia-cuda-runtime-cu12` = 12.1.105
- `nvidia-cudnn-cu12` = 9.1.0.70
- `nvidia-nccl-cu12` = 2.20.5
- `ninja-build` (for kernel compilation)
- `cmake` >= 3.30.0 (for vLLM kernel build)
Python Packages
- `torch` >= 2.4.0 (pinned to 2.6.0 in requirements.txt)
- `triton` = 3.0.0 (Linux x86_64 only)
- `flash-attn` (V1 or V2 CUDA bindings)
- `flashinfer` = 0.1.6 (cu124, optional backend)
- `vllm` (custom ops for paged attention)
CUDA Kernel Packages (built from source in Docker)
- `custom_kernels` (SM 8.0, compute_80)
- `exllama_kernels` (SM 8.0+, for GPTQ V1)
- `exllamav2_kernels` (SM 8.0+, for GPTQ V2)
- `punica_kernels` (SM 8.0+, for LoRA SGMV/BGMV)
- `EETQ` (SM 8.0+, for 8-bit quantization)
- `vllm_flash_attn` (SM 7.0-9.0+)
Credentials
No GPU-specific credentials required. See Environment:Predibase_Lorax_Model_Source_Credentials for model access tokens.
Quick Install
```shell
# Recommended: use the official Docker image
docker pull ghcr.io/predibase/lorax:latest

# Manual install (requires CUDA 12.4 toolkit pre-installed):
pip install torch==2.6.0 triton==3.0.0
pip install flash-attn --no-build-isolation
pip install flashinfer==0.1.6 -i https://flashinfer.ai/whl/cu124/torch2.4/

# Build custom kernels (requires ninja, cmake):
cd server && make install install-flash-attention-v2-cuda
```
Code Evidence
System detection from `server/lorax_server/utils/import_utils.py:26-44`:
```python
SYSTEM = None
if torch.version.hip is not None:
    SYSTEM = "rocm"
elif torch.version.cuda is not None and torch.cuda.is_available():
    SYSTEM = "cuda"
elif is_xpu_available():
    SYSTEM = "xpu"
else:
    SYSTEM = "cpu"
```
GPU capability validation from `server/lorax_server/utils/flash_attn.py:57-93`:
```python
if SYSTEM in {"cuda", "rocm"}:
    if not torch.cuda.is_available():
        raise ImportError("CUDA is not available")
    major, minor = torch.cuda.get_device_capability()
    is_sm75 = major == 7 and minor == 5
    is_sm8x = major == 8 and minor >= 0
    is_sm90 = major == 9 and minor == 0

    # Flash Attention V2 requires SM 8.0+
    if SYSTEM == "cuda" and not (is_sm8x or is_sm90):
        raise ImportError(
            f"GPU with CUDA capability {major} {minor} is not supported for Flash Attention V2"
        )
```
FP8 support detection from `server/lorax_server/utils/torch_utils.py:17-22`:
```python
def is_fp8_supported():
    return (
        torch.cuda.is_available()
        and (torch.cuda.get_device_capability()[0] >= 9)
        or (
            torch.cuda.get_device_capability()[0] == 8
            and torch.cuda.get_device_capability()[1] >= 9
        )
    )
```
Memory fraction control from `server/lorax_server/utils/dist.py:8-14`:
```python
RANK = int(os.getenv("RANK", "0"))
WORLD_SIZE = int(os.getenv("WORLD_SIZE", "1"))
MEMORY_FRACTION = float(os.getenv("CUDA_MEMORY_FRACTION", "1.0"))
MEMORY_WIGGLE_ROOM = float(os.getenv("MEMORY_WIGGLE_ROOM", "0.9"))
```
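These two values bound the per-process memory budget: the fraction caps total usage, and the wiggle room (default 90%) leaves headroom against fragmentation. The helper below sketches how a usable-byte budget could be derived from them; the function and formula are assumptions based on the variable names, not confirmed LoRAX behavior:

```python
import os

def usable_memory_bytes(total_bytes: int) -> int:
    """Apply CUDA_MEMORY_FRACTION, then keep MEMORY_WIGGLE_ROOM as
    headroom. Illustrative sketch only; not LoRAX's actual accounting."""
    fraction = float(os.getenv("CUDA_MEMORY_FRACTION", "1.0"))
    wiggle = float(os.getenv("MEMORY_WIGGLE_ROOM", "0.9"))
    return int(total_bytes * fraction * wiggle)

# On a 40 GB A100 with defaults, ~36 GB would be treated as usable.
```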
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: CUDA is not available` | No NVIDIA GPU detected or drivers not installed | Install NVIDIA drivers and CUDA toolkit 12.4+ |
| `GPU with CUDA capability X Y is not supported for Flash Attention V2` | GPU compute capability < 8.0 (pre-Ampere) | Use GPU with SM 8.0+ (A100, A10G, etc.) or fall back to Flash Attention V1 |
| `Flash Attention is not installed` | Missing flash_attn CUDA bindings | Install via `make install-flash-attention-v2-cuda` or use official Docker image |
| `Could not import vllm paged attention` | vLLM custom ops not built | Rebuild with matching CUDA toolkit or use official Docker image |
| `AssertionError: Each process is one gpu` | `WORLD_SIZE` exceeds available GPU count | Set `WORLD_SIZE` <= number of available GPUs |
Compatibility Notes
- NVIDIA GPUs: Full support. SM 7.5 minimum (Flash Attn V1), SM 8.0 recommended (Flash Attn V2 + all quantization methods).
- AMD GPUs (ROCm): Supported for MI210/MI250 (gfx90a) and MI300 (gfx942). Flash Attention V2 uses Composable Kernel (CK) or Triton backend selectable via `ROCM_USE_FLASH_ATTN_V2_TRITON`.
- Intel XPU: Experimental. Requires `intel_extension_for_pytorch`. Does not support window attention.
- CPU: Fallback mode with no GPU acceleration. Missing Flash Attention, paged attention, and all CUDA kernels.
- Multi-GPU: Tensor parallelism supported via NCCL. Set `WORLD_SIZE` and `RANK` environment variables.
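The one-process-per-GPU convention behind the `AssertionError` in Common Errors can be sketched as a rank-to-device mapping. The helper below is hypothetical; LoRAX's launcher handles this assignment internally:

```python
def assign_device(rank: int, world_size: int, gpu_count: int) -> str:
    """Map a tensor-parallel rank to a CUDA device string, enforcing
    the one-process-per-GPU invariant. Illustrative sketch only."""
    assert world_size <= gpu_count, "Each process is one gpu"
    assert 0 <= rank < world_size
    return f"cuda:{rank}"

# RANK=1, WORLD_SIZE=2 on a 2-GPU node maps to "cuda:1".
```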