
Environment:Axolotl_ai_cloud_Axolotl_CUDA_GPU

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, GPU_Computing
Last Updated: 2026-02-06 22:33 GMT

Overview

NVIDIA GPU environment with CUDA support, bitsandbytes for quantization, and optional Flash Attention for accelerated single-GPU LLM fine-tuning.

Description

This environment extends the Python_Runtime with NVIDIA GPU hardware and CUDA-specific libraries required for GPU-accelerated training. It includes bitsandbytes for 4-bit and 8-bit quantization (QLoRA), xformers or Flash Attention for memory-efficient attention computation, and Triton for custom CUDA kernels. The runtime auto-detects CUDA availability and GPU compute capability, falling back to CPU when no GPU is present. FP8 training requires Hopper architecture (compute capability >= 9.0).

Usage

Use this environment for any single-GPU training workflow that requires GPU acceleration. This includes QLoRA fine-tuning, LoRA fine-tuning, model loading with quantization, LoRA merging, and DPO/RLHF training. It is the mandatory prerequisite for any Implementation that calls `torch.cuda` operations.
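Since this environment is a hard prerequisite, workflows should fail fast when no accelerator is present rather than silently training on CPU. A minimal sketch of that guard (the function name and signature are illustrative, not Axolotl's actual API; the fallback order mirrors the Description above):

```python
def require_accelerator(cuda_available: bool, mps_available: bool = False) -> str:
    """Return the device string a single-GPU workflow should use.

    Mirrors the fallback order described above (CUDA, then MPS);
    illustrative helper, not part of Axolotl's public API.
    """
    if cuda_available:
        return "cuda:0"
    if mps_available:
        return "mps"
    # QLoRA/LoRA fine-tuning is impractical on CPU; surface this early.
    raise SystemError("No CUDA/mps device found; GPU environment required")
```

In practice the booleans would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, as shown in the Code Evidence section below.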

System Requirements

Category       | Requirement                       | Notes
OS             | Linux (Ubuntu 20.04+ recommended) | macOS MPS partially supported; NPU (Ascend) experimentally supported
Hardware       | NVIDIA GPU with CUDA support      | Minimum 16GB VRAM for 7B models (24GB+ recommended)
Hardware (FP8) | NVIDIA Hopper GPU (H100/H200)     | Compute capability >= 9.0 required for FP8 training
CUDA           | CUDA 11.8+                        | Determined by torch wheel variant (cu118, cu121, cu126)
Disk           | 50GB+ SSD                         | For model weights, checkpoints, and dataset caching
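The FP8 requirement above reduces to a tuple comparison once the compute capability is known (e.g. from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple). A minimal sketch, with the helper name being illustrative:

```python
def supports_fp8(compute_capability: tuple) -> bool:
    """FP8 training needs Hopper or newer: compute capability >= (9, 0)."""
    # Tuple comparison is lexicographic, so (8, 9) (Ada) sorts below
    # (9, 0) (Hopper) even though 89 > 90 would be false anyway.
    return compute_capability >= (9, 0)
```

This mirrors the `compute_supports_fp8` check quoted under Code Evidence, minus the `RuntimeError` handling for machines without a GPU.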

Dependencies

System Packages

  • `nvidia-driver` >= 525 (for CUDA 11.8+)
  • `cuda-toolkit` (matching torch CUDA version)

Python Packages (GPU-Specific)

  • `bitsandbytes` == 0.49.1 (4-bit/8-bit quantization)
  • `triton` >= 3.0.0 (custom CUDA kernels)
  • `xformers` >= 0.0.23.post1 (memory-efficient attention; version pinned per torch)
  • `liger-kernel` == 0.6.4 (optimized CUDA kernels)
  • `nvidia-ml-py` == 12.560.30 (GPU monitoring)
  • `torchao` == 0.13.0 (quantization-aware training)

Optional GPU Packages

  • `flash-attn` == 2.8.3 (Flash Attention 2 for fast attention)
  • `mamba-ssm` == 1.2.0.post1 + `causal_conv1d` (Mamba state-space models)
  • `auto-gptq` == 0.5.1 (GPTQ quantization support)
  • `fbgemm-gpu-genai` == 1.3.0 (Facebook GEMM GPU kernels)
  • `vllm` (version depends on torch: 0.10.0 to 0.14.0)
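Optional packages like these are typically feature-gated rather than imported unconditionally. A minimal sketch of the detection pattern (the same `importlib.util.find_spec` approach Axolotl uses for `flash_attn`, quoted under Code Evidence; the backend names in the example are illustrative):

```python
from importlib.util import find_spec


def optional_available(package: str) -> bool:
    """True if the package is importable, without actually importing it."""
    return find_spec(package) is not None


# Example: pick an attention backend based on what is installed.
attn_backend = "flash_attention_2" if optional_available("flash_attn") else "sdpa"
```

Checking `find_spec` instead of wrapping an `import` in try/except avoids paying the import cost (and any CUDA initialization side effects) just to test availability.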

Credentials

No GPU-specific credentials required. See Environment:Axolotl_ai_cloud_Axolotl_Python_Runtime for general credentials.

The following GPU-related environment variables are auto-configured:

  • `PYTORCH_CUDA_ALLOC_CONF`: Auto-set to `expandable_segments:True,roundup_power2_divisions:16` for torch >= 2.2.
  • `PYTORCH_ALLOC_CONF`: Auto-set for torch >= 2.9 (renamed from PYTORCH_CUDA_ALLOC_CONF).
  • `XFORMERS_IGNORE_FLASH_VERSION_CHECK`: Auto-set to "1" to suppress version mismatch warnings.
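The allocator-variable rename on torch >= 2.9 amounts to version-dependent selection of the environment variable name. A hedged sketch of that logic (Axolotl configures this internally; the helper below is illustrative and compares only the major/minor components of the version string):

```python
def cuda_alloc_env_var(torch_version: str):
    """Pick the allocator env var name for a given torch version string.

    Returns None for torch < 2.2, where the allocator is left at defaults.
    Illustrative helper, not part of Axolotl's public API.
    """
    # "2.4.1+cu121" -> (2, 4); local version suffixes are ignored.
    major, minor = (int(p) for p in torch_version.split(".")[:2])
    if (major, minor) >= (2, 9):
        return "PYTORCH_ALLOC_CONF"       # renamed in torch 2.9
    if (major, minor) >= (2, 2):
        return "PYTORCH_CUDA_ALLOC_CONF"  # torch 2.2 through 2.8
    return None
```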

Quick Install

# Install Axolotl with Flash Attention support
pip install axolotl[flash-attn]

# Or install Flash Attention separately
pip install flash-attn==2.8.3 --no-build-isolation

# Verify CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU: {torch.cuda.get_device_name(0)}'); print(f'Compute capability: {torch.cuda.get_device_capability()}')"

Code Evidence

CUDA availability detection from `src/axolotl/utils/config/__init__.py:27-41`:

def choose_device(cfg):
    try:
        if torch.cuda.is_available():
            return f"cuda:{cfg.local_rank}"
        if torch.backends.mps.is_available():
            return "mps"
        if is_torch_npu_available():
            return f"npu:{cfg.local_rank}"
        raise SystemError("No CUDA/mps/npu device found")
    except Exception:
        return "cpu"

FP8 compute capability check from `src/axolotl/cli/config.py:267-272`:

def compute_supports_fp8() -> bool:
    try:
        compute_capability = torch.cuda.get_device_capability()
        return compute_capability >= (9, 0)
    except RuntimeError:
        return False

Flash Attention availability detection from `src/axolotl/loaders/model.py:147-149`:

@cached_property
def has_flash_attn(self) -> bool:
    return find_spec("flash_attn") is not None

GPU compute capability detection from `src/axolotl/cli/config.py:219-223`:

try:
    device_props = torch.cuda.get_device_properties("cuda")
    gpu_version = "sm_" + str(device_props.major) + str(device_props.minor)
except:
    gpu_version = None

MPS/NPU device mapping from `src/axolotl/loaders/model.py:498-502`:

cur_device = get_device_type()
if "mps" in str(cur_device):
    self.model_kwargs["device_map"] = "mps:0"
elif "npu" in str(cur_device):
    self.model_kwargs["device_map"] = "npu:0"

Common Errors

Error Message | Cause | Solution
`SystemError: No CUDA/mps/npu device found` | No GPU detected | Install NVIDIA drivers and CUDA toolkit; verify with `nvidia-smi`
`CUDA out of memory` | Insufficient VRAM for model/batch size | Reduce `micro_batch_size`, enable `gradient_checkpointing`, use QLoRA instead of LoRA
`RuntimeError: expected mat1 and mat2 to have the same dtype` | FP16/dtype mismatch during full fine-tune with sample packing | Use a LoRA adapter or enable Flash Attention
`ImportError: flash_attn` | Flash Attention not installed | `pip install flash-attn==2.8.3 --no-build-isolation`
FP8 config fails silently | GPU compute capability < 9.0 | FP8 requires Hopper (H100/H200) GPUs with compute capability >= 9.0
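For the out-of-memory case above, reducing `micro_batch_size` while raising `gradient_accumulation_steps` keeps the effective batch size, and hence the training dynamics, unchanged. A minimal sketch of that arithmetic (the parameter names follow Axolotl's config keys; the helper itself is illustrative):

```python
def effective_batch_size(micro_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Effective (global) batch size seen by the optimizer per step."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


# Halving micro_batch_size and doubling accumulation preserves the total:
assert effective_batch_size(4, 8) == effective_batch_size(2, 16) == 32
```

The trade-off is wall-clock time: more accumulation steps mean more forward/backward passes per optimizer step, but each pass fits in less VRAM.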

Compatibility Notes

  • NVIDIA GPUs: Full support for CUDA 11.8+. Recommended: A100 (40GB/80GB), H100, RTX 3090/4090.
  • Apple MPS: Partially supported with device_map="mps:0". Quantization (bitsandbytes) not available.
  • Huawei NPU: Experimentally supported via `is_torch_npu_available()`.
  • FP8 Training: Only available on Hopper architecture (H100/H200). Compute capability >= 9.0 required.
  • Mamba Models: Require the `mamba-ssm` extra; the device must be set explicitly to `torch.cuda.current_device()`.
  • CUDA Memory Allocator: Automatically configured with `expandable_segments:True` for torch >= 2.2 to reduce fragmentation.
