
Environment:Axolotl_ai_cloud_Axolotl_CUDA_GPU

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, GPU_Computing
Last Updated: 2026-02-06 22:33 GMT

Overview

NVIDIA GPU environment with CUDA support, bitsandbytes for quantization, and optional Flash Attention for accelerated single-GPU LLM fine-tuning.

Description

This environment extends the Python_Runtime with NVIDIA GPU hardware and CUDA-specific libraries required for GPU-accelerated training. It includes bitsandbytes for 4-bit and 8-bit quantization (QLoRA), xformers or Flash Attention for memory-efficient attention computation, and Triton for custom CUDA kernels. The runtime auto-detects CUDA availability and GPU compute capability, falling back to CPU when no GPU is present. FP8 training requires Hopper architecture (compute capability >= 9.0).

Usage

Use this environment for any single-GPU training workflow that requires GPU acceleration. This includes QLoRA fine-tuning, LoRA fine-tuning, model loading with quantization, LoRA merging, and DPO/RLHF training. It is the mandatory prerequisite for any Implementation that calls `torch.cuda` operations.
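Since this environment is a hard prerequisite, workflows should fail fast when no accelerator is present rather than silently training on CPU. A minimal sketch of that guard (the function name and signature are illustrative, not Axolotl's actual API; the fallback order mirrors the Description above):

```python
def require_accelerator(cuda_available: bool, mps_available: bool = False) -> str:
    """Return the device string a single-GPU workflow should use.

    Mirrors the fallback order described above (CUDA, then MPS);
    illustrative helper, not part of Axolotl's public API.
    """
    if cuda_available:
        return "cuda:0"
    if mps_available:
        return "mps"
    # QLoRA/LoRA fine-tuning is impractical on CPU; surface this early.
    raise SystemError("No CUDA/mps device found; GPU environment required")
```

In practice the booleans would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, as shown in the Code Evidence section below.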

System Requirements

Category       | Requirement                       | Notes
OS             | Linux (Ubuntu 20.04+ recommended) | macOS MPS partially supported; NPU (Ascend) experimentally supported
Hardware       | NVIDIA GPU with CUDA support      | Minimum 16GB VRAM for 7B models (24GB+ recommended)
Hardware (FP8) | NVIDIA Hopper GPU (H100/H200)     | Compute capability >= 9.0 required for FP8 training
CUDA           | CUDA 11.8+                        | Determined by torch wheel variant (cu118, cu121, cu126)
Disk           | 50GB+ SSD                         | For model weights, checkpoints, and dataset caching
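The FP8 requirement above reduces to a tuple comparison once the compute capability is known (e.g. from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple). A minimal sketch, with the helper name being illustrative:

```python
def supports_fp8(compute_capability: tuple) -> bool:
    """FP8 training needs Hopper or newer: compute capability >= (9, 0)."""
    # Tuple comparison is lexicographic, so (8, 9) (Ada) sorts below
    # (9, 0) (Hopper) even though 89 > 90 would be false anyway.
    return compute_capability >= (9, 0)
```

This mirrors the `compute_supports_fp8` check quoted under Code Evidence, minus the `RuntimeError` handling for machines without a GPU.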

Dependencies

System Packages

  • `nvidia-driver` >= 525 (for CUDA 11.8+)
  • `cuda-toolkit` (matching torch CUDA version)

Python Packages (GPU-Specific)

  • `bitsandbytes` == 0.49.1 (4-bit/8-bit quantization)
  • `triton` >= 3.0.0 (custom CUDA kernels)
  • `xformers` >= 0.0.23.post1 (memory-efficient attention; version pinned per torch)
  • `liger-kernel` == 0.6.4 (optimized CUDA kernels)
  • `nvidia-ml-py` == 12.560.30 (GPU monitoring)
  • `torchao` == 0.13.0 (quantization-aware training)

Optional GPU Packages

  • `flash-attn` == 2.8.3 (Flash Attention 2 for fast attention)
  • `mamba-ssm` == 1.2.0.post1 + `causal_conv1d` (Mamba state-space models)
  • `auto-gptq` == 0.5.1 (GPTQ quantization support)
  • `fbgemm-gpu-genai` == 1.3.0 (Facebook GEMM GPU kernels)
  • `vllm` (version depends on torch: 0.10.0 to 0.14.0)
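Optional packages like these are typically feature-gated rather than imported unconditionally. A minimal sketch of the detection pattern (the same `importlib.util.find_spec` approach Axolotl uses for `flash_attn`, quoted under Code Evidence; the backend names in the example are illustrative):

```python
from importlib.util import find_spec


def optional_available(package: str) -> bool:
    """True if the package is importable, without actually importing it."""
    return find_spec(package) is not None


# Example: pick an attention backend based on what is installed.
attn_backend = "flash_attention_2" if optional_available("flash_attn") else "sdpa"
```

Checking `find_spec` instead of wrapping an `import` in try/except avoids paying the import cost (and any CUDA initialization side effects) just to test availability.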

Credentials

No GPU-specific credentials required. See Environment:Axolotl_ai_cloud_Axolotl_Python_Runtime for general credentials.

The following GPU-related environment variables are auto-configured:

  • `PYTORCH_CUDA_ALLOC_CONF`: Auto-set to `expandable_segments:True,roundup_power2_divisions:16` for torch >= 2.2.
  • `PYTORCH_ALLOC_CONF`: Auto-set for torch >= 2.9 (renamed from PYTORCH_CUDA_ALLOC_CONF).
  • `XFORMERS_IGNORE_FLASH_VERSION_CHECK`: Auto-set to "1" to suppress version mismatch warnings.
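The allocator-variable rename on torch >= 2.9 amounts to version-dependent selection of the environment variable name. A hedged sketch of that logic (Axolotl configures this internally; the helper below is illustrative and compares only the major/minor components of the version string):

```python
def cuda_alloc_env_var(torch_version: str):
    """Pick the allocator env var name for a given torch version string.

    Returns None for torch < 2.2, where the allocator is left at defaults.
    Illustrative helper, not part of Axolotl's public API.
    """
    # "2.4.1+cu121" -> (2, 4); local version suffixes are ignored.
    major, minor = (int(p) for p in torch_version.split(".")[:2])
    if (major, minor) >= (2, 9):
        return "PYTORCH_ALLOC_CONF"       # renamed in torch 2.9
    if (major, minor) >= (2, 2):
        return "PYTORCH_CUDA_ALLOC_CONF"  # torch 2.2 through 2.8
    return None
```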

Quick Install

# Install Axolotl with Flash Attention support
pip install axolotl[flash-attn]

# Or install Flash Attention separately
pip install flash-attn==2.8.3 --no-build-isolation

# Verify CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU: {torch.cuda.get_device_name(0)}'); print(f'Compute capability: {torch.cuda.get_device_capability()}')"

Code Evidence

CUDA availability detection from `src/axolotl/utils/config/__init__.py:27-41`:

def choose_device(cfg):
    try:
        if torch.cuda.is_available():
            return f"cuda:{cfg.local_rank}"
        if torch.backends.mps.is_available():
            return "mps"
        if is_torch_npu_available():
            return f"npu:{cfg.local_rank}"
        raise SystemError("No CUDA/mps/npu device found")
    except Exception:
        return "cpu"

FP8 compute capability check from `src/axolotl/cli/config.py:267-272`:

def compute_supports_fp8() -> bool:
    try:
        compute_capability = torch.cuda.get_device_capability()
        return compute_capability >= (9, 0)
    except RuntimeError:
        return False

Flash Attention availability detection from `src/axolotl/loaders/model.py:147-149`:

@cached_property
def has_flash_attn(self) -> bool:
    return find_spec("flash_attn") is not None

GPU compute capability detection from `src/axolotl/cli/config.py:219-223`:

try:
    device_props = torch.cuda.get_device_properties("cuda")
    gpu_version = "sm_" + str(device_props.major) + str(device_props.minor)
except:
    gpu_version = None

MPS/NPU device mapping from `src/axolotl/loaders/model.py:498-502`:

cur_device = get_device_type()
if "mps" in str(cur_device):
    self.model_kwargs["device_map"] = "mps:0"
elif "npu" in str(cur_device):
    self.model_kwargs["device_map"] = "npu:0"

Common Errors

Error Message | Cause | Solution
`SystemError: No CUDA/mps/npu device found` | No GPU detected | Install NVIDIA drivers and CUDA toolkit; verify with `nvidia-smi`
`CUDA out of memory` | Insufficient VRAM for model/batch size | Reduce `micro_batch_size`, enable `gradient_checkpointing`, use QLoRA instead of LoRA
`RuntimeError: expected mat1 and mat2 to have the same dtype` | FP16/dtype mismatch during full fine-tune with sample packing | Use a LoRA adapter or enable Flash Attention
`ImportError: flash_attn` | Flash Attention not installed | `pip install flash-attn==2.8.3 --no-build-isolation`
FP8 config fails silently | GPU compute capability < 9.0 | FP8 requires Hopper (H100/H200) GPUs with compute capability >= 9.0
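For the out-of-memory case above, reducing `micro_batch_size` while raising `gradient_accumulation_steps` keeps the effective batch size, and hence the training dynamics, unchanged. A minimal sketch of that arithmetic (the parameter names follow Axolotl's config keys; the helper itself is illustrative):

```python
def effective_batch_size(micro_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Effective (global) batch size seen by the optimizer per step."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


# Halving micro_batch_size and doubling accumulation preserves the total:
assert effective_batch_size(4, 8) == effective_batch_size(2, 16) == 32
```

The trade-off is wall-clock time: more accumulation steps mean more forward/backward passes per optimizer step, but each pass fits in less VRAM.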

Compatibility Notes

  • NVIDIA GPUs: Full support for CUDA 11.8+. Recommended: A100 (40GB/80GB), H100, RTX 3090/4090.
  • Apple MPS: Partially supported with device_map="mps:0". Quantization (bitsandbytes) not available.
  • Huawei NPU: Experimentally supported via `is_torch_npu_available()`.
  • FP8 Training: Only available on Hopper architecture (H100/H200). Compute capability >= 9.0 required.
  • Mamba Models: Require the `mamba-ssm` extra; the device must be set explicitly to `torch.cuda.current_device()`.
  • CUDA Memory Allocator: Automatically configured with `expandable_segments:True` for torch >= 2.2 to reduce fragmentation.
