# Environment:Axolotl_ai_cloud_Axolotl_CUDA_GPU
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, GPU_Computing |
| Last Updated | 2026-02-06 22:33 GMT |
## Overview
NVIDIA GPU environment with CUDA support, bitsandbytes for quantization, and optional Flash Attention for accelerated single-GPU LLM fine-tuning.
## Description
This environment extends the Python_Runtime with NVIDIA GPU hardware and CUDA-specific libraries required for GPU-accelerated training. It includes bitsandbytes for 4-bit and 8-bit quantization (QLoRA), xformers or Flash Attention for memory-efficient attention computation, and Triton for custom CUDA kernels. The runtime auto-detects CUDA availability and GPU compute capability, falling back to CPU when no GPU is present. FP8 training requires Hopper architecture (compute capability >= 9.0).
## Usage
Use this environment for any single-GPU training workflow that requires GPU acceleration, including QLoRA fine-tuning, LoRA fine-tuning, quantized model loading, LoRA merging, and DPO/RLHF training. It is the mandatory prerequisite for any Implementation that calls `torch.cuda` operations.
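As an illustration, a minimal single-GPU QLoRA setup maps onto Axolotl config keys like the following. This is a hedged sketch: the model name and every hyperparameter value are placeholders, not recommendations.

```yaml
base_model: NousResearch/Llama-2-7b-hf  # placeholder model
load_in_4bit: true         # bitsandbytes NF4 quantization (QLoRA)
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
micro_batch_size: 2
gradient_accumulation_steps: 4
gradient_checkpointing: true
flash_attention: true      # requires the flash-attn extra
bf16: auto
```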
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | macOS MPS partially supported; NPU (Ascend) experimentally supported |
| Hardware | NVIDIA GPU with CUDA support | Minimum 16GB VRAM for 7B models (24GB+ recommended) |
| Hardware (FP8) | NVIDIA Hopper GPU (H100/H200) | Compute capability >= 9.0 required for FP8 training |
| CUDA | CUDA 11.8+ | Determined by torch wheel variant (cu118, cu121, cu126) |
| Disk | 50GB+ SSD | For model weights, checkpoints, and dataset caching |
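The VRAM floor in the table follows from simple weight-size arithmetic. A rough sketch (weights only; activations, gradients, optimizer state, and the KV cache add substantially more on top):

```python
def weight_vram_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate GiB needed just to hold the model weights."""
    return n_params * bytes_per_param / 1024**3

params_7b = 7e9
fp16_gib = weight_vram_gib(params_7b, 2.0)  # ~13 GiB: already near a 16GB card's limit
nf4_gib = weight_vram_gib(params_7b, 0.5)   # ~3.3 GiB: why QLoRA fits on much smaller cards
print(f"fp16: {fp16_gib:.1f} GiB, nf4: {nf4_gib:.1f} GiB")
```

This is why a 7B model at full fp16 precision barely fits a 16GB card once training overhead is included, while 4-bit (NF4) quantization leaves ample headroom.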
## Dependencies
### System Packages
- `nvidia-driver` >= 525 (for CUDA 11.8+)
- `cuda-toolkit` (matching torch CUDA version)
### Python Packages (GPU-Specific)
- `bitsandbytes` == 0.49.1 (4-bit/8-bit quantization)
- `triton` >= 3.0.0 (custom CUDA kernels)
- `xformers` >= 0.0.23.post1 (memory-efficient attention; version pinned per torch)
- `liger-kernel` == 0.6.4 (optimized CUDA kernels)
- `nvidia-ml-py` == 12.560.30 (GPU monitoring)
- `torchao` == 0.13.0 (quantization-aware training)
### Optional GPU Packages
- `flash-attn` == 2.8.3 (Flash Attention 2 for fast attention)
- `mamba-ssm` == 1.2.0.post1 + `causal_conv1d` (Mamba state-space models)
- `auto-gptq` == 0.5.1 (GPTQ quantization support)
- `fbgemm-gpu-genai` == 1.3.0 (Facebook GEMM GPU kernels)
- `vllm` (version depends on torch: 0.10.0 to 0.14.0)
## Credentials
No GPU-specific credentials required. See Environment:Axolotl_ai_cloud_Axolotl_Python_Runtime for general credentials.
The following GPU-related environment variables are auto-configured:
- `PYTORCH_CUDA_ALLOC_CONF`: Auto-set to `expandable_segments:True,roundup_power2_divisions:16` for torch >= 2.2.
- `PYTORCH_ALLOC_CONF`: Auto-set for torch >= 2.9 (renamed from PYTORCH_CUDA_ALLOC_CONF).
- `XFORMERS_IGNORE_FLASH_VERSION_CHECK`: Auto-set to "1" to suppress version mismatch warnings.
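Axolotl sets these variables automatically, so a manual step is normally unnecessary. For illustration only, replicating the same values by hand would look like this; note that allocator settings must be in the environment before torch makes its first CUDA allocation, or they are silently ignored:

```python
import os

# Shown only to illustrate the auto-configured values; set before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "expandable_segments:True,roundup_power2_divisions:16"
)
os.environ["XFORMERS_IGNORE_FLASH_VERSION_CHECK"] = "1"
```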
## Quick Install

```bash
# Install Axolotl with Flash Attention support
pip install "axolotl[flash-attn]"

# Or install Flash Attention separately
pip install flash-attn==2.8.3 --no-build-isolation

# Verify CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU: {torch.cuda.get_device_name(0)}'); print(f'Compute capability: {torch.cuda.get_device_capability()}')"
```
## Code Evidence

CUDA availability detection from `src/axolotl/utils/config/__init__.py:27-41`:

```python
def choose_device(cfg):
    try:
        if torch.cuda.is_available():
            return f"cuda:{cfg.local_rank}"
        if torch.backends.mps.is_available():
            return "mps"
        if is_torch_npu_available():
            return f"npu:{cfg.local_rank}"
        raise SystemError("No CUDA/mps/npu device found")
    except Exception:
        return "cpu"
```
FP8 compute capability check from `src/axolotl/cli/config.py:267-272`:

```python
def compute_supports_fp8() -> bool:
    try:
        compute_capability = torch.cuda.get_device_capability()
        return compute_capability >= (9, 0)
    except RuntimeError:
        return False
```
Flash Attention availability detection from `src/axolotl/loaders/model.py:147-149`:

```python
@cached_property
def has_flash_attn(self) -> bool:
    return find_spec("flash_attn") is not None
```
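The same `find_spec` probe works for any optional GPU extra. A small sketch of how you might apply the pattern yourself (the `optional_extras` list here is illustrative, not from Axolotl):

```python
from importlib.util import find_spec

# Probe optional extras without importing them: importing can itself fail on
# machines whose CUDA setup doesn't match the compiled wheel, while find_spec
# only checks that the package is installed.
optional_extras = ["flash_attn", "bitsandbytes", "xformers", "mamba_ssm"]
available = {name: find_spec(name) is not None for name in optional_extras}
```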
GPU compute capability detection from `src/axolotl/cli/config.py:219-223`:

```python
try:
    device_props = torch.cuda.get_device_properties("cuda")
    gpu_version = "sm_" + str(device_props.major) + str(device_props.minor)
except:
    gpu_version = None
```
MPS/NPU device mapping from `src/axolotl/loaders/model.py:498-502`:

```python
cur_device = get_device_type()
if "mps" in str(cur_device):
    self.model_kwargs["device_map"] = "mps:0"
elif "npu" in str(cur_device):
    self.model_kwargs["device_map"] = "npu:0"
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `SystemError: No CUDA/mps/npu device found` | No GPU detected | Install NVIDIA drivers and CUDA toolkit; verify with `nvidia-smi` |
| `CUDA out of memory` | Insufficient VRAM for model/batch size | Reduce `micro_batch_size`, enable `gradient_checkpointing`, use QLoRA instead of LoRA |
| `RuntimeError: expected mat1 and mat2 to have the same dtype` | FP16/dtype mismatch during full fine-tune with sample packing | Use LoRA adapter or enable Flash Attention |
| `ImportError: flash_attn` | Flash Attention not installed | `pip install flash-attn==2.8.3 --no-build-isolation` |
| FP8 config fails silently | GPU compute capability < 9.0 | FP8 requires Hopper (H100/H200) GPUs with compute capability >= 9.0 |
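For the `CUDA out of memory` row, the suggested mitigations correspond to config keys along these lines (values are illustrative only; tune them for your model and card):

```yaml
micro_batch_size: 1             # smallest per-step batch
gradient_accumulation_steps: 8  # keep the effective batch size constant
gradient_checkpointing: true    # trade recompute time for activation memory
load_in_4bit: true              # switch LoRA -> QLoRA
adapter: qlora
```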
## Compatibility Notes
- NVIDIA GPUs: Full support for CUDA 11.8+. Recommended: A100 (40GB/80GB), H100, RTX 3090/4090.
- Apple MPS: Partially supported with device_map="mps:0". Quantization (bitsandbytes) not available.
- Huawei NPU: Experimentally supported via `is_torch_npu_available()`.
- FP8 Training: Only available on Hopper architecture (H100/H200). Compute capability >= 9.0 required.
- Mamba Models: Require the `mamba-ssm` extra; the device must be explicitly set to `torch.cuda.current_device()`.
- CUDA Memory Allocator: Automatically configured with `expandable_segments:True` for torch >= 2.2 to reduce fragmentation.
## Related Pages
- Implementation:Axolotl_ai_cloud_Axolotl_ModelLoader_Load
- Implementation:Axolotl_ai_cloud_Axolotl_Load_Lora
- Implementation:Axolotl_ai_cloud_Axolotl_HFCausalTrainerBuilder_Build
- Implementation:Axolotl_ai_cloud_Axolotl_Do_Merge_Lora
- Implementation:Axolotl_ai_cloud_Axolotl_Setup_Reference_Model
- Implementation:Axolotl_ai_cloud_Axolotl_HFRLTrainerBuilder_Build
- Implementation:Axolotl_ai_cloud_Axolotl_ModelLoader_Load_Multimodal