

Environment:Unslothai Unsloth CUDA BitsAndBytes

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, Quantization
Last Updated: 2026-02-07 09:00 GMT

Overview

An NVIDIA GPU environment with CUDA 11.8–13.0, bitsandbytes 0.45.5+, and Triton 3.0+, used for 4-bit QLoRA training with fused Triton kernels.

Description

This environment extends the Python_Transformers base with GPU-specific requirements for 4-bit quantized model loading and training. It requires an NVIDIA GPU (or an AMD/Intel GPU with a reduced feature set) with the CUDA toolkit, the bitsandbytes library for NF4 quantization, and Triton for fused custom kernels (RoPE, RMSNorm, cross-entropy, SwiGLU, GeGLU, LoRA MLP). The environment auto-detects the GPU architecture (Ampere, Hopper, Blackwell) and adjusts behavior accordingly. bfloat16 training requires compute capability >= 8.0 (Ampere or newer).
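The capability-based dtype choice above can be sketched as follows. This is a minimal stdlib-only illustration; the `ARCH_NAMES` mapping and helper names are our own, not Unsloth's API.

```python
# Illustrative mapping from compute-capability major version to architecture.
ARCH_NAMES = {7: "Volta/Turing", 8: "Ampere", 9: "Hopper", 10: "Blackwell"}

def supports_bfloat16(major: int) -> bool:
    # bfloat16 training requires compute capability >= 8.0 (Ampere or newer)
    return major >= 8

def pick_train_dtype(major: int, minor: int) -> str:
    # Older GPUs fall back to float16, as noted in the compatibility section.
    name = ARCH_NAMES.get(major, "unknown")
    dtype = "bfloat16" if supports_bfloat16(major) else "float16"
    return f"sm{major}{minor} ({name}): {dtype}"
```

On real hardware the `(major, minor)` pair would come from `torch.cuda.get_device_capability()`, as shown in the Code Evidence section below.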

Usage

Use this environment for any workflow that requires 4-bit quantized model loading (QLoRA), LoRA adapter injection with fused kernels, or vision model fine-tuning. This is the primary GPU environment for SFT training workflows.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (Ubuntu 20.04+ recommended) | Windows via WSL2; macOS not supported |
| Hardware | NVIDIA GPU with compute capability >= 7.0 | Ampere (sm80+) recommended for bfloat16 support |
| VRAM | 8 GB minimum | 16 GB+ recommended for 7B+ models |
| CUDA | 11.8, 12.1, 12.4, 12.6, 12.8, or 13.0 | Must match the PyTorch CUDA build exactly |
| Disk | 50 GB+ SSD | For model weights, merged outputs, and intermediate files |

Dependencies

System Packages

  • `cuda-toolkit` = 11.8 / 12.1 / 12.4 / 12.6 / 12.8 / 13.0 (must match torch CUDA version)
  • `cudnn` (bundled with CUDA toolkit or separate install)
  • NVIDIA GPU driver compatible with chosen CUDA version

Python Packages

  • `torch` >= 2.1.0, matched to CUDA version (e.g., `torch==2.4.0+cu121`)
  • `bitsandbytes` >= 0.45.5, !=0.46.0, !=0.48.0
  • `triton` >= 3.0.0 (Linux only)
  • `xformers` >= 0.0.22.post7 (optional, Windows; version-pinned per CUDA/torch combo)
  • `flash-attn` >= 2.6.3 (optional, recommended for Gemma 2 and attention-heavy models)
  • All packages from Python_Transformers environment
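The bitsandbytes constraint above combines a floor with two excluded releases. A stdlib-only sketch of that check (in practice pip enforces it; the helper name is illustrative):

```python
def bnb_version_ok(version: str) -> bool:
    # bitsandbytes must be >= 0.45.5, excluding the 0.46.0 and 0.48.0 releases.
    # Naive dotted-integer parsing; pre/post-release suffixes are not handled.
    parts = tuple(int(p) for p in version.split("."))
    return parts >= (0, 45, 5) and version not in ("0.46.0", "0.48.0")
```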

Credentials

  • `HF_TOKEN`: HuggingFace API token (for gated model access like Llama, Gemma).
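A small sketch of reading this credential from the environment before training; the helper is illustrative and not part of Unsloth:

```python
import os

def get_hf_token() -> "str | None":
    # HF_TOKEN must be present for gated models (Llama, Gemma) to download.
    token = os.environ.get("HF_TOKEN")
    if not token:
        print("Warning: HF_TOKEN is not set; gated model downloads will fail")
    return token
```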

Quick Install

# Install Unsloth with CUDA support (example for CUDA 12.1 + Torch 2.4)
pip install "unsloth[cu121-torch240]"

# Or manually install GPU packages
pip install "torch>=2.4.0" --index-url https://download.pytorch.org/whl/cu121
pip install "bitsandbytes>=0.45.5" "triton>=3.0.0"

# Optional: flash-attention for Gemma 2 and other models
pip install "flash-attn>=2.6.3"
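The pip extras name in the install command above encodes the CUDA and torch versions (e.g. CUDA 12.1 + torch 2.4.0 gives `cu121-torch240`). A sketch of deriving it, with the supported CUDA set taken from this page; the helper itself is illustrative, not part of Unsloth:

```python
SUPPORTED_CUDA = ("11.8", "12.1", "12.4", "12.6", "12.8", "13.0")

def unsloth_extras_tag(cuda: str, torch_version: str) -> str:
    # Reject CUDA versions outside the supported set, mirroring the
    # validation shown in the Code Evidence section below.
    if cuda not in SUPPORTED_CUDA:
        raise ValueError(f"CUDA {cuda} is not supported; choose one of {SUPPORTED_CUDA}")
    # "12.1" + "2.4.0" -> "cu121-torch240"
    return f"cu{cuda.replace('.', '')}-torch{torch_version.replace('.', '')}"
```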

Code Evidence

GPU device detection from `device_type.py:37-50`:

@functools.cache
def get_device_type():
    if hasattr(torch, "cuda") and torch.cuda.is_available():
        if is_hip():
            return "hip"
        return "cuda"
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    raise NotImplementedError("Unsloth currently only works on NVIDIA, AMD and Intel GPUs.")

CUDA version validation from `_auto_install.py:19-41`:

# V is packaging.version.Version; the regex strips local suffixes like "+cu121"
v = V(re.match(r"[0-9\.]{3,}", torch.__version__).group(0))
cuda = str(torch.version.cuda)
is_ampere = torch.cuda.get_device_capability()[0] >= 8
if cuda not in ("11.8", "12.1", "12.4", "12.6", "12.8", "13.0"):
    raise RuntimeError(...)
if v <= V('2.1.0'):
    raise RuntimeError(f"Torch = {v} too old!")

bfloat16 support detection from `__init__.py:183-185`:

if DEVICE_TYPE == "cuda":
    major_version, minor_version = torch.cuda.get_device_capability()
    SUPPORTS_BFLOAT16 = major_version >= 8

BitsAndBytes CUDA stream support from `kernels/utils.py:111-114`:

import bitsandbytes as bnb
HAS_CUDA_STREAM = Version(bnb.__version__) > Version("0.43.3")

Triton API version branching from `kernels/utils.py:62-72`:

if Version(triton.__version__) >= Version("3.0.0"):
    if DEVICE_TYPE == "xpu":
        triton_tanh = tl.extra.intel.libdevice.tanh
    else:
        from triton.language.extra import libdevice
        triton_tanh = libdevice.tanh
    triton_cast = tl.cast
else:
    triton_tanh = tl.math.tanh

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `NotImplementedError: Unsloth currently only works on NVIDIA, AMD and Intel GPUs` | No supported GPU detected | Install NVIDIA/AMD/Intel GPU drivers and the CUDA toolkit |
| `RuntimeError: Torch = X too old!` | PyTorch version below 2.1.0 | Upgrade PyTorch: `pip install "torch>=2.4.0"` |
| `RuntimeError: CUDA version not in (11.8, 12.1, ...)` | Unsupported CUDA version | Install a supported CUDA version matching torch |
| `Running ldconfig /usr/lib64-nvidia to link CUDA` | CUDA libraries not linked | Usually auto-fixed; if persistent, run `ldconfig /usr/lib64-nvidia` manually |
| `If you want to finetune Gemma 2, install flash-attn` | flash-attn not installed for Gemma 2 | `pip install "flash-attn>=2.6.3"` |
| `RuntimeError: Intel xpu currently supports unsloth with torch.version >= 2.6.0` | Intel XPU with old torch | Upgrade to `torch>=2.6.0` for Intel XPU |
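The table above can be turned into a simple substring lookup for triaging logs. The error signatures and fixes come from this page; the helper itself is hypothetical and does not exist in Unsloth:

```python
# Map distinctive fragments of each error message to its suggested fix.
KNOWN_FIXES = {
    "only works on NVIDIA, AMD and Intel GPUs": "Install GPU drivers and a CUDA toolkit",
    "too old!": "Upgrade PyTorch: pip install 'torch>=2.4.0'",
    "install flash-attn": "pip install 'flash-attn>=2.6.3'",
}

def suggest_fix(message: str) -> "str | None":
    # Return the first matching fix, or None for unrecognized errors.
    for signature, fix in KNOWN_FIXES.items():
        if signature in message:
            return fix
    return None
```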

Compatibility Notes

  • AMD GPUs (ROCm/HIP): bitsandbytes requires >= 0.48.3 for AMD. Pre-quantized 4-bit models may not work on AMD Instinct GPUs (MI series) due to warp size differences (AMD Instinct uses blocksize 128 vs NVIDIA's 64). Radeon Navi GPUs (warp size 32) are compatible.
  • Intel XPU: Requires PyTorch >= 2.6.0. Uses `pytorch_triton_xpu` instead of standard Triton.
  • Blackwell GPUs (SM100): vLLM with torch < 2.9.0 crashes on Blackwell GPUs. Triton PDL is auto-disabled via `TRITON_DISABLE_PDL=1`.
  • DGX Spark (GB10): `fast_inference=True` is currently broken on DGX Spark; Unsloth auto-disables it.
  • Ampere+ (sm80+): Required for bfloat16 training. Older GPUs (Turing, Volta) fall back to float16.
  • Torch AMP API: PyTorch < 2.4.0 uses `torch.cuda.amp.custom_fwd`; >= 2.4.0 uses `torch.amp.custom_fwd(device_type="cuda")`.
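The AMP API branch in the last bullet can be sketched as follows. Version parsing here uses a minimal stdlib helper rather than `packaging.version`, and the function only reports which attribute path applies:

```python
def parse_version(v: str) -> tuple:
    # Strip local suffixes like "+cu121", keep the first three numeric parts.
    return tuple(int(p) for p in v.split("+")[0].split(".")[:3])

def amp_custom_fwd_path(torch_version: str) -> str:
    # PyTorch < 2.4.0 exposes custom_fwd under torch.cuda.amp; newer releases
    # moved it to torch.amp with an explicit device_type argument.
    if parse_version(torch_version) < (2, 4, 0):
        return "torch.cuda.amp.custom_fwd"
    return 'torch.amp.custom_fwd(device_type="cuda")'
```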
