Environment:Unslothai_Unsloth_CUDA_BitsAndBytes
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, Quantization |
| Last Updated | 2026-02-07 09:00 GMT |
Overview
NVIDIA GPU environment with CUDA 11.8-13.0, bitsandbytes 0.45.5+, and Triton 3.0+ for 4-bit QLoRA training with fused Triton kernels.
Description
This environment extends the Python_Transformers base with GPU-specific requirements for 4-bit quantized model loading and training. It requires an NVIDIA GPU (or AMD/Intel GPU with reduced feature set) with CUDA toolkit, the bitsandbytes library for NF4 quantization, and Triton for fused custom kernels (RoPE, RMSNorm, cross-entropy, SwiGLU, GeGLU, LoRA MLP). The environment auto-detects GPU architecture (Ampere, Hopper, Blackwell) and adjusts behavior accordingly. bfloat16 training requires compute capability >= 8.0 (Ampere or newer).
Usage
Use this environment for any workflow that requires 4-bit quantized model loading (QLoRA), LoRA adapter injection with fused kernels, or vision model fine-tuning. This is the primary GPU environment for SFT training workflows.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | Windows via WSL2; macOS not supported |
| Hardware | NVIDIA GPU with compute capability >= 7.0 | Ampere (sm80+) recommended for bfloat16 support |
| VRAM | Minimum 8GB | 16GB+ recommended for 7B+ models |
| CUDA | 11.8, 12.1, 12.4, 12.6, 12.8, or 13.0 | Must match PyTorch CUDA version exactly |
| Disk | 50GB+ SSD | For model weights, merged outputs, and intermediate files |
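The compute-capability thresholds in the table can be checked with a small sketch. This is an illustrative helper (the function name `capability_report` is ours, not Unsloth's); in practice the `(major, minor)` pair comes from `torch.cuda.get_device_capability()`, hard-coded here so the sketch runs without a GPU.

```python
def capability_report(major: int, minor: int) -> dict:
    """Classify a CUDA compute capability against the requirements table.

    Illustrative only: pass in the (major, minor) pair reported by
    torch.cuda.get_device_capability().
    """
    return {
        "meets_minimum": (major, minor) >= (7, 0),  # Volta/Turing or newer
        "supports_bfloat16": major >= 8,            # Ampere (sm80) or newer
    }

# Example: an RTX 3090 reports capability (8, 6); a T4 reports (7, 5)
print(capability_report(8, 6))  # both flags True
print(capability_report(7, 5))  # minimum met, but no bfloat16
```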
Dependencies
System Packages
- `cuda-toolkit` = 11.8 / 12.1 / 12.4 / 12.6 / 12.8 / 13.0 (must match torch CUDA version)
- `cudnn` (bundled with CUDA toolkit or separate install)
- NVIDIA GPU driver compatible with chosen CUDA version
Python Packages
- `torch` >= 2.1.0, matched to CUDA version (e.g., `torch==2.4.0+cu121`)
- `bitsandbytes` >= 0.45.5, !=0.46.0, !=0.48.0
- `triton` >= 3.0.0 (Linux only)
- `xformers` >= 0.0.22.post7 (optional; used on Windows, where Triton is unavailable; version-pinned per CUDA/torch combo)
- `flash-attn` >= 2.6.3 (optional, recommended for Gemma 2 and attention-heavy models)
- All packages from Python_Transformers environment
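The `bitsandbytes` pin above combines a minimum version with two excluded releases. A minimal sketch of that constraint as a pure function (the name `bnb_version_ok` is ours; a real resolver would use `packaging.version` rather than naive tuple parsing):

```python
def bnb_version_ok(version: str) -> bool:
    """Check a bitsandbytes version string against the documented pin:
    >= 0.45.5, excluding the 0.46.0 and 0.48.0 releases.

    Naive numeric parsing for illustration; does not handle pre-release
    or post-release suffixes.
    """
    parts = tuple(int(p) for p in version.split(".")[:3])
    if parts in ((0, 46, 0), (0, 48, 0)):
        return False  # explicitly excluded releases
    return parts >= (0, 45, 5)

print(bnb_version_ok("0.45.5"))  # True
print(bnb_version_ok("0.46.0"))  # False (excluded release)
print(bnb_version_ok("0.48.3"))  # True
```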
Credentials
- `HF_TOKEN`: HuggingFace API token (for gated model access like Llama, Gemma).
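The token is typically exported as an environment variable before launching training, where `huggingface_hub` picks it up automatically. The token value below is a hypothetical placeholder:

```shell
# Hypothetical token value for illustration; create your own at
# https://huggingface.co/settings/tokens
export HF_TOKEN="hf_xxxxxxxxxxxx"
test -n "$HF_TOKEN" && echo "HF_TOKEN is set"
```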
Quick Install
# Install Unsloth with CUDA support (example for CUDA 12.1 + Torch 2.4)
pip install "unsloth[cu121-torch240]"
# Or manually install GPU packages
pip install "torch>=2.4.0" --index-url https://download.pytorch.org/whl/cu121
pip install "bitsandbytes>=0.45.5" "triton>=3.0.0"
# Optional: flash-attention for Gemma 2 and other models
pip install "flash-attn>=2.6.3"
Code Evidence
GPU device detection from `device_type.py:37-50`:
@functools.cache
def get_device_type():
    if hasattr(torch, "cuda") and torch.cuda.is_available():
        if is_hip():
            return "hip"
        return "cuda"
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    raise NotImplementedError("Unsloth currently only works on NVIDIA, AMD and Intel GPUs.")
CUDA version validation from `_auto_install.py:19-41`:
v = V(re.match(r"[0-9\.]{3,}", torch.__version__).group(0))
cuda = str(torch.version.cuda)
is_ampere = torch.cuda.get_device_capability()[0] >= 8
if cuda not in ("11.8", "12.1", "12.4", "12.6", "12.8", "13.0"):
    raise RuntimeError(...)
if v <= V('2.1.0'):
    raise RuntimeError(f"Torch = {v} too old!")
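The validation above can be sketched as a standalone pure function, runnable without torch installed. The function name `validate_build` is ours; in practice the inputs come from `torch.__version__` and `torch.version.cuda`, and the original's `<= 2.1.0` cutoff is approximated here with a major/minor comparison:

```python
import re

SUPPORTED_CUDA = ("11.8", "12.1", "12.4", "12.6", "12.8", "13.0")

def validate_build(torch_version: str, cuda_version: str) -> None:
    """Illustrative mirror of the _auto_install.py check.

    Raises RuntimeError for unsupported CUDA versions or torch
    builds at or below the 2.1 line.
    """
    # Strip local build tags like "+cu121" before comparing
    numeric = re.match(r"[0-9.]+", torch_version).group(0)
    major, minor = (int(x) for x in numeric.split(".")[:2])
    if cuda_version not in SUPPORTED_CUDA:
        raise RuntimeError(f"CUDA {cuda_version} not in {SUPPORTED_CUDA}")
    if (major, minor) <= (2, 1):
        raise RuntimeError(f"Torch = {numeric} too old!")

validate_build("2.4.0+cu121", "12.1")  # passes silently
```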
bfloat16 support detection from `__init__.py:183-185`:
if DEVICE_TYPE == "cuda":
    major_version, minor_version = torch.cuda.get_device_capability()
    SUPPORTS_BFLOAT16 = major_version >= 8
BitsAndBytes CUDA stream support from `kernels/utils.py:111-114`:
import bitsandbytes as bnb
HAS_CUDA_STREAM = Version(bnb.__version__) > Version("0.43.3")
Triton API version branching from `kernels/utils.py:62-72`:
if Version(triton.__version__) >= Version("3.0.0"):
    if DEVICE_TYPE == "xpu":
        triton_tanh = tl.extra.intel.libdevice.tanh
    else:
        from triton.language.extra import libdevice
        triton_tanh = libdevice.tanh
    triton_cast = tl.cast
else:
    triton_tanh = tl.math.tanh
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `NotImplementedError: Unsloth currently only works on NVIDIA, AMD and Intel GPUs` | No supported GPU detected | Install NVIDIA/AMD/Intel GPU drivers and CUDA toolkit |
| `RuntimeError: Torch = X too old!` | PyTorch version below 2.1.0 | Upgrade PyTorch: `pip install "torch>=2.4.0"` |
| `RuntimeError: CUDA version not in (11.8, 12.1, ...)` | Unsupported CUDA version | Install a supported CUDA version matching torch |
| `Running ldconfig /usr/lib64-nvidia to link CUDA` | CUDA libraries not linked | Usually auto-fixed; if persistent, run `ldconfig /usr/lib64-nvidia` manually |
| `If you want to finetune Gemma 2, install flash-attn` | flash-attn not installed for Gemma 2 | `pip install flash-attn>=2.6.3` |
| `RuntimeError: Intel xpu currently supports unsloth with torch.version >= 2.6.0` | Intel XPU with old torch | Upgrade to `torch>=2.6.0` for Intel XPU |
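When diagnosing the errors above, it helps to first dump the installed versions of the GPU stack. A minimal sketch using only the standard library (the function name `environment_report` is ours):

```python
from importlib.metadata import version, PackageNotFoundError

def environment_report(
    packages=("torch", "bitsandbytes", "triton", "xformers", "flash-attn"),
):
    """Collect installed versions of the GPU stack.

    A value of None means the package is not installed, which
    narrows down several of the errors in the table above.
    """
    report = {}
    for pkg in packages:
        try:
            report[pkg] = version(pkg)
        except PackageNotFoundError:
            report[pkg] = None
    return report

for pkg, ver in environment_report().items():
    print(f"{pkg}: {ver or 'NOT INSTALLED'}")
```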
Compatibility Notes
- AMD GPUs (ROCm/HIP): bitsandbytes requires >= 0.48.3 for AMD. Pre-quantized 4-bit models may not work on AMD Instinct GPUs (MI series) due to warp size differences (AMD Instinct uses blocksize 128 vs NVIDIA's 64). Radeon Navi GPUs (warp size 32) are compatible.
- Intel XPU: Requires PyTorch >= 2.6.0. Uses `pytorch_triton_xpu` instead of standard Triton.
- Blackwell GPUs (SM100): vLLM with torch < 2.9.0 crashes on Blackwell GPUs. Triton PDL is auto-disabled via `TRITON_DISABLE_PDL=1`.
- DGX Spark (GB10): `fast_inference=True` is currently broken on DGX Spark; Unsloth auto-disables it.
- Ampere+ (sm80+): Required for bfloat16 training. Older GPUs (Turing, Volta) fall back to float16.
- Torch AMP API: PyTorch < 2.4.0 uses `torch.cuda.amp.custom_fwd`; >= 2.4.0 uses `torch.amp.custom_fwd(device_type="cuda")`.
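The AMP API split in the last note can be expressed as a small version gate. This sketch returns which decorator path applies for a given torch version string; it is pure string logic so it runs without torch installed (a real shim would `getattr` on the torch module instead):

```python
def amp_custom_fwd_path(torch_version: str) -> str:
    """Return the AMP custom_fwd location for a torch version,
    per the compatibility note above. Illustrative helper only.
    """
    numeric = torch_version.split("+")[0]  # drop build tags like "+cu121"
    major, minor = (int(x) for x in numeric.split(".")[:2])
    if (major, minor) >= (2, 4):
        return "torch.amp.custom_fwd(device_type='cuda')"
    return "torch.cuda.amp.custom_fwd"

print(amp_custom_fwd_path("2.3.1"))        # torch.cuda.amp.custom_fwd
print(amp_custom_fwd_path("2.4.0+cu121"))  # torch.amp.custom_fwd(device_type='cuda')
```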
Related Pages
- Implementation:Unslothai_Unsloth_FastLanguageModel_From_Pretrained
- Implementation:Unslothai_Unsloth_FastVisionModel_From_Pretrained
- Implementation:Unslothai_Unsloth_FastLanguageModel_Get_Peft_Model
- Implementation:Unslothai_Unsloth_FastVisionModel_Get_Peft_Model
- Implementation:Unslothai_Unsloth_Evaluate_OCR_Model