Environment:Huggingface Diffusers Training Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Training |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Training environment for Diffusers fine-tuning workflows: requires accelerate >= 0.31.0, peft >= 0.17.0, transformers >= 4.41.2, and a CUDA/XPU GPU.
Description
This environment extends the base PyTorch CUDA runtime with libraries needed for fine-tuning diffusion models. The core training stack includes HuggingFace Accelerate for distributed training and mixed precision, PEFT for parameter-efficient fine-tuning (LoRA adapters), and the Datasets library for data loading. The PEFT backend is auto-enabled when peft >= 0.6.0 and transformers >= 4.34.0 are both installed (controlled by the `USE_PEFT_BACKEND` flag in `constants.py`). Training requires a GPU — the MPS backend does not support training (`BACKEND_SUPPORTS_TRAINING["mps"] = False`).
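A quick runtime check that the PEFT backend resolved as expected; `USE_PEFT_BACKEND` is re-exported from `diffusers.utils` in recent releases (the import path may differ in older versions):

```python
from diffusers.utils import USE_PEFT_BACKEND

# False means peft >= 0.6.0 and transformers >= 4.34.0 were not both found.
if not USE_PEFT_BACKEND:
    raise RuntimeError("PEFT backend disabled: install peft >= 0.6.0 and transformers >= 4.34.0")
```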
Usage
Required for LoRA fine-tuning, DreamBooth personalization, and any custom training workflow using the `examples/` training scripts. This environment is the prerequisite for all training-related Implementation pages.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended) | Full CUDA + NCCL support for distributed training |
| Hardware | NVIDIA GPU with >= 16GB VRAM | 24GB+ recommended for SDXL fine-tuning; A100/H100 for large models |
| Disk | 50GB+ SSD | For datasets, checkpoints, and model weights |
| RAM | 32GB+ | For dataset preprocessing and CPU offloading during training |
Dependencies
System Packages
- CUDA toolkit >= 11.8 (for PyTorch 2.x builds)
- NCCL (for multi-GPU distributed training)
Python Packages
Core training stack:
- `accelerate` >= 0.31.0
- `peft` >= 0.17.0
- `transformers` >= 4.41.2
- `datasets`
- `safetensors` >= 0.3.1
Optimizer and scheduling:
- `torch` >= 2.0.0 (for fused AdamW and compiled training)
- `prodigyopt` (optional — for the Prodigy optimizer in DreamBooth; see the sketch below)
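Both optimizer paths are one-liners once installed. A minimal sketch, assuming a CUDA device and a stand-in parameter list where a LoRA-wrapped model's parameters would normally go:

```python
import torch
from prodigyopt import Prodigy

# Stand-in for trainable LoRA parameters; fused AdamW requires CUDA tensors.
params = [torch.nn.Parameter(torch.randn(4, 4, device="cuda"))]

# Fused AdamW (torch >= 2.0): a single fused kernel per optimizer step.
adamw = torch.optim.AdamW(params, lr=1e-4, fused=True)

# Prodigy adapts the effective step size itself; the convention is lr=1.0.
prodigy = Prodigy(params, lr=1.0, weight_decay=0.01)
```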
Logging and monitoring:
- `tensorboard`
- `wandb` (optional — for Weights & Biases logging)
Data processing:
- `Pillow`
- `torchvision`
- `Jinja2`
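To fail fast before a long run, the version floors listed above can be verified at startup. A minimal sketch using only the standard library plus `packaging`:

```python
from importlib.metadata import PackageNotFoundError, version
from packaging.version import parse

# Pinned minimums from the dependency lists above.
FLOORS = {
    "accelerate": "0.31.0",
    "peft": "0.17.0",
    "transformers": "4.41.2",
    "safetensors": "0.3.1",
    "torch": "2.0.0",
}

for pkg, floor in FLOORS.items():
    try:
        installed = parse(version(pkg)).base_version
    except PackageNotFoundError:
        raise SystemExit(f"{pkg} is not installed; see Quick Install below")
    if parse(installed) < parse(floor):
        raise SystemExit(f"{pkg} {installed} is below the required {floor}")
print("dependency floors satisfied")
```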
Credentials
The following environment variables may be needed:
- `HF_TOKEN`: HuggingFace API token for gated model access and hub uploads
- `WANDB_API_KEY`: Weights & Biases API key for experiment tracking (optional)
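A minimal sketch of how these variables are typically consumed, using the standard `huggingface_hub` and `wandb` login helpers:

```python
import os
from huggingface_hub import login

# Gated checkpoints and hub uploads need a token with read (or write) scope.
login(token=os.environ["HF_TOKEN"])

# Optional: only needed when logging to Weights & Biases.
if os.environ.get("WANDB_API_KEY"):
    import wandb
    wandb.login(key=os.environ["WANDB_API_KEY"])
```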
Quick Install
```bash
# Install diffusers with training support
pip install "diffusers[training]" transformers accelerate peft datasets tensorboard

# For the LoRA fine-tuning example (extras are quoted so zsh does not glob them)
pip install "diffusers[torch]" transformers accelerate peft datasets safetensors
```
Code Evidence
PEFT backend auto-detection from `constants.py:50-61`:
```python
MIN_PEFT_VERSION = "0.6.0"
MIN_TRANSFORMERS_VERSION = "4.34.0"

_required_peft_version = is_peft_available() and version.parse(
    version.parse(importlib.metadata.version("peft")).base_version
) >= version.parse(MIN_PEFT_VERSION)
_required_transformers_version = is_transformers_available() and version.parse(
    version.parse(importlib.metadata.version("transformers")).base_version
) >= version.parse(MIN_TRANSFORMERS_VERSION)

USE_PEFT_BACKEND = _required_peft_version and _required_transformers_version
```
Accelerate version requirement for CPU offloading from `pipeline_utils.py:1202-1205`:
```python
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
    from accelerate import cpu_offload_with_hook
else:
    raise ImportError("`enable_model_cpu_offload` requires `accelerate v0.17.0` or higher.")
```
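In practice this guard is reached through `DiffusionPipeline.enable_model_cpu_offload()`. A minimal sketch (the checkpoint id is only an example):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # example checkpoint
    torch_dtype=torch.float16,
)
# Keeps each submodule on CPU until it is needed on the GPU; raises the
# ImportError above when accelerate < 0.17.0 is installed.
pipe.enable_model_cpu_offload()
```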
MPS training limitation from `torch_utils.py:31-40`:
```python
BACKEND_SUPPORTS_TRAINING = {
    "cuda": True,
    "xpu": True,
    "cpu": True,
    "mps": False,
    "default": True,
}
```
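The table translates into a simple device guard for custom training scripts. A sketch (the helper name is ours, not a Diffusers API):

```python
import torch

def pick_training_device() -> torch.device:
    # Mirrors BACKEND_SUPPORTS_TRAINING: cuda/xpu/cpu can train, mps cannot.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    if torch.backends.mps.is_available():
        raise RuntimeError("MPS is inference-only here; train on a CUDA or XPU device")
    return torch.device("cpu")
```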
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `enable_model_cpu_offload requires accelerate v0.17.0 or higher` | Old accelerate version | `pip install -U accelerate` |
| `Using bitsandbytes 4-bit quantization requires Accelerate >= 0.26.0` | Missing/old accelerate for quantized training | `pip install 'accelerate>=0.26.0'` |
| Training attempted on `mps` (`BACKEND_SUPPORTS_TRAINING["mps"] = False`) | Running a training script on Apple Silicon | Train on a CUDA/XPU GPU; MPS is inference-only |
Compatibility Notes
- Multi-GPU: Requires NCCL backend for distributed training. Set up via `accelerate config`.
- Mixed Precision: `fp16` and `bf16` are supported via Accelerate; `bf16` requires an Ampere+ GPU (A100/H100). A combined sketch follows this list.
- Gradient Checkpointing: Supported for all UNet/Transformer models to reduce VRAM.
- TF32: Examples enable TF32 for faster Ampere GPU training: `torch.backends.cuda.matmul.allow_tf32 = True`.
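A combined sketch of the notes above, assuming an Ampere-class GPU (the checkpoint id is only an example):

```python
import torch
from accelerate import Accelerator
from diffusers import UNet2DConditionModel

# TF32: faster fp32 matmuls on Ampere+ with negligible accuracy impact.
torch.backends.cuda.matmul.allow_tf32 = True

# bf16 needs Ampere or newer; use "fp16" on older GPUs.
accelerator = Accelerator(mixed_precision="bf16")

unet = UNet2DConditionModel.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet"  # example checkpoint
)
unet.enable_gradient_checkpointing()  # recompute activations to save VRAM
unet = accelerator.prepare(unet)
```

Multi-GPU runs then go through `accelerate launch` after answering the `accelerate config` prompts.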
Related Pages
- Implementation:Huggingface_Diffusers_Accelerator_Setup
- Implementation:Huggingface_Diffusers_LoRA_Training_Loop
- Implementation:Huggingface_Diffusers_LoRA_Training_Config
- Implementation:Huggingface_Diffusers_LoRA_Dataset_Pipeline
- Implementation:Huggingface_Diffusers_Log_Validation
- Implementation:Huggingface_Diffusers_DreamBooth_Training_Loop
- Implementation:Huggingface_Diffusers_DreamBooth_Dataset_Class
- Implementation:Huggingface_Diffusers_PeftAdapterMixin_Add_Adapter
- Implementation:Huggingface_Diffusers_Dual_LoRA_Add_Adapter