Principle: Hugging Face Diffusers Training Configuration
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Optimization, Training_Pipelines |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Configuring the optimizer, learning rate scheduler, and distributed training preparation establishes the optimization strategy that controls how model parameters are updated during fine-tuning.
Description
Training configuration for LoRA fine-tuning involves three key components:
Optimizer selection: The AdamW optimizer is the standard choice for diffusion model training. It combines Adam's adaptive learning rates with decoupled weight decay regularization. For memory-constrained setups, 8-bit Adam (via bitsandbytes) provides roughly the same convergence with significantly reduced optimizer state memory.
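As a sketch of this choice, the helper below selects between standard AdamW and the bitsandbytes 8-bit variant. The `build_optimizer` name and `use_8bit_adam` flag are illustrative, not part of any library API:

```python
import torch

def build_optimizer(params, lr=1e-4, use_8bit_adam=False):
    """Create AdamW, falling back to bitsandbytes' 8-bit variant when requested."""
    if use_8bit_adam:
        # bitsandbytes stores optimizer state in 8 bits,
        # cutting optimizer-state memory roughly 4x.
        import bitsandbytes as bnb
        optimizer_cls = bnb.optim.AdamW8bit
    else:
        optimizer_cls = torch.optim.AdamW
    return optimizer_cls(
        params, lr=lr, betas=(0.9, 0.999), weight_decay=1e-2, eps=1e-8
    )
```

Both classes share the same constructor signature, so the rest of the training loop is unchanged regardless of which one is picked.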
Learning rate scheduling: A learning rate scheduler controls how the learning rate changes over the course of training. Common schedules include constant (with optional warmup), cosine annealing, and linear decay. Warmup gradually increases the learning rate from zero to the target value over the first N steps, preventing early instability when the LoRA weights are still near their zero initialization.
Distributed preparation: The accelerator.prepare() call wraps the model, optimizer, dataloader, and scheduler with distributed training functionality. This includes wrapping the model with DDP, sharding the dataloader across processes, and synchronizing optimizer states. After preparation, these objects can be used identically to their non-distributed counterparts.
Learning rate scaling: When using gradient accumulation or multiple GPUs, the effective batch size increases. The learning rate can optionally be scaled proportionally to maintain the same per-sample gradient magnitude:
scaled_lr = base_lr * gradient_accumulation_steps * batch_size * num_processes
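The scaling rule above is a one-liner; a small sketch with a hypothetical helper name:

```python
def scale_learning_rate(base_lr, gradient_accumulation_steps,
                        batch_size, num_processes):
    """Scale the LR linearly with the effective batch size."""
    return base_lr * gradient_accumulation_steps * batch_size * num_processes

# e.g. base LR 1e-4, 4 accumulation steps, per-device batch 2, 2 GPUs
scaled = scale_learning_rate(1e-4, 4, 2, 2)  # effective batch size 16
```

Linear scaling is a heuristic rather than a guarantee; for large effective batch sizes it is often paired with a longer warmup.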
Usage
Use this training configuration pattern when:
- Setting up the optimization loop for LoRA fine-tuning
- You need to choose between different learning rate schedules
- Training on multiple GPUs with Accelerate
- Memory is limited and you want to use 8-bit optimizers
Theoretical Basis
AdamW Optimizer
AdamW maintains per-parameter first and second moment estimates and applies decoupled weight decay:
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t # first moment (mean)
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 # second moment (variance)
m_hat = m_t / (1 - beta1^t) # bias correction
v_hat = v_t / (1 - beta2^t) # bias correction
theta_t = theta_{t-1} - lr * (m_hat / (sqrt(v_hat) + eps) + wd * theta_{t-1})
Default hyperparameters:
beta1 = 0.9, beta2 = 0.999, eps = 1e-8, weight_decay = 1e-2
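The update equations can be checked with a scalar-parameter sketch (the `adamw_step` function is illustrative; real training uses `torch.optim.AdamW`):

```python
import math

def adamw_step(theta, m, v, g, t, lr=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """One AdamW update for a single scalar parameter, mirroring the equations above."""
    m = beta1 * m + (1 - beta1) * g          # first moment (mean)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (variance)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)             # bias correction
    # Decoupled weight decay: wd * theta is added outside the adaptive term.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

Note that the weight-decay term multiplies the raw learning rate, not the adaptive `m_hat / sqrt(v_hat)` ratio; this decoupling is what distinguishes AdamW from Adam with L2 regularization.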
Learning Rate Warmup
Warmup linearly increases the learning rate from 0 to the target over the first N steps:
if step < warmup_steps:
    lr = target_lr * (step / warmup_steps)
else:
    lr = schedule(step - warmup_steps)  # constant, cosine, linear, etc.
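A concrete instance of this pattern, assuming cosine decay after the warmup phase (the `lr_at_step` name is illustrative; in practice `diffusers.optimization.get_scheduler` builds the equivalent scheduler):

```python
import math

def lr_at_step(step, target_lr, warmup_steps, total_steps):
    """Linear warmup to target_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return target_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return target_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

The LR ramps from 0 to `target_lr` over the first `warmup_steps`, then follows a half-cosine down to zero at `total_steps`.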
Warmup is important for LoRA training because the adapter weights start at zero (or near zero), meaning initial gradients can be noisy and large. A gradual learning rate ramp prevents divergence.
Accelerator Preparation
accelerator.prepare() transforms training objects for distributed execution:
model -> DistributedDataParallel(model) # gradient sync
dataloader -> sharded DataLoader # data partitioning
optimizer -> synchronized optimizer # state sync
scheduler -> step-adjusted scheduler # accounts for num_processes