Implementation: Hugging Face Diffusers LoRA Training Config
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Optimization, Training_Pipelines |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Concrete tool for configuring the optimizer, learning rate scheduler, and distributed training preparation for LoRA fine-tuning of diffusion models, as implemented in the Diffusers training examples.
Description
This pattern configures the full optimization stack for LoRA training. First, trainable parameters are extracted from the UNet (only the LoRA layers have requires_grad=True). The AdamW optimizer is initialized with these parameters and configurable hyperparameters. Optionally, 8-bit Adam from bitsandbytes can be used for memory savings. A learning rate scheduler is created using Diffusers' get_scheduler utility, which supports constant, linear, cosine, and polynomial schedules with optional warmup. Finally, accelerator.prepare() wraps the model, optimizer, dataloader, and scheduler for distributed training.
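The optimizer setup described above can be sketched as follows. This is a minimal illustration, not the training script itself: a small `torch.nn.Linear` stands in for the UNet, and the `use_8bit_adam` flag mirrors the optional bitsandbytes path.

```python
import torch

# Tiny stand-in module; in the real script this is the UNet with LoRA adapters.
model = torch.nn.Linear(4, 4)

# Only parameters with requires_grad=True reach the optimizer
# (for LoRA training, that is just the adapter weights).
trainable_params = [p for p in model.parameters() if p.requires_grad]

use_8bit_adam = False  # flip to True when bitsandbytes is installed
if use_8bit_adam:
    import bitsandbytes as bnb  # optional dependency
    optimizer_cls = bnb.optim.AdamW8bit
else:
    optimizer_cls = torch.optim.AdamW

optimizer = optimizer_cls(
    trainable_params,
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
    eps=1e-8,
)
```

The class-selection pattern keeps the hyperparameter block identical for both optimizers, since `AdamW8bit` accepts the same arguments as `torch.optim.AdamW`.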
The training step count is computed based on the dataset size, number of epochs, gradient accumulation steps, and number of processes. When max_train_steps is not explicitly set, it is derived from num_train_epochs. The scheduler's warmup and total steps are scaled by accelerator.num_processes to account for distributed training.
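The step-count arithmetic can be made concrete with a short sketch. All values below are hypothetical; the variable names follow the conventions used in the Diffusers training examples.

```python
import math

# Hypothetical configuration for illustration.
dataset_size = 10_000
train_batch_size = 4            # per-process batch size
gradient_accumulation_steps = 2
num_processes = 2               # accelerator.num_processes
num_train_epochs = 3
lr_warmup_steps = 500

# Optimizer updates per epoch: each process consumes train_batch_size samples
# per forward pass, and one update happens every gradient_accumulation_steps passes.
steps_per_epoch = math.ceil(
    dataset_size / (train_batch_size * num_processes * gradient_accumulation_steps)
)

# When max_train_steps is not set explicitly, derive it from the epoch count.
max_train_steps = num_train_epochs * steps_per_epoch

# The scheduler steps once per process per update, so its warmup and total
# horizons are scaled by the process count.
num_warmup_steps_for_scheduler = lr_warmup_steps * num_processes
num_training_steps_for_scheduler = max_train_steps * num_processes
```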
Usage
Use this pattern when:
- Setting up the optimizer for LoRA fine-tuning
- Configuring a learning rate schedule with warmup
- Preparing models and data for distributed training with Accelerate
- Reducing memory usage with 8-bit Adam
Code Reference
Source Location
- Repository: diffusers
- File: examples/text_to_image/train_text_to_image_lora.py
- Lines: 578-788
Signature
# Optimizer initialization
optimizer = torch.optim.AdamW(
    lora_layers,
    lr=args.learning_rate,
    betas=(args.adam_beta1, args.adam_beta2),
    weight_decay=args.adam_weight_decay,
    eps=args.adam_epsilon,
)

# Learning rate scheduler
lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps_for_scheduler,
    num_training_steps=num_training_steps_for_scheduler,
)

# Distributed preparation
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler
)
Import
import torch
from diffusers.optimization import get_scheduler
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| lora_layers | iterator | Yes | Iterator over trainable parameters (filtered by requires_grad). |
| learning_rate | float | Yes | Base learning rate. Typical values for LoRA: 1e-4 to 1e-3. |
| adam_beta1 | float | No | First moment decay rate. Default: 0.9. |
| adam_beta2 | float | No | Second moment decay rate. Default: 0.999. |
| adam_weight_decay | float | No | Decoupled weight decay coefficient. Default: 1e-2. |
| adam_epsilon | float | No | Numerical stability term. Default: 1e-8. |
| lr_scheduler | str | No | Schedule type: "constant", "constant_with_warmup", "linear", "cosine", "cosine_with_restarts", "polynomial". Default: "constant". |
| lr_warmup_steps | int | No | Number of warmup steps for the scheduler. Default: 0. |
| use_8bit_adam | bool | No | Use bitsandbytes 8-bit Adam for reduced memory usage. Default: False. |
| scale_lr | bool | No | Scale learning rate by gradient accumulation steps, batch size, and number of processes. Default: False. |
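The scale_lr option applies the linear scaling rule: the base learning rate is multiplied by the effective batch-size factors. A quick arithmetic sketch with hypothetical values:

```python
# Effective learning-rate scaling when scale_lr is enabled.
# All values below are illustrative, not recommendations.
base_lr = 1e-4
gradient_accumulation_steps = 4
train_batch_size = 2   # per-process batch size
num_processes = 2

# base_lr scaled by the effective global batch size relative to one process.
scaled_lr = (
    base_lr * gradient_accumulation_steps * train_batch_size * num_processes
)
# 1e-4 * 4 * 2 * 2 = 1.6e-3
```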
Outputs
| Name | Type | Description |
|---|---|---|
| unet | torch.nn.Module (DDP-wrapped) | UNet model wrapped for distributed training. |
| optimizer | torch.optim.Optimizer | Configured optimizer for LoRA parameters. |
| train_dataloader | DataLoader | DataLoader with distributed data sharding. |
| lr_scheduler | LRScheduler | Learning rate scheduler synchronized with the optimizer. |
Usage Examples
Basic Usage
import torch
from diffusers.optimization import get_scheduler

# Collect trainable LoRA parameters
lora_layers = filter(lambda p: p.requires_grad, unet.parameters())

# Optional: scale learning rate for effective batch size
learning_rate = 1e-4
if scale_lr:
    learning_rate = (
        learning_rate * gradient_accumulation_steps
        * train_batch_size * accelerator.num_processes
    )

# Initialize optimizer
optimizer = torch.optim.AdamW(
    lora_layers,
    lr=learning_rate,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
    eps=1e-8,
)

# Compute training steps
num_warmup_steps = 500 * accelerator.num_processes
num_training_steps = num_epochs * steps_per_epoch * accelerator.num_processes

# Create learning rate scheduler
lr_scheduler = get_scheduler(
    "constant_with_warmup",
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

# Prepare for distributed training
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler
)