
Implementation:Huggingface Diffusers LoRA Training Config

From Leeroopedia
Knowledge Sources
Domains: Diffusion_Models, Optimization, Training_Pipelines
Last Updated: 2026-02-13 21:00 GMT

Overview

A concrete pattern for configuring the optimizer, learning rate scheduler, and distributed training preparation for LoRA fine-tuning of diffusion models, as implemented in the Diffusers training examples.

Description

This pattern configures the full optimization stack for LoRA training. First, trainable parameters are extracted from the UNet (only the LoRA layers have requires_grad=True). The AdamW optimizer is initialized with these parameters and configurable hyperparameters. Optionally, 8-bit Adam from bitsandbytes can be used for memory savings. A learning rate scheduler is created using Diffusers' get_scheduler utility, which supports constant, linear, cosine, and polynomial schedules with optional warmup. Finally, accelerator.prepare() wraps the model, optimizer, dataloader, and scheduler for distributed training.
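The optimizer-selection step described above can be sketched as follows. `make_optimizer` is a hypothetical helper name introduced here for illustration, but the `bitsandbytes` fallback mirrors the flag handling in the training script:

```python
import torch

def make_optimizer(params, lr, use_8bit_adam=False,
                   betas=(0.9, 0.999), weight_decay=1e-2, eps=1e-8):
    """Pick standard AdamW or bitsandbytes 8-bit AdamW for the LoRA params."""
    if use_8bit_adam:
        try:
            import bitsandbytes as bnb  # optional dependency
        except ImportError as exc:
            raise ImportError(
                "use_8bit_adam=True requires `pip install bitsandbytes`"
            ) from exc
        optimizer_cls = bnb.optim.AdamW8bit
    else:
        optimizer_cls = torch.optim.AdamW
    return optimizer_cls(params, lr=lr, betas=betas,
                         weight_decay=weight_decay, eps=eps)
```

Both optimizer classes share the same constructor arguments, so the flag only swaps the class; the rest of the training loop is unchanged.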

The training step count is computed based on the dataset size, number of epochs, gradient accumulation steps, and number of processes. When max_train_steps is not explicitly set, it is derived from num_train_epochs. The scheduler's warmup and total steps are scaled by accelerator.num_processes to account for distributed training.
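The step arithmetic above is plain integer math and can be sketched as a standalone function (the function name is illustrative; the formulas follow the script's logic):

```python
import math

def compute_schedule_steps(num_batches, num_train_epochs,
                           gradient_accumulation_steps, lr_warmup_steps,
                           num_processes, max_train_steps=None):
    """Derive max_train_steps and the scheduler's warmup/total step counts."""
    # One optimizer update per `gradient_accumulation_steps` batches.
    updates_per_epoch = math.ceil(num_batches / gradient_accumulation_steps)
    # If max_train_steps is unset, derive it from the epoch count.
    if max_train_steps is None:
        max_train_steps = num_train_epochs * updates_per_epoch
    # The scheduler is stepped once per process per update, so both
    # counts are scaled by the number of processes.
    num_warmup_steps = lr_warmup_steps * num_processes
    num_training_steps = max_train_steps * num_processes
    return max_train_steps, num_warmup_steps, num_training_steps
```

For example, 1000 batches per epoch with gradient accumulation of 4 gives 250 updates per epoch; over 10 epochs on 2 processes that is 2500 training steps, with scheduler counts doubled to 1000 warmup and 5000 total.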

Usage

Use this pattern when:

  • Setting up the optimizer for LoRA fine-tuning
  • Configuring a learning rate schedule with warmup
  • Preparing models and data for distributed training with Accelerate
  • You need support for 8-bit Adam to reduce memory usage

Code Reference

Source Location

  • Repository: diffusers
  • File: examples/text_to_image/train_text_to_image_lora.py
  • Lines: 578-788

Signature

# Optimizer initialization
optimizer = torch.optim.AdamW(
    lora_layers,
    lr=args.learning_rate,
    betas=(args.adam_beta1, args.adam_beta2),
    weight_decay=args.adam_weight_decay,
    eps=args.adam_epsilon,
)

# Learning rate scheduler
lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps_for_scheduler,
    num_training_steps=num_training_steps_for_scheduler,
)

# Distributed preparation
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler
)

Import

import torch
from diffusers.optimization import get_scheduler

I/O Contract

Inputs

Name Type Required Description
lora_layers iterator Yes Iterator over trainable parameters (filtered by requires_grad).
learning_rate float Yes Base learning rate. Typical values for LoRA: 1e-4 to 1e-3.
adam_beta1 float No First moment decay rate. Default: 0.9.
adam_beta2 float No Second moment decay rate. Default: 0.999.
adam_weight_decay float No Decoupled weight decay coefficient. Default: 1e-2.
adam_epsilon float No Numerical stability term. Default: 1e-8.
lr_scheduler str No Schedule type: "constant", "constant_with_warmup", "linear", "cosine", "cosine_with_restarts", "polynomial". Default: "constant".
lr_warmup_steps int No Number of warmup steps for the scheduler. Default: 0.
use_8bit_adam bool No Use bitsandbytes 8-bit Adam for reduced memory usage. Default: False.
scale_lr bool No Scale learning rate by gradient accumulation steps, batch size, and number of processes. Default: False.

Outputs

Name Type Description
unet torch.nn.Module (DDP-wrapped) UNet model wrapped for distributed training.
optimizer torch.optim.Optimizer Configured optimizer for LoRA parameters.
train_dataloader DataLoader DataLoader with distributed data sharding.
lr_scheduler LRScheduler Learning rate scheduler synchronized with the optimizer.

Usage Examples

Basic Usage

import torch
from diffusers.optimization import get_scheduler

# Collect trainable LoRA parameters
lora_layers = filter(lambda p: p.requires_grad, unet.parameters())

# Optional: scale learning rate for effective batch size
learning_rate = 1e-4
if scale_lr:
    learning_rate = (
        learning_rate * gradient_accumulation_steps
        * train_batch_size * accelerator.num_processes
    )

# Initialize optimizer
optimizer = torch.optim.AdamW(
    lora_layers,
    lr=learning_rate,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
    eps=1e-8,
)

# Compute training steps
num_warmup_steps = 500 * accelerator.num_processes
num_training_steps = num_epochs * steps_per_epoch * accelerator.num_processes

# Create learning rate scheduler
lr_scheduler = get_scheduler(
    "constant_with_warmup",
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

# Prepare for distributed training
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler
)

Related Pages

Implements Principle

Requires Environment
