
Principle:Huggingface Diffusers Training Configuration

From Leeroopedia
Knowledge Sources
Domains Diffusion_Models, Optimization, Training_Pipelines
Last Updated 2026-02-13 21:00 GMT

Overview

Configuring the optimizer, learning rate scheduler, and distributed training preparation establishes the optimization strategy that controls how model parameters are updated during fine-tuning.

Description

Training configuration for LoRA fine-tuning involves three key components:

Optimizer selection: The AdamW optimizer is the standard choice for diffusion model training. It combines Adam's adaptive learning rates with decoupled weight decay regularization. For memory-constrained setups, 8-bit Adam (via bitsandbytes) provides roughly the same convergence with significantly reduced optimizer state memory.
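A minimal sketch of the optimizer choice described above. The `lora_layers` variable and all hyperparameter values here are illustrative placeholders; the 8-bit path assumes the bitsandbytes package is installed.

```python
import torch

# Stand-in for the trainable LoRA parameters (placeholder for illustration).
lora_layers = [torch.nn.Parameter(torch.zeros(4, 4))]

use_8bit_adam = False  # flip to True on memory-constrained GPUs
if use_8bit_adam:
    import bitsandbytes as bnb  # requires the bitsandbytes package
    optimizer_cls = bnb.optim.AdamW8bit
else:
    optimizer_cls = torch.optim.AdamW

optimizer = optimizer_cls(
    lora_layers,
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
    eps=1e-8,
)
```

Both optimizer classes share the same constructor arguments, so the rest of the training loop is unchanged whichever branch is taken.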

Learning rate scheduling: A learning rate scheduler controls how the learning rate changes over the course of training. Common schedules include constant (with optional warmup), cosine annealing, and linear decay. Warmup gradually increases the learning rate from zero to the target value over the first N steps, preventing early instability when the LoRA weights are still near their zero initialization.

Distributed preparation: The accelerator.prepare() call wraps the model, optimizer, dataloader, and learning rate scheduler with distributed training functionality: the model is wrapped in DistributedDataParallel (DDP), the dataloader is sharded across processes, and optimizer state is synchronized. After preparation, these objects can be used identically to their non-distributed counterparts.

Learning rate scaling: When using gradient accumulation or multiple GPUs, the effective batch size increases. The learning rate can optionally be scaled proportionally to maintain the same per-sample gradient magnitude:

scaled_lr = base_lr * gradient_accumulation_steps * batch_size * num_processes
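The scaling rule above is simple arithmetic; a small helper makes the computation explicit (the example values are illustrative only):

```python
def scale_learning_rate(base_lr, gradient_accumulation_steps,
                        batch_size, num_processes):
    """Scale the base LR by the effective batch-size multiplier."""
    return base_lr * gradient_accumulation_steps * batch_size * num_processes

# e.g. base LR 1e-4, 4 accumulation steps, per-device batch 2, 2 GPUs
scaled = scale_learning_rate(1e-4, 4, 2, 2)  # -> 1.6e-3
```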

Usage

Use this training configuration pattern when:

  • Setting up the optimization loop for LoRA fine-tuning
  • You need to choose between different learning rate schedules
  • Training on multiple GPUs with Accelerate
  • Memory is limited and you want to use 8-bit optimizers

Theoretical Basis

AdamW Optimizer

AdamW maintains per-parameter first and second moment estimates and applies decoupled weight decay:

m_t     = beta1 * m_{t-1} + (1 - beta1) * g_t      # first moment (mean)
v_t     = beta2 * v_{t-1} + (1 - beta2) * g_t^2    # second moment (uncentered variance)
m_hat   = m_t / (1 - beta1^t)                      # bias correction
v_hat   = v_t / (1 - beta2^t)                      # bias correction
theta_t = theta_{t-1} - lr * (m_hat / (sqrt(v_hat) + eps) + wd * theta_{t-1})

Default hyperparameters:
  beta1 = 0.9, beta2 = 0.999, eps = 1e-8, weight_decay = 1e-2
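The update rule above can be checked with a plain-Python implementation for a single scalar parameter. This is a teaching sketch of the math, not a replacement for torch.optim.AdamW:

```python
import math

def adamw_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """One AdamW update for a scalar parameter, following the equations
    above (decoupled weight decay). Returns the new (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# First step (t=1) with gradient 0.5 from theta = 1.0:
theta, m, v = adamw_step(1.0, 0.0, 0.0, g=0.5, t=1)
```

On the first step the bias correction exactly cancels the (1 - beta) factors, so m_hat equals the raw gradient and the parameter moves by roughly lr in the gradient's direction, plus the decoupled weight-decay term.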

Learning Rate Warmup

Warmup linearly increases the learning rate from 0 to the target over the first N steps:

if step < warmup_steps:
    lr = target_lr * (step / warmup_steps)
else:
    lr = schedule(step - warmup_steps)   # constant, cosine, linear, etc.

Warmup is important for LoRA training because the adapter weights start at zero (or near zero), meaning initial gradients can be noisy and large. A gradual learning rate ramp prevents divergence.
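The warmup logic above can be written as a small schedule function. Pairing linear warmup with cosine decay is one common combination; the post-warmup schedule is a free choice, and the function below is a sketch rather than the diffusers implementation:

```python
import math

def lr_at(step, target_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to target_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return target_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return target_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At step 0 the rate is exactly zero, it reaches target_lr at the end of warmup, and it decays smoothly to zero at total_steps.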

Accelerator Preparation

accelerator.prepare() transforms training objects for distributed execution:

model      -> DistributedDataParallel(model)   # gradient sync
dataloader -> sharded DataLoader               # data partitioning
optimizer  -> synchronized optimizer           # state sync
scheduler  -> step-adjusted scheduler          # accounts for num_processes

Related Pages

Implemented By
