Principle:AUTOMATIC1111 Stable diffusion webui Hypernetwork training configuration

Knowledge Sources	Decoupled Weight Decay Regularization (AdamW); Cyclical Learning Rates for Training Neural Networks
Domains	Deep Learning, Optimization, Training Configuration
Last Updated	2026-02-08 00:00 GMT

Overview

Hypernetwork training configuration encompasses the selection and scheduling of optimizers, learning rates, and gradient clipping strategies that govern how the auxiliary hypernetwork weights are updated during training within a diffusion model framework.

Description

Training a hypernetwork involves optimizing the weights of small MLP modules that modify cross-attention behavior, while the base diffusion model remains frozen. The training configuration must address several concerns that differ from standard model training:

Optimizer Selection:

Hypernetwork training uses a full optimizer (typically AdamW) that maintains momentum and adaptive learning rate state for all hypernetwork parameters. This differs from textual inversion, which optimizes a single embedding vector.
The optimizer operates on the collected .weights() of the hypernetwork, which includes all Linear and LayerNorm parameters across all paired modules.
Any PyTorch optimizer class (from torch.optim) can be used, with the choice stored in the checkpoint for resumption.

Learning Rate Scheduling:

A step-based schedule allows different learning rates for different training phases, specified as a comma-separated string (e.g., "0.001:100, 0.00001:1000, 1e-5:10000").
Each segment defines a rate and the step at which to transition to the next rate.
The scheduler integrates with the optimizer by setting the 'lr' entry of every parameter group in optimizer.param_groups.
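The optimizer integration described above can be sketched as follows. This is a minimal illustration, not the webui's code: apply_learn_rate is a hypothetical helper name, and a SimpleNamespace stands in for a real torch.optim optimizer so the snippet runs without PyTorch. The only property it relies on is that param_groups is a list of dictionaries, each with its own "lr" key.

```python
from types import SimpleNamespace


def apply_learn_rate(optimizer, learn_rate):
    """Set the same learning rate on every parameter group (hypothetical helper)."""
    # param_groups is a list of dicts; each group carries its own "lr" entry.
    for group in optimizer.param_groups:
        group["lr"] = learn_rate


# Stand-in object with the same param_groups layout as a torch.optim optimizer.
optimizer = SimpleNamespace(param_groups=[{"lr": 0.005}, {"lr": 0.005}])

# Called whenever the schedule moves to a new phase.
apply_learn_rate(optimizer, 0.0005)
```

Writing the rate into every group keeps all hypernetwork parameters on the same schedule, even if the optimizer was built with more than one group.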

Gradient Clipping:

Two gradient clipping strategies are available: value clipping (clip_grad_value_) and norm clipping (clip_grad_norm_).
The clipping threshold can follow its own schedule, allowing aggressive clipping early in training and relaxation later.
Gradient clipping is critical for hypernetwork training stability because the residual architecture can amplify gradients through the chain of attention layers.

Usage

Use hypernetwork training configuration when:

You are setting up a training run for hypernetwork modules.
You need multi-phase learning rate schedules for long training runs.
Training is unstable and needs gradient clipping to prevent divergence.

Theoretical Basis

AdamW for Hypernetwork Parameters

The default optimizer is AdamW, which applies decoupled weight decay regularization:

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t           # First moment
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2          # Second moment
m_hat = m_t / (1 - beta1^t)                           # Bias correction
v_hat = v_t / (1 - beta2^t)                           # Bias correction
theta_t = theta_{t-1} - lr * (m_hat / (sqrt(v_hat) + eps) + wd * theta_{t-1})

AdamW is preferred over standard Adam for hypernetwork training because the decoupled weight decay provides better regularization, preventing the small hypernetwork weights from growing unbounded.
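To make the update rule concrete, the equations above can be traced for a single scalar parameter. The sketch below is illustrative only: adamw_step is a hypothetical name, and the hyperparameter defaults are assumptions chosen for the example rather than values taken from the webui.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)                # bias correction
    # The decay term wd * theta is added outside the adaptive ratio,
    # which is what "decoupled" refers to.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v


# First step (t = 1) from theta = 1.0 with gradient 0.5.
theta, m, v = adamw_step(1.0, grad=0.5, m=0.0, v=0.0, t=1)

# With a zero gradient only the weight decay term moves the parameter.
decayed, _, _ = adamw_step(1.0, grad=0.0, m=0.0, v=0.0, t=1)
```

In the second call the parameter shrinks by the factor (1 - lr * wd) even though the gradient is zero, which shows the decay acting independently of the adaptive gradient scaling.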

Step-Based Learning Rate Scheduling

The schedule is parsed from a string format that supports multiple phases:

"0.005:100, 0.0005:1000, 0.00005:5000"

Phase 1: lr = 0.005  for steps [0, 100)
Phase 2: lr = 0.0005 for steps [100, 1000)
Phase 3: lr = 0.00005 for steps [1000, 5000)

A single value without a colon (e.g., "0.001") applies the same rate for the entire training duration. The special step value -1 means "until the end of training."
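The parsing rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the LearnRateScheduler implementation: parse_schedule and rate_at are hypothetical names, and max_steps is assumed to be the total number of training steps.

```python
def parse_schedule(text, max_steps):
    """Turn "rate:step, rate:step, ..." into a list of (rate, end_step) pairs."""
    phases = []
    for chunk in text.split(","):
        parts = chunk.strip().split(":")
        rate = float(parts[0])
        # No colon, or an end step of -1, means "until the end of training".
        end = int(parts[1]) if len(parts) == 2 else -1
        if end == -1 or end > max_steps:
            end = max_steps
        phases.append((rate, end))
        if end == max_steps:
            break  # later phases could never be reached
    return phases


def rate_at(phases, step):
    """Return the rate for a step, or None once the schedule is exhausted."""
    for rate, end in phases:
        if step < end:
            return rate
    return None


phases = parse_schedule("0.005:100, 0.0005:1000, 0.00005:5000", max_steps=5000)
first = rate_at(phases, 99)    # still in phase 1
second = rate_at(phases, 100)  # first step of phase 2
```

Each phase is stored with its exclusive end step, so looking up a rate reduces to finding the first phase whose end step lies beyond the current step.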

Gradient Clipping Strategies

Value clipping limits each gradient component individually:

g_i = clamp(g_i, -clip_value, clip_value)

Norm clipping scales the entire gradient vector if its norm exceeds the threshold:

if ||g|| > max_norm:
    g = g * (max_norm / ||g||)

Norm clipping preserves the direction of the gradient while controlling its magnitude, making it generally preferred for stable convergence in hypernetwork training.
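The difference between the two strategies shows up on a small example. The sketch below works on plain Python lists so it runs without PyTorch; clip_by_value and clip_by_norm are hypothetical names that mirror the formulas above, whereas the PyTorch utilities named earlier operate in place on parameter gradients.

```python
import math


def clip_by_value(grads, clip_value):
    """Clamp every component to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the whole vector when its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)


grads = [3.0, -4.0]                   # L2 norm is 5
by_value = clip_by_value(grads, 1.0)  # components clamped independently
by_norm = clip_by_norm(grads, 1.0)    # rescaled to unit norm
```

Value clipping maps the 3:4 ratio between the components to 1:1, changing the update direction, while norm clipping keeps the ratio and only shortens the vector.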

Optimizer State Persistence

The optimizer state dictionary (momentum buffers and second-moment estimates) can be saved alongside the hypernetwork checkpoint in a separate .optim file. This enables exact resumption of training: without the saved state, the optimizer restarts from empty moment estimates, which can disturb convergence. A hash verification ensures the optimizer state matches the hypernetwork weights before loading.
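The save-and-verify pattern can be sketched with the standard library alone. Everything here is an assumption made for illustration: the function names, the file layout, and the choice of a truncated SHA-256 fingerprint are hypothetical and are not taken from the webui, which stores real tensors rather than Python lists.

```python
import hashlib
import os
import pickle
import tempfile


def short_hash(weights):
    """Hypothetical fingerprint: first 8 hex characters of SHA-256 over the weights."""
    return hashlib.sha256(repr(weights).encode("utf-8")).hexdigest()[:8]


def save_optimizer_state(path, weights, optimizer_state):
    """Store the optimizer state together with a fingerprint of the weights."""
    with open(path, "wb") as f:
        pickle.dump({"hash": short_hash(weights), "state": optimizer_state}, f)


def load_optimizer_state(path, weights):
    """Return the saved state, or None when it belongs to different weights."""
    with open(path, "rb") as f:
        saved = pickle.load(f)
    if saved.get("hash") != short_hash(weights):
        return None
    return saved["state"]


weights = [0.1, -0.2, 0.3]
state = {"step": 100, "exp_avg": [0.01, 0.0, -0.02]}
path = os.path.join(tempfile.mkdtemp(), "example.pt.optim")

save_optimizer_state(path, weights, state)
restored = load_optimizer_state(path, weights)          # fingerprint matches
mismatch = load_optimizer_state(path, [9.9, 9.9, 9.9])  # different weights
```

Returning None on a mismatch lets the caller fall back to a freshly initialized optimizer instead of applying moment estimates that were computed for other weights.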

Related Pages

Implemented By

Implementation:AUTOMATIC1111_Stable_diffusion_webui_LearnRateScheduler_for_hypernetwork
