Workflow: AUTOMATIC1111 Stable Diffusion WebUI Hypernetwork Training
| Knowledge Sources | |
|---|---|
| Domains | Training, Stable_Diffusion, Hypernetworks, Fine_Tuning |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
End-to-end process for training hypernetwork modules that modify Stable Diffusion's cross-attention layers to learn new styles or concepts.
Description
This workflow trains a hypernetwork: a small auxiliary neural network that modifies the key and value projections in the cross-attention layers of the UNet. Unlike textual inversion which learns in embedding space, hypernetworks learn transformations in the model's internal representation space, providing stronger stylistic influence. The hypernetwork consists of fully-connected layers with configurable architecture (layer structure, activation functions, dropout, normalization). Once trained, the hypernetwork file can be loaded alongside any compatible checkpoint to apply the learned style.
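The transformation described above can be sketched as a small residual MLP applied to the cross-attention context. This is a minimal PyTorch illustration, not the webui's actual implementation; the class name, the `(1, 2, 1)` default, and the `strength` parameter are assumptions chosen to mirror the description.

```python
import torch
import torch.nn as nn

class HypernetworkModule(nn.Module):
    """Sketch of one hypernetwork module: a small MLP applied as a
    residual transform to a cross-attention context (keys or values).
    Hidden sizes follow multiplicative factors of the input dimension,
    e.g. (1, 2, 1) -> dim -> 2*dim -> dim."""

    def __init__(self, dim: int, layer_structure=(1, 2, 1)):
        super().__init__()
        sizes = [int(dim * m) for m in layer_structure]
        layers = []
        for i in range(len(sizes) - 1):
            layers.append(nn.Linear(sizes[i], sizes[i + 1]))
            if i < len(sizes) - 2:          # no activation on the output
                layers.append(nn.ReLU())
        self.linear = nn.Sequential(*layers)

    def forward(self, x, strength: float = 1.0):
        # Residual form: strength 0.0 leaves the context unchanged.
        return x + self.linear(x) * strength

# One (key, value) pair of modules per attention dimension:
modules = {dim: (HypernetworkModule(dim), HypernetworkModule(dim))
           for dim in (320, 640, 768, 1280)}
```

Because the module is residual, a freshly initialized or zero-strength hypernetwork leaves the base model's attention behavior intact.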
Usage
Execute this workflow when you want to teach Stable Diffusion a strong stylistic influence that textual inversion cannot adequately capture, such as a particular artist's rendering style, lighting characteristics, or compositional patterns. Hypernetworks provide more expressive power than embeddings but are larger files and specific to the model's attention dimensions.
Execution Steps
Step 1: Hypernetwork creation
Create a new hypernetwork by specifying its name, layer structure, activation function, weight initialization method, and optional features (layer normalization, dropout). The layer structure defines the multiplicative factors for hidden layer sizes relative to the input dimension (e.g., "1, 2, 1" expands the hidden layer to twice the input width). Select from activation functions including ReLU, LeakyReLU, ELU, Tanh, Sigmoid, or linear.
Key considerations:
- Layer structure "1, 2, 1" is a common starting configuration
- ReLU or LeakyReLU activations are typical choices for stability
- Dropout rate of 0.0-0.3 helps prevent overfitting on small datasets
- Weight initialization (Normal, Xavier, or Kaiming) affects training dynamics
- Enabled module sizes should match the model's attention dimensions (typically 320, 640, 768, and 1280 for SD 1.x)
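The layer-structure string and its parameter cost can be worked through in a few lines. This is an illustrative sketch; the function names are hypothetical, and the count covers only the fully-connected weights and biases, ignoring normalization layers.

```python
def parse_layer_structure(spec: str):
    """Parse a layer-structure string like "1, 2, 1" into the list of
    multiplicative factors used for hidden-layer sizes."""
    factors = [float(x) for x in spec.split(",")]
    if factors[0] != 1 or factors[-1] != 1:
        raise ValueError("layer structure must start and end with 1 "
                         "so input and output dimensions match")
    return factors

def linear_param_count(dim: int, factors) -> int:
    """Weights + biases of the fully-connected stack for one attention
    dimension (one module)."""
    sizes = [int(dim * f) for f in factors]
    return sum(sizes[i] * sizes[i + 1] + sizes[i + 1]
               for i in range(len(sizes) - 1))
```

For example, `linear_param_count(768, [1, 2, 1])` gives 2,361,600 parameters for a single 768-dimension module; doubling for key and value and summing over all enabled dimensions explains why hypernetwork files are far larger than embeddings.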
Step 2: Training dataset preparation
Prepare a directory of training images representing the target style or concept. Use prompt templates that describe the content of each image while incorporating varied contexts. The dataset module handles image loading, resizing, augmentation (random crop, horizontal flip), and latent space pre-encoding through the VAE. Per-image text captions can be provided via companion .txt files.
Key considerations:
- 20-100 images typically produce good style transfer results
- Images should showcase the desired style across varied subjects
- Higher resolution training images may be center-cropped to the training resolution
- The dataset supports caching VAE-encoded latents for faster training
- Template prompts should describe image content rather than the style itself
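Pairing images with companion `.txt` captions and a prompt template can be sketched as follows. The function name and the default template are assumptions for illustration; `[filewords]` is the template token the webui substitutes with per-image caption text.

```python
from pathlib import Path

def load_dataset_entries(image_dir: str,
                         template: str = "a photo of [filewords]"):
    """Collect (image_path, prompt) pairs. If a companion .txt file
    exists, its contents replace the [filewords] token in the template;
    otherwise the bare filename stem is used."""
    entries = []
    for img in sorted(Path(image_dir).glob("*")):
        if img.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
            continue
        txt = img.with_suffix(".txt")
        words = txt.read_text().strip() if txt.exists() else img.stem
        entries.append((img, template.replace("[filewords]", words)))
    return entries
```

Note the template describes the image content; the style being trained should be left out of the captions so the hypernetwork, not the prompt, absorbs it.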
Step 3: Training configuration
Configure training hyperparameters: learning rate (with optional piecewise schedule), batch size, gradient accumulation steps, total training steps, and optimizer selection. Set checkpoint save intervals and sample generation frequency. Configure the learning rate schedule as comma-separated "rate:step" pairs for annealing.
Key considerations:
- Typical learning rates range from 0.00001 to 0.0001
- Lower learning rates produce more stable but slower training
- Gradient accumulation increases effective batch size without additional VRAM
- Save checkpoints frequently to select the best training point
- Sample generation during training provides visual progress monitoring
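The comma-separated "rate:step" schedule can be parsed and evaluated with a short helper. This is a sketch under the assumption that each step value marks the last step at which its rate applies, and that a trailing rate with no step runs to the end; the function names are hypothetical.

```python
def parse_lr_schedule(spec: str):
    """Parse "5e-5:100, 5e-6:1000, 5e-7" into (rate, end_step) pairs;
    a missing step means "until the end of training"."""
    pairs = []
    for part in spec.split(","):
        if ":" in part:
            rate, step = part.split(":")
            pairs.append((float(rate), int(step)))
        else:
            pairs.append((float(part), None))
    return pairs

def lr_at_step(pairs, step: int) -> float:
    """Return the learning rate in effect at a given training step."""
    for rate, end in pairs:
        if end is None or step <= end:
            return rate
    return pairs[-1][0]
```

A schedule like this anneals from an aggressive early rate to a fine-tuning rate, which pairs well with frequent checkpoint saves for picking the best point.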
Step 4: Training loop execution
Execute the training loop. Each step: load a batch of training images, encode them to latents through the VAE, construct training prompts from the templates, and run a UNet forward pass with the hypernetwork modifying the cross-attention key and value projections to predict the added noise. Compute MSE loss between predicted and actual noise, backpropagate (gradients flow through the frozen UNet into the trainable hypernetwork), and update only the hypernetwork weights. The hypernetwork's forward method inserts additional fully-connected layers that transform the attention keys and values before the cross-attention computation.
Key considerations:
- Only the hypernetwork parameters are updated; UNet weights remain frozen
- The hypernetwork modifies cross-attention at multiple resolution levels
- Training typically runs for 5,000-50,000 steps depending on complexity
- Monitor both loss values and generated samples to assess convergence
- Gradient checkpointing can reduce memory usage at the cost of speed
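The freeze-and-update pattern above can be condensed into a single-step sketch. This is not the webui's training code: the noising is collapsed to a simple sum (the real pipeline noises latents per the diffusion schedule at timestep `t`), and the UNet is assumed to already route attention through the hypernetwork. Freezing falls out of giving the optimizer only the hypernetwork's parameters.

```python
import torch
import torch.nn.functional as F

def training_step(unet, optimizer, latents, cond, num_timesteps=1000):
    """One hypernetwork training step (sketch). Only the parameters
    registered with `optimizer` are updated, so the UNet stays frozen
    even though gradients flow through it."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, num_timesteps, (latents.shape[0],))
    # Real pipeline: latents are noised according to the schedule at
    # timestep t; collapsed to a plain sum for brevity.
    pred = unet(latents + noise, t, cond)
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Returning the scalar loss per step supports the loss-curve monitoring recommended above.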
Step 5: Evaluation and deployment
Evaluate the trained hypernetwork by generating images with various prompts while the hypernetwork is active. Test with different subjects and compositions to verify the style transfers consistently. The hypernetwork is saved as a .pt file that can be loaded from the Settings or applied via the extra networks UI. Adjust the hypernetwork strength (0.0-1.0) during inference to control the intensity of the style effect.
Key considerations:
- Hypernetwork strength controls how much influence the network has during generation
- Over-trained hypernetworks may override prompt content with training data patterns
- Hypernetwork files are larger than embeddings (typically 10-100 MB)
- Compatible with any checkpoint that shares the same attention dimensions
- The classic Settings loader selects only one hypernetwork at a time, whereas multiple LoRAs can be stacked freely
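The strength slider amounts to interpolating between the base attention context and the hypernetwork's transformed output. This is a minimal illustration with a hypothetical function name, shown on plain floats for clarity; in practice the blend runs on tensors inside the attention layers.

```python
def blend_with_strength(base, transformed, strength: float):
    """Linear interpolation between the base attention context and the
    hypernetwork-transformed context; a sketch of the strength control."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return [(1 - strength) * b + strength * t
            for b, t in zip(base, transformed)]
```

Sweeping strength across, say, 0.25 / 0.5 / 0.75 / 1.0 on a fixed seed is a quick way to find the point where the style lands without the over-training artifacts noted above.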