Workflow: AUTOMATIC1111 Stable Diffusion WebUI Hypernetwork Training
| Knowledge Sources | |
|---|---|
| Domains | Training, Stable_Diffusion, Hypernetworks, Fine_Tuning |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
End-to-end process for training hypernetwork modules that modify Stable Diffusion's cross-attention layers to learn new styles or concepts.
Description
This workflow trains a hypernetwork: a small auxiliary neural network that modifies the key and value projections in the cross-attention layers of the UNet. Unlike textual inversion which learns in embedding space, hypernetworks learn transformations in the model's internal representation space, providing stronger stylistic influence. The hypernetwork consists of fully-connected layers with configurable architecture (layer structure, activation functions, dropout, normalization). Once trained, the hypernetwork file can be loaded alongside any compatible checkpoint to apply the learned style.
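The transformation described above can be sketched as a small residual MLP applied to the cross-attention context. This is a minimal PyTorch illustration, not the webui's actual implementation; the class name, the `(1, 2, 1)` default, and the `strength` parameter are assumptions chosen to mirror the description.

```python
import torch
import torch.nn as nn

class HypernetworkModule(nn.Module):
    """Sketch of one hypernetwork module: a small MLP applied as a
    residual transform to a cross-attention context (keys or values).
    Hidden sizes follow multiplicative factors of the input dimension,
    e.g. (1, 2, 1) -> dim -> 2*dim -> dim."""

    def __init__(self, dim: int, layer_structure=(1, 2, 1)):
        super().__init__()
        sizes = [int(dim * m) for m in layer_structure]
        layers = []
        for i in range(len(sizes) - 1):
            layers.append(nn.Linear(sizes[i], sizes[i + 1]))
            if i < len(sizes) - 2:          # no activation on the output
                layers.append(nn.ReLU())
        self.linear = nn.Sequential(*layers)

    def forward(self, x, strength: float = 1.0):
        # Residual form: strength 0.0 leaves the context unchanged.
        return x + self.linear(x) * strength

# One (key, value) pair of modules per attention dimension:
modules = {dim: (HypernetworkModule(dim), HypernetworkModule(dim))
           for dim in (320, 640, 768, 1280)}
```

Because the module is residual, a freshly initialized or zero-strength hypernetwork leaves the base model's attention behavior intact.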
Usage
Execute this workflow when you want to teach Stable Diffusion a strong stylistic influence that textual inversion cannot adequately capture, such as a particular artist's rendering style, lighting characteristics, or compositional patterns. Hypernetworks provide more expressive power than embeddings but are larger files and specific to the model's attention dimensions.
Execution Steps
Step 1: Hypernetwork creation
Create a new hypernetwork by specifying its name, layer structure, activation function, weight initialization method, and optional features (layer normalization, dropout). The layer structure defines the multiplicative factors for hidden layer sizes relative to the input dimension (e.g., "1, 2, 1" expands the hidden layer to twice the input width). Select from activation functions including ReLU, LeakyReLU, ELU, Tanh, Sigmoid, or linear.
Key considerations:
- Layer structure "1, 2, 1" is a common starting configuration
- ReLU or LeakyReLU activations are typical choices for stability
- Dropout rate of 0.0-0.3 helps prevent overfitting on small datasets
- Weight initialization (Normal, Xavier, or Kaiming) affects training dynamics
- Enabled module sizes should match the model's attention dimensions (typically 320, 640, 768, and 1280 for SD 1.x)
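The layer-structure string and its parameter cost can be worked through in a few lines. This is an illustrative sketch; the function names are hypothetical, and the count covers only the fully-connected weights and biases, ignoring normalization layers.

```python
def parse_layer_structure(spec: str):
    """Parse a layer-structure string like "1, 2, 1" into the list of
    multiplicative factors used for hidden-layer sizes."""
    factors = [float(x) for x in spec.split(",")]
    if factors[0] != 1 or factors[-1] != 1:
        raise ValueError("layer structure must start and end with 1 "
                         "so input and output dimensions match")
    return factors

def linear_param_count(dim: int, factors) -> int:
    """Weights + biases of the fully-connected stack for one attention
    dimension (one module)."""
    sizes = [int(dim * f) for f in factors]
    return sum(sizes[i] * sizes[i + 1] + sizes[i + 1]
               for i in range(len(sizes) - 1))
```

For example, `linear_param_count(768, [1, 2, 1])` gives 2,361,600 parameters for a single 768-dimension module; doubling for key and value and summing over all enabled dimensions explains why hypernetwork files are far larger than embeddings.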
Step 2: Training dataset preparation
Prepare a directory of training images representing the target style or concept. Use prompt templates that describe the content of each image while incorporating varied contexts. The dataset module handles image loading, resizing, augmentation (random crop, horizontal flip), and latent space pre-encoding through the VAE. Per-image text captions can be provided via companion .txt files.
Key considerations:
- 20-100 images typically produce good style transfer results
- Images should showcase the desired style across varied subjects
- Higher resolution training images may be center-cropped to the training resolution
- The dataset supports caching VAE-encoded latents for faster training
- Template prompts should describe image content rather than the style itself
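Pairing images with companion `.txt` captions and a prompt template can be sketched as follows. The function name and the default template are assumptions for illustration; `[filewords]` is the template token the webui substitutes with per-image caption text.

```python
from pathlib import Path

def load_dataset_entries(image_dir: str,
                         template: str = "a photo of [filewords]"):
    """Collect (image_path, prompt) pairs. If a companion .txt file
    exists, its contents replace the [filewords] token in the template;
    otherwise the bare filename stem is used."""
    entries = []
    for img in sorted(Path(image_dir).glob("*")):
        if img.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
            continue
        txt = img.with_suffix(".txt")
        words = txt.read_text().strip() if txt.exists() else img.stem
        entries.append((img, template.replace("[filewords]", words)))
    return entries
```

Note the template describes the image content; the style being trained should be left out of the captions so the hypernetwork, not the prompt, absorbs it.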
Step 3: Training configuration
Configure training hyperparameters: learning rate (with optional piecewise schedule), batch size, gradient accumulation steps, total training steps, and optimizer selection. Set checkpoint save intervals and sample generation frequency. Configure the learning rate schedule as comma-separated "rate:step" pairs for annealing.
Key considerations:
- Typical learning rates range from 0.00001 to 0.0001
- Lower learning rates produce more stable but slower training
- Gradient accumulation increases effective batch size without additional VRAM
- Save checkpoints frequently to select the best training point
- Sample generation during training provides visual progress monitoring
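The comma-separated "rate:step" schedule can be parsed and evaluated with a short helper. This is a sketch under the assumption that each step value marks the last step at which its rate applies, and that a trailing rate with no step runs to the end; the function names are hypothetical.

```python
def parse_lr_schedule(spec: str):
    """Parse "5e-5:100, 5e-6:1000, 5e-7" into (rate, end_step) pairs;
    a missing step means "until the end of training"."""
    pairs = []
    for part in spec.split(","):
        if ":" in part:
            rate, step = part.split(":")
            pairs.append((float(rate), int(step)))
        else:
            pairs.append((float(part), None))
    return pairs

def lr_at_step(pairs, step: int) -> float:
    """Return the learning rate in effect at a given training step."""
    for rate, end in pairs:
        if end is None or step <= end:
            return rate
    return pairs[-1][0]
```

A schedule like this anneals from an aggressive early rate to a fine-tuning rate, which pairs well with frequent checkpoint saves for picking the best point.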
Step 4: Training loop execution
Execute the training loop. Each step: load a batch of training images, encode them to latents through the VAE, construct training prompts from the templates, and run a UNet forward pass with the hypernetwork modifying the cross-attention key and value projections to predict the added noise. Compute MSE loss between predicted and actual noise, backpropagate (gradients flow through the frozen UNet into the trainable hypernetwork), and update only the hypernetwork weights. The hypernetwork's forward method inserts additional fully-connected layers that transform the attention keys and values before the cross-attention computation.
Key considerations:
- Only the hypernetwork parameters are updated; UNet weights remain frozen
- The hypernetwork modifies cross-attention at multiple resolution levels
- Training typically runs for 5,000-50,000 steps depending on complexity
- Monitor both loss values and generated samples to assess convergence
- Gradient checkpointing can reduce memory usage at the cost of speed
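The freeze-and-update pattern above can be condensed into a single-step sketch. This is not the webui's training code: the noising is collapsed to a simple sum (the real pipeline noises latents per the diffusion schedule at timestep `t`), and the UNet is assumed to already route attention through the hypernetwork. Freezing falls out of giving the optimizer only the hypernetwork's parameters.

```python
import torch
import torch.nn.functional as F

def training_step(unet, optimizer, latents, cond, num_timesteps=1000):
    """One hypernetwork training step (sketch). Only the parameters
    registered with `optimizer` are updated, so the UNet stays frozen
    even though gradients flow through it."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, num_timesteps, (latents.shape[0],))
    # Real pipeline: latents are noised according to the schedule at
    # timestep t; collapsed to a plain sum for brevity.
    pred = unet(latents + noise, t, cond)
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Returning the scalar loss per step supports the loss-curve monitoring recommended above.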
Step 5: Evaluation and deployment
Evaluate the trained hypernetwork by generating images with various prompts while the hypernetwork is active. Test with different subjects and compositions to verify the style transfers consistently. The hypernetwork is saved as a .pt file that can be loaded from the Settings or applied via the extra networks UI. Adjust the hypernetwork strength (0.0-1.0) during inference to control the intensity of the style effect.
Key considerations:
- Hypernetwork strength controls how much influence the network has during generation
- Over-trained hypernetworks may override prompt content with training data patterns
- Hypernetwork files are larger than embeddings (typically 10-100 MB)
- Compatible with any checkpoint that shares the same attention dimensions
- The classic Settings loader selects only one hypernetwork at a time, whereas multiple LoRAs can be stacked freely
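The strength slider amounts to interpolating between the base attention context and the hypernetwork's transformed output. This is a minimal illustration with a hypothetical function name, shown on plain floats for clarity; in practice the blend runs on tensors inside the attention layers.

```python
def blend_with_strength(base, transformed, strength: float):
    """Linear interpolation between the base attention context and the
    hypernetwork-transformed context; a sketch of the strength control."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return [(1 - strength) * b + strength * t
            for b, t in zip(base, transformed)]
```

Sweeping strength across, say, 0.25 / 0.5 / 0.75 / 1.0 on a fixed seed is a quick way to find the point where the style lands without the over-training artifacts noted above.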