
Workflow:AUTOMATIC1111 Stable diffusion webui Textual inversion training

From Leeroopedia


Knowledge Sources
Domains Training, Stable_Diffusion, Textual_Inversion, Fine_Tuning
Last Updated 2026-02-08 08:00 GMT

Overview

End-to-end process for training textual inversion embeddings that teach Stable Diffusion new concepts, subjects, or styles through learnable token vectors.

Description

This workflow trains a textual inversion embedding: a small set of learnable vectors in the CLIP text encoder's embedding space that represent a new concept (a specific person, object, or artistic style). The training process optimizes these vectors so that when used in a prompt, they guide the diffusion model to produce images matching the training data. The trained embedding is a lightweight file (typically a few KB) that can be shared and used with any compatible checkpoint without modifying the base model.

Usage

Execute this workflow when you want to teach Stable Diffusion a new concept that it cannot reproduce from text descriptions alone, such as a specific person's likeness, a particular art style, or a custom object. Textual inversion is appropriate when you have a small image set (roughly 5-20 images, though as few as 3 can work) and want a portable, lightweight result.

Execution Steps

Step 1: Embedding creation

Create a new embedding by specifying a unique token name, the number of vectors per token, and an initialization strategy. The token name becomes the keyword used in prompts to invoke the concept. More vectors per token provide greater representational capacity but require more training data. Initialization can use an existing word's vectors as a starting point (e.g., initializing a person embedding from "person") or random initialization.

Key considerations:

  • Token names should be unique and not conflict with existing vocabulary words
  • 1-4 vectors per token is typical; more vectors increase capacity but slow convergence
  • Initializing from a semantically related word speeds up training
  • The embedding is tied to a specific model architecture (SD 1.x vs SDXL) by vector dimension
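The creation step can be sketched as follows. This is a minimal, dependency-free illustration: `VOCAB`, `make_word_vector`, and the 768-entry lists standing in for CLIP embedding rows are hypothetical stand-ins, not the webui's actual data structures. Real initialization would look the word up in the CLIP tokenizer and copy the corresponding row(s) of the embedding table.

```python
import random

EMBED_DIM = 768  # SD 1.x CLIP text-encoder width; SDXL embeddings differ

def make_word_vector(seed):
    # Stand-in for a row of the CLIP embedding table (hypothetical).
    rng = random.Random(seed)
    return [rng.gauss(0, 0.02) for _ in range(EMBED_DIM)]

# Tiny stand-in vocabulary; the real one is the CLIP tokenizer's table.
VOCAB = {"person": make_word_vector(1), "painting": make_word_vector(2)}

def create_embedding(token_name, num_vectors=1, init_word=None):
    """Create learnable vectors for a new token, optionally copying an
    existing word's vector as the starting point."""
    if token_name in VOCAB:
        # Token names must not collide with existing vocabulary words.
        raise ValueError(f"token '{token_name}' collides with the vocabulary")
    if init_word is not None:
        base = VOCAB[init_word]
        vectors = [list(base) for _ in range(num_vectors)]
    else:
        vectors = [make_word_vector(None) for _ in range(num_vectors)]
    return {"name": token_name, "vectors": vectors}

emb = create_embedding("my-subject", num_vectors=2, init_word="person")
```

Initializing both vectors from "person" gives training a semantically sensible starting point, per the consideration above; random initialization is the fallback when no related word exists.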

Step 2: Training dataset preparation

Prepare a directory of training images that represent the target concept. Images should be consistent, well-cropped, and varied in pose, lighting, and context. A prompt template file maps each training image to a text prompt containing the embedding token and contextual descriptions. Templates support variable substitution for the embedding name and optional per-image captions from text files.

Key considerations:

  • 5-20 high-quality images typically produce good results
  • Images should be cropped to the training resolution (typically 512x512 for SD 1.x)
  • Template files control the training prompt structure (e.g., "a photo of [name]" for subjects)
  • Per-image caption files (same name as image with .txt extension) enable varied descriptions
  • The preprocessing tab can help with cropping, captioning, and augmentation
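The template substitution described above can be sketched like this. The `[name]` and `[filewords]` placeholders follow the webui's template convention; `build_prompt` itself and the in-memory `captions` dict are illustrative stand-ins (in practice, captions are read from `.txt` files sitting next to each image).

```python
from pathlib import PurePosixPath

def build_prompt(template, embedding_name, image_path, captions):
    """Fill a training-prompt template for one image.

    [name]      -> the embedding token
    [filewords] -> the per-image caption (contents of the matching .txt file)
    """
    stem = PurePosixPath(image_path).stem
    caption = captions.get(stem, "")
    return (template
            .replace("[name]", embedding_name)
            .replace("[filewords]", caption))

# Captions would normally be loaded from .txt files next to each image.
captions = {"img001": "standing in a garden, afternoon light"}
prompt = build_prompt("a photo of [name], [filewords]",
                      "dataset/img001.png" and "my-subject", "dataset/img001.png", captions)
```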

Step 3: Training hyperparameter configuration

Configure the training parameters: learning rate (with optional piecewise schedule), batch size, gradient accumulation steps, total training steps, and checkpoint save interval. Set the learning rate schedule as a string of "rate:step" pairs for piecewise decay. Configure image augmentation options (horizontal flip, random crop) and VAE latent caching behavior.

Key considerations:

  • A typical learning rate starts at 0.005 and decreases over training
  • Piecewise schedules allow ramping down the learning rate at specific steps
  • Gradient accumulation effectively increases batch size without additional memory
  • Saving checkpoints at regular intervals enables selecting the best training point
  • Latent caching pre-encodes images through the VAE once, speeding up subsequent epochs
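The piecewise schedule string can be parsed with a few lines of Python. This sketch is loosely modeled on the webui's "rate:step" syntax (each pair applies until the given step; a bare trailing rate applies for the rest of training); the function names are illustrative.

```python
def parse_schedule(spec):
    """Parse a piecewise learning-rate string, e.g. "0.005:100, 0.001:1000, 0.0001".

    Returns a list of (rate, until_step) pairs; until_step is None for a
    bare trailing rate that holds for the remainder of training.
    """
    pairs = []
    for part in spec.split(","):
        part = part.strip()
        if ":" in part:
            rate, until = part.split(":")
            pairs.append((float(rate), int(until)))
        else:
            pairs.append((float(part), None))
    return pairs

def lr_at_step(pairs, step):
    """Return the learning rate in effect at a given training step."""
    for rate, until in pairs:
        if until is None or step <= until:
            return rate
    return pairs[-1][0]  # past the last bounded segment: hold the final rate
```

For example, `"0.005:100, 0.001:1000, 0.0001"` trains at 0.005 for the first 100 steps, drops to 0.001 through step 1000, then holds 0.0001 until training ends.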

Step 4: Training loop execution

Execute the training loop. For each step: sample a batch of training images, apply augmentation, encode through the VAE to get latents, construct the training prompt with the embedding token, encode the prompt through CLIP (with the learnable embedding injected), run a forward pass through the UNet to predict noise, compute the MSE loss between predicted and actual noise, backpropagate gradients, and update only the embedding vectors via the optimizer. Periodically generate sample images and save checkpoint embeddings.

Key considerations:

  • Only the embedding vectors are updated; the model weights remain frozen
  • Training typically runs for 1,000-10,000 steps depending on concept complexity
  • Sample images generated during training help assess convergence visually
  • The loss value should generally decrease but may plateau; visual quality is the true metric
  • Optimizer state can be saved and resumed for interrupted training sessions
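The loop above can be reduced to a runnable toy that shows its one defining property: only the embedding parameters receive gradient updates. Everything model-specific is collapsed here, which is a loud simplification: the identity "forward pass" stands in for VAE encoding plus UNet noise prediction, and a fixed target vector stands in for the sampled noise, but the structure (MSE loss, gradient, optimizer step on the embedding alone) mirrors the description.

```python
import random

def train_embedding(target, steps=200, lr=0.1):
    """Toy gradient-descent loop. Only `emb` is trainable, mirroring how
    textual inversion freezes the UNet, VAE, and CLIP weights."""
    rng = random.Random(0)
    emb = [rng.uniform(-1, 1) for _ in target]  # the learnable vectors
    for _ in range(steps):
        # Forward pass: identity stand-in for VAE encode + UNet noise prediction.
        pred = emb
        # Gradient of mean-squared-error loss w.r.t. the embedding.
        grad = [2 * (p - t) / len(target) for p, t in zip(pred, target)]
        # Optimizer step: the embedding vectors are the only parameters updated.
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb

trained = train_embedding([0.25, -0.5, 0.75, 0.0])
```

In the real loop the loss rarely falls this cleanly; as noted above, it often plateaus, and the periodic sample images are the more reliable convergence signal.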

Step 5: Embedding evaluation and saving

Evaluate the trained embedding by generating test images with various prompts containing the embedding token. Compare outputs against the training images to assess concept fidelity and prompt responsiveness. The final embedding is saved as a .pt or .safetensors file in the embeddings directory. Embeddings can optionally be encoded into images (steganography) for easy sharing.

Key considerations:

  • Test with diverse prompts to verify the embedding generalizes beyond the training context
  • Over-trained embeddings may resist style changes or composition modifications
  • The embedding file contains metadata (training step, model hash, token name)
  • Embedding files are small (typically 4-64 KB) and easily shared
  • Place embedding files in the embeddings directory for automatic discovery
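Saving the result with its metadata might look like the sketch below. The webui actually writes `.pt` files via `torch.save` (or `.safetensors`); JSON is used here only to keep the example dependency-free, and the field names approximate, rather than reproduce, the real file's metadata keys.

```python
import json
import os
import tempfile

def save_embedding(embedding, step, model_hash, directory):
    """Write an embedding plus its metadata (training step, model hash,
    token name) to the given directory. JSON stands in for torch.save."""
    payload = {
        "name": embedding["name"],        # token name used in prompts
        "step": step,                     # training step at save time
        "sd_checkpoint": model_hash,      # hash of the base model trained against
        "vectors": embedding["vectors"],  # the learned token vectors
    }
    path = os.path.join(directory, embedding["name"] + ".json")
    with open(path, "w") as f:
        json.dump(payload, f)
    return path

emb = {"name": "my-subject", "vectors": [[0.01] * 768]}
with tempfile.TemporaryDirectory() as d:
    path = save_embedding(emb, step=3000, model_hash="abc123", directory=d)
    with open(path) as f:
        loaded = json.load(f)
```

Saving one such file per checkpoint interval is what makes the "select the best training point" workflow from Step 3 possible: each file records the step it was saved at.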
