
Principle:AUTOMATIC1111 Stable diffusion webui Textual inversion training loop

From Leeroopedia


Knowledge Sources
Domains Textual Inversion, Training Loop, Optimization, Stable Diffusion
Last Updated 2026-02-08 00:00 GMT

Overview

The textual inversion training loop is the iterative optimization process that adjusts only the embedding vectors of a new pseudo-word token while keeping all other model parameters frozen, using gradient descent with mixed precision and gradient accumulation.

Description

Textual inversion training differs from standard neural network training in a fundamental way: the vast majority of the model's parameters are frozen. Only the embedding vector(s) for the new token are updated. This makes the optimization problem highly constrained -- the loss landscape is navigated entirely through a small set of embedding dimensions (typically 768 dimensions per vector for SD 1.x, or 1024/1280 for SD 2.x/SDXL).

The training loop follows the standard denoising score matching objective used in diffusion models, but applied specifically to optimize the embedding:

  1. Forward pass: Sample a random timestep, add noise to the pre-encoded latent, and predict the noise using the U-Net conditioned on the text embedding containing the new token
  2. Loss computation: Compute the MSE between predicted and actual noise
  3. Backward pass: Compute gradients with respect to the embedding vector only
  4. Gradient accumulation: Accumulate gradients over multiple batches before performing an optimizer step
  5. Gradient clipping: Optionally clip gradients by value or norm to prevent training instability
  6. Optimizer step: Update the embedding vector using AdamW
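
The six steps above can be sketched as a self-contained PyTorch loop. The linear "unet", the toy noising schedule, and the hyperparameter values below are illustrative stand-ins, not the webui's actual modules; on a CPU-only machine the autocast/scaler path is disabled and everything runs in float32:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, T, G = 8, 1000, 4        # embedding width, timesteps, accumulation factor
device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-ins: a frozen "denoiser" and one learnable embedding vector.
unet = torch.nn.Linear(2 * dim, dim).to(device)
for p in unet.parameters():
    p.requires_grad_(False)                                      # everything frozen ...
embedding = torch.zeros(dim, device=device, requires_grad=True)  # ... except this

optimizer = torch.optim.AdamW([embedding], lr=5e-3, weight_decay=0.0)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(8):
    latents = torch.randn(4, dim, device=device)       # pre-encoded latents
    noise = torch.randn_like(latents)
    t = torch.rand(4, 1, device=device)                # toy continuous timestep
    noisy = (1 - t) * latents + t * noise              # toy noising schedule
    with torch.autocast(device, enabled=(device == "cuda")):  # 1. forward pass
        cond = embedding.expand(4, dim)                # conditioning carries the token
        pred = unet(torch.cat([noisy, cond], dim=1))
        loss = F.mse_loss(pred, noise) / G             # 2. MSE, scaled for accumulation
    scaler.scale(loss).backward()                      # 3. grads reach only `embedding`
    if (step + 1) % G == 0:                            # 4. every G batches:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_([embedding], 1.0)  # 5. optional clipping
        scaler.step(optimizer)                         # 6. AdamW updates the embedding
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```

Because the optimizer only holds `embedding` and every other parameter has `requires_grad=False`, the U-Net stays byte-for-byte unchanged no matter how long training runs.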

Several key techniques make this process practical:

  • Mixed precision training: Using torch.cuda.amp.GradScaler and autocast to perform forward passes in float16 while maintaining float32 for gradient accumulation, reducing VRAM usage and increasing throughput
  • Gradient accumulation: Simulating larger effective batch sizes by accumulating gradients over multiple forward passes before updating, which stabilizes training when GPU memory limits the actual batch size
  • Periodic checkpointing: Saving intermediate embedding states at regular intervals for recovery and selection of the best training step
  • Preview generation: Periodically generating sample images to visually monitor training progress
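
Periodic checkpointing can be sketched as below. The payload keys loosely mirror the webui's embedding file layout, but the real files carry extra metadata (model hash, checksum), so treat the exact schema and the helper name as illustrative:

```python
import os
import torch

def maybe_checkpoint(embedding, name, step, every, out_dir="embeddings"):
    """Save the embedding every `every` steps so the best checkpoint can be
    selected later. The payload layout is illustrative, not the webui's
    exact .pt schema."""
    if every > 0 and step % every == 0:
        os.makedirs(out_dir, exist_ok=True)
        payload = {
            "string_to_param": {"*": embedding.detach().cpu()},
            "name": name,
            "step": step,
        }
        torch.save(payload, os.path.join(out_dir, f"{name}-{step}.pt"))
```

Saving a step-stamped copy rather than overwriting one file is what makes after-the-fact selection of the best training step possible.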

Usage

Use this training loop approach when:

  • You want to teach a Stable Diffusion model a new concept through textual inversion
  • You need mixed precision training to fit within GPU memory constraints
  • You want gradient accumulation to achieve effective batch sizes larger than what fits in memory
  • You need periodic checkpointing for long training runs

Theoretical Basis

Denoising Score Matching Objective

The training objective minimizes:

L = E_{z, ϵ∼N(0,1), t∼U(1,T)} [ ‖ϵ − ϵ_θ(z_t, t, c_θ(prompt with S*))‖² ]

where:

  • z is the pre-encoded latent of a training image
  • z_t is the noised latent at timestep t
  • ϵ_θ is the frozen U-Net denoiser
  • c_θ(prompt) is the text conditioning from CLIP, which includes the learnable embedding v* for the pseudo-word S*
  • Only v* receives gradients; all other parameters in ϵ_θ and c_θ are frozen

Mixed Precision with GradScaler

Mixed precision training uses float16 for forward and backward passes to save memory and increase speed, while maintaining a float32 master copy of the embedding for accurate gradient accumulation:

1. Forward pass in float16 via autocast
2. Scale the loss: scaled_loss = scaler.scale(loss)
3. Backward pass: scaled_loss.backward()
4. Unscale gradients: scaler.step(optimizer) internally unscales
5. Update scaler: scaler.update() adjusts the scale factor

The scaler dynamically adjusts the loss scaling factor to prevent float16 underflow in gradients while avoiding overflow.
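
The five steps can be sketched with a stand-in linear model (an illustrative sketch; on a CPU-only machine the scaler and autocast are disabled, so the same code runs in plain float32):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(4, 4).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x, y = torch.randn(2, 4, device=device), torch.randn(2, 4, device=device)
w0 = model.weight.detach().clone()

with torch.autocast(device, enabled=(device == "cuda")):  # 1. forward in float16 on GPU
    loss = F.mse_loss(model(x), y)
scaled_loss = scaler.scale(loss)   # 2. multiply the loss by the current scale factor
scaled_loss.backward()             # 3. backward pass produces scaled gradients
scaler.step(opt)                   # 4. unscales grads; skips the step on inf/NaN
scaler.update()                    # 5. grows or shrinks the scale factor
```

If any unscaled gradient contains inf/NaN, `scaler.step` silently skips the optimizer update and `scaler.update` halves the scale, which is how training survives occasional float16 overflow.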

Gradient Accumulation

With a gradient accumulation factor of G, the effective batch size becomes B_eff = B × G. The loss is divided by G before backpropagation so that the accumulated gradients match the scale of a single large batch:

for j in range(G):
    loss = model(batch[j]) / G  # divide by G so the accumulated gradient matches one large batch
    loss.backward()             # gradients accumulate in each parameter's .grad
optimizer.step()                # one update for the whole effective batch
optimizer.zero_grad()

Gradient Clipping

Two modes of gradient clipping prevent training instability:

  • Value clipping: clip_grad_value_(params, clip_value) clamps each gradient element to [-clip_value, clip_value]
  • Norm clipping: clip_grad_norm_(params, max_norm) rescales the gradient if its norm exceeds max_norm

The clipping threshold itself can follow a piecewise schedule (using the same LearnRateScheduler class), allowing tighter clipping as training converges.
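
The two modes behave differently on the same gradients, which a small sketch makes concrete (gradients are assigned by hand here purely for illustration):

```python
import torch

p = torch.zeros(3, requires_grad=True)
p.grad = torch.tensor([0.5, -2.0, 4.0])
torch.nn.utils.clip_grad_value_([p], clip_value=1.0)
# each element clamped independently into [-1, 1]: p.grad is now [0.5, -1.0, 1.0]

q = torch.zeros(2, requires_grad=True)
q.grad = torch.tensor([3.0, 4.0])                   # L2 norm = 5
torch.nn.utils.clip_grad_norm_([q], max_norm=1.0)
# whole vector rescaled to norm 1: q.grad is now ≈ [0.6, 0.8]
```

Value clipping changes the gradient's direction (elements are clamped independently), while norm clipping preserves direction and only shrinks the magnitude.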

AdamW Optimizer

AdamW is the optimizer of choice, applying decoupled weight decay regularization. For textual inversion, weight decay is set to 0.0 since the embedding vectors should not be explicitly regularized toward zero -- the CLIP embedding space structure provides implicit regularization.
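
Setting this up takes one optimizer construction; the learning rate below is an illustrative value of the kind commonly used for textual inversion, not a figure taken from this page:

```python
import torch

# One SD 1.x-sized embedding vector; requires_grad=True makes it the only
# trainable parameter handed to the optimizer.
embedding = torch.randn(1, 768, requires_grad=True)

optimizer = torch.optim.AdamW(
    [embedding],
    lr=5e-3,             # illustrative learning rate (assumption)
    betas=(0.9, 0.999),  # AdamW defaults
    weight_decay=0.0,    # no decay: don't pull the embedding toward zero
)
```

Passing `[embedding]` rather than `model.parameters()` is the whole trick: the optimizer never sees the U-Net or CLIP weights, so they cannot drift.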
