
Workflow:AUTOMATIC1111 Stable Diffusion WebUI text-to-image generation

From Leeroopedia


Knowledge Sources
Domains Image_Generation, Stable_Diffusion, Generative_AI
Last Updated 2026-02-08 08:00 GMT

Overview

End-to-end process for generating images from text prompts using Stable Diffusion models with optional high-resolution upscaling.

Description

This workflow covers the complete text-to-image generation pipeline in the AUTOMATIC1111 WebUI. The user provides a text prompt describing the desired image and a negative prompt specifying undesired elements. The system encodes these prompts into CLIP conditioning vectors, initializes random latent noise from a seed, and iteratively denoises the latent using a selected sampler and noise schedule. The resulting latent is decoded through the VAE into a pixel image. An optional high-resolution fix pass can upscale the initial result and re-denoise it at higher resolution for improved detail.

Usage

Execute this workflow when you have a text description of an image you want to generate and a loaded Stable Diffusion checkpoint (SD 1.x, SD 2.x, SDXL, or SD3). This is the primary generation mode and the most common entry point for users of the WebUI.

Execution Steps

Step 1: Prompt composition

Compose the positive and negative text prompts that describe the desired and undesired image content. The prompt parser supports attention weighting with parentheses (word:1.2), scheduling [from:to:step], alternating tokens [a|b], the BREAK keyword for manual chunk splitting, and composable prompts using AND. Optionally apply saved prompt styles to prepend or append standard prompt fragments.

Key considerations:

  • Token limit per chunk is 77 (CLIP); prompts exceeding this are split into multiple chunks automatically
  • Attention weighting defaults to 1.0; values above 1.0 increase emphasis, below 1.0 decrease it
  • Negative prompts follow the same syntax as positive prompts
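As a concrete illustration of the attention-weighting syntax, here is a minimal parser for the explicit-weight form `(text:1.2)`. This is a simplified sketch, not the WebUI's actual parser, which additionally handles nested parentheses, `[from:to:step]` scheduling, BREAK, and AND:

```python
import re

def parse_attention(prompt: str):
    """Parse a simplified subset of the WebUI attention syntax.

    Supports only the explicit-weight form ``(text:1.2)``; plain text
    receives the default weight 1.0.
    """
    pattern = re.compile(r"\(([^():]+):([0-9.]+)\)")
    pieces, pos = [], 0
    for m in pattern.finditer(prompt):
        if m.start() > pos:                       # plain text before the match
            pieces.append((prompt[pos:m.start()], 1.0))
        pieces.append((m.group(1), float(m.group(2))))
        pos = m.end()
    if pos < len(prompt):                         # trailing plain text
        pieces.append((prompt[pos:], 1.0))
    return pieces

print(parse_attention("a photo of a (cat:1.3) on a mat"))
# [('a photo of a ', 1.0), ('cat', 1.3), (' on a mat', 1.0)]
```

Each returned pair is a text fragment with its emphasis weight, which downstream code uses to scale the corresponding token embeddings.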

Step 2: Parameter configuration

Configure the generation parameters: sampling method (Euler, DPM++ 2M, DDIM, etc.), noise schedule type (Karras, Exponential, Beta, etc.), number of sampling steps, CFG (Classifier-Free Guidance) scale, output image dimensions, batch size, batch count, and the random seed. These parameters control the quality, style, and reproducibility of the output.

Key considerations:

  • Higher CFG scale increases prompt adherence but may reduce image quality at extreme values
  • More sampling steps generally improve quality with diminishing returns
  • Seed value of -1 generates a random seed; fixed seeds enable reproducible results
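The parameter set and the seed convention can be sketched as a small configuration object. Field names here are illustrative, not the WebUI's internal attribute names:

```python
import random
from dataclasses import dataclass

@dataclass
class GenParams:
    sampler: str = "DPM++ 2M"
    scheduler: str = "Karras"
    steps: int = 20
    cfg_scale: float = 7.0
    width: int = 512
    height: int = 512
    batch_size: int = 1   # images per batch
    n_iter: int = 1       # batch count
    seed: int = -1

    def resolve_seed(self) -> int:
        """A seed of -1 means 'pick one at random'; any fixed value
        makes the noise initialization, and thus the image, reproducible."""
        if self.seed == -1:
            return random.randrange(2**32)
        return self.seed

p = GenParams(seed=1234)
assert p.resolve_seed() == 1234   # fixed seed -> reproducible
```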

Step 3: Model and VAE loading

Load the selected Stable Diffusion checkpoint and VAE into GPU memory. The system detects the model architecture (SD 1.x, SD 2.x, SDXL, SD3, or variants like InstructPix2Pix) from the state dict structure and applies the correct configuration. Attention optimizations (xformers, scaled dot product, sub-quadratic) are applied via model hijacking. The VAE can be overridden separately from the checkpoint.

Key considerations:

  • Model type is auto-detected from state dict keys and shapes
  • Safe tensor loading with class whitelist prevents arbitrary code execution
  • Model caching avoids reloading when switching back to previously used checkpoints
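The auto-detection step can be illustrated with key-prefix heuristics like the following. These prefixes are representative of how SD 1.x, SD 2.x, and SDXL checkpoints differ, but the real detection logic also inspects tensor shapes and covers more variants (SD3, InstructPix2Pix, inpainting models):

```python
def detect_architecture(state_dict_keys):
    """Guess the checkpoint family from state-dict key names (sketch)."""
    keys = set(state_dict_keys)
    if any(k.startswith("conditioner.embedders.") for k in keys):
        return "SDXL"                      # dual text encoders under 'conditioner'
    if any(k.startswith("cond_stage_model.model.") for k in keys):
        return "SD 2.x"                    # OpenCLIP text encoder
    if any(k.startswith("cond_stage_model.transformer.") for k in keys):
        return "SD 1.x"                    # original CLIP text encoder
    return "unknown"

assert detect_architecture(
    ["cond_stage_model.transformer.text_model.embeddings.token_embedding.weight"]
) == "SD 1.x"
```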

Step 4: Prompt encoding

Encode the positive and negative prompts into conditioning tensors using the CLIP text encoder. Prompts are tokenized into 77-token chunks, with textual inversion embeddings injected for any referenced embedding tokens. Attention weights from the prompt syntax are applied to modify token emphasis. For SDXL, both CLIP-L and CLIP-G encoders produce separate conditioning vectors that are concatenated. For SD3, an additional T5 encoder is used.

Key considerations:

  • Textual inversion embeddings are loaded from the embeddings directory and matched by token name
  • Multiple CLIP chunks are generated when prompts exceed 77 tokens
  • Emphasis modes include Original, A1111, and Normalize-and-rescale strategies
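The chunking behavior can be sketched as follows: each chunk carries up to 75 content tokens plus begin/end markers, giving the 77-token window CLIP expects, with the end token also used for padding. The token ids shown are CLIP's usual BOS/EOS values; the sketch ignores BREAK-forced splits and embedding injection:

```python
def chunk_tokens(token_ids, chunk_size=75, bos=49406, eos=49407):
    """Split token ids into CLIP-sized 77-token chunks (sketch)."""
    chunks = []
    for i in range(0, max(len(token_ids), 1), chunk_size):
        body = token_ids[i:i + chunk_size]
        pad = [eos] * (chunk_size - len(body))   # pad short chunks with EOS
        chunks.append([bos] + body + pad + [eos])
    return chunks

chunks = chunk_tokens(list(range(100)))   # a 100-token prompt
assert len(chunks) == 2                   # split into two chunks
assert all(len(c) == 77 for c in chunks)  # each exactly 77 tokens
```

Each chunk is encoded separately by CLIP and the resulting conditioning tensors are concatenated along the sequence dimension.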

Step 5: Latent initialization and sampling

Initialize the starting latent tensor from random noise using the configured seed and RNG source (GPU, CPU, or NV). Execute the sampling loop: the chosen sampler iteratively denoises the latent over the configured number of steps, guided by the encoded prompt conditioning via Classifier-Free Guidance. The CFG denoiser combines unconditional and conditional predictions at each step according to the CFG scale.

Key considerations:

  • Seed/subseed blending allows interpolation between two noise patterns
  • RNG source affects reproducibility across different hardware
  • Extra networks (LoRA, Hypernetworks) modify model weights before sampling begins
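The CFG combination performed at each sampling step is the standard classifier-free-guidance formula: extrapolate from the unconditional prediction toward the prompt-conditioned one. Shown element-wise on plain lists for clarity; in practice these are latent tensors:

```python
def cfg_combine(uncond, cond, cfg_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]

# cfg_scale 1.0 returns the conditional prediction unchanged;
# larger values extrapolate further toward the prompt.
assert cfg_combine([0.0, 0.0], [1.0, 2.0], 1.0) == [1.0, 2.0]
assert cfg_combine([0.0, 0.0], [1.0, 2.0], 7.0) == [7.0, 14.0]
```

This is why extreme CFG values can degrade quality: the combined prediction is extrapolated well outside the range of either model output.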

Step 6: High resolution fix (optional)

If enabled, the initial low-resolution output is upscaled using a selected upscaler (Latent, Lanczos, ESRGAN, etc.) to a higher target resolution. The upscaled image is then re-encoded to latent space and denoised again with a configurable denoising strength, separate sampler, and optional separate HR prompts. This two-pass approach produces sharper high-resolution images while avoiding composition artifacts from generating directly at high resolution.

Key considerations:

  • Denoising strength for HR pass controls how much the upscaled image is modified
  • A separate sampler and scheduler can be configured for the HR pass
  • Latent upscale methods operate directly in latent space without VAE round-trip
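The target-resolution logic of the HR pass can be sketched as: an explicit "resize to" width/height wins when set, otherwise the first-pass size is multiplied by the upscale factor. This is an assumption-level simplification; the WebUI additionally rounds dimensions and handles "resize by" versus "resize to" modes:

```python
def hr_target_size(width, height, hr_scale=2.0, hr_resize_x=0, hr_resize_y=0):
    """Compute the hires-fix target resolution (sketch)."""
    if hr_resize_x and hr_resize_y:           # explicit target dimensions
        return hr_resize_x, hr_resize_y
    return int(width * hr_scale), int(height * hr_scale)

assert hr_target_size(512, 512) == (1024, 1024)
assert hr_target_size(512, 512, hr_resize_x=1536, hr_resize_y=1024) == (1536, 1024)
```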

Step 7: VAE decoding and post-processing

Decode the final denoised latent tensor through the VAE decoder to produce a pixel-space image. Apply optional post-processing: face restoration (CodeFormer or GFPGAN) to improve facial details, and color correction to keep output colors consistent. The generated image is saved with full generation-parameter metadata (infotext) embedded in the PNG, enabling exact reproduction of the generation settings.

Key considerations:

  • VAE tiling is used for large images to avoid out-of-memory errors
  • TAESD (Tiny AutoEncoder) provides fast approximate decoding for live previews during generation
  • All generation parameters are serialized into the PNG metadata for reproducibility
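The infotext serialization can be sketched as below: the prompt on the first line, the negative prompt on the second, then comma-separated settings. This mirrors the general shape of the WebUI's infotext; the real format includes more fields (model hash, VAE, extra-network annotations, etc.):

```python
def build_infotext(prompt, negative_prompt, params):
    """Serialize generation settings in an infotext-style string (sketch)."""
    lines = [prompt]
    if negative_prompt:
        lines.append(f"Negative prompt: {negative_prompt}")
    lines.append(", ".join(f"{k}: {v}" for k, v in params.items()))
    return "\n".join(lines)

text = build_infotext(
    "a cat", "blurry",
    {"Steps": 20, "Sampler": "DPM++ 2M", "CFG scale": 7,
     "Seed": 1234, "Size": "512x512"},
)
print(text)
# a cat
# Negative prompt: blurry
# Steps: 20, Sampler: DPM++ 2M, CFG scale: 7, Seed: 1234, Size: 512x512
```

Because every setting needed to re-run the generation is present, pasting this text back into the UI can restore the original parameters.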

Execution Diagram

GitHub URL

Workflow Repository