Workflow: AUTOMATIC1111 Stable Diffusion WebUI Text-to-Image Generation
| Knowledge Sources | |
|---|---|
| Domains | Image_Generation, Stable_Diffusion, Generative_AI |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
End-to-end process for generating images from text prompts using Stable Diffusion models with optional high-resolution upscaling.
Description
This workflow covers the complete text-to-image generation pipeline in the AUTOMATIC1111 WebUI. The user provides a text prompt describing the desired image and a negative prompt specifying undesired elements. The system encodes these prompts into CLIP conditioning vectors, initializes random latent noise from a seed, and iteratively denoises the latent using a selected sampler and noise schedule. The resulting latent is decoded through the VAE into a pixel image. An optional high-resolution fix pass can upscale the initial result and re-denoise it at higher resolution for improved detail.
Usage
Execute this workflow when you have a text description of an image you want to generate and a loaded Stable Diffusion checkpoint (SD 1.x, SD 2.x, SDXL, or SD3). This is the primary generation mode and the most common entry point for users of the WebUI.
Execution Steps
Step 1: Prompt composition
Compose the positive and negative text prompts that describe the desired and undesired image content. The prompt parser supports attention weighting with parentheses (word:1.2), prompt scheduling [from:to:when] (switching at a given step number or fraction of total steps), alternating tokens [a|b], the BREAK keyword for manual chunk splitting, and composable prompts combined with AND. Optionally apply saved prompt styles to prepend or append standard prompt fragments.
Key considerations:
- Each CLIP chunk holds 77 tokens (75 usable plus begin/end markers); longer prompts are split into multiple chunks automatically
- Attention weighting defaults to 1.0; values above 1.0 increase emphasis, below 1.0 decrease it
- Negative prompts follow the same syntax as positive prompts
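To make the weighting syntax concrete, here is a minimal sketch of extracting explicit (token:weight) pairs from a prompt. This is not the WebUI's actual parser, which handles nesting, scheduling, and BREAK via a full grammar; the regex and function below are illustrative assumptions only.

```python
import re

# Matches the explicit "(text:weight)" form only; nested parentheses,
# scheduling, and alternation are out of scope for this sketch.
WEIGHT_RE = re.compile(r"\(([^():]+):([0-9.]+)\)")

def extract_weights(prompt: str) -> list[tuple[str, float]]:
    """Return (text, weight) pairs; unweighted spans default to 1.0."""
    pairs, pos = [], 0
    for m in WEIGHT_RE.finditer(prompt):
        plain = prompt[pos:m.start()].strip()
        if plain:
            pairs.append((plain, 1.0))
        pairs.append((m.group(1), float(m.group(2))))
        pos = m.end()
    tail = prompt[pos:].strip()
    if tail:
        pairs.append((tail, 1.0))
    return pairs
```

For example, `extract_weights("a photo of a (cat:1.2) on a sofa")` yields the "cat" span with weight 1.2 and the surrounding text at the default 1.0.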
Step 2: Parameter configuration
Configure the generation parameters: sampling method (Euler, DPM++ 2M, DDIM, etc.), noise schedule type (Karras, Exponential, Beta, etc.), number of sampling steps, CFG (Classifier-Free Guidance) scale, output image dimensions, batch size, batch count, and the random seed. These parameters control the quality, style, and reproducibility of the output.
Key considerations:
- Higher CFG scale increases prompt adherence but may reduce image quality at extreme values
- More sampling steps generally improve quality with diminishing returns
- Seed value of -1 generates a random seed; fixed seeds enable reproducible results
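These parameters map directly onto a request body for the WebUI's REST API. The sketch below follows the commonly documented /sdapi/v1/txt2img schema; the field names (e.g. `sampler_name`, `n_iter`, `scheduler`) are assumptions that should be verified against your WebUI version.

```python
import json

# Hypothetical txt2img request payload for the WebUI's HTTP API.
payload = {
    "prompt": "a watercolor fox in a snowy forest",
    "negative_prompt": "blurry, low quality",
    "sampler_name": "DPM++ 2M",
    "scheduler": "Karras",   # noise schedule type
    "steps": 28,             # sampling steps
    "cfg_scale": 6.5,        # prompt adherence vs. naturalness
    "width": 768,
    "height": 512,
    "batch_size": 2,         # images per batch
    "n_iter": 1,             # batch count
    "seed": -1,              # -1 => random seed
}
body = json.dumps(payload)   # ready to POST to /sdapi/v1/txt2img
```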
Step 3: Model and VAE loading
Load the selected Stable Diffusion checkpoint and VAE into GPU memory. The system detects the model architecture (SD 1.x, SD 2.x, SDXL, SD3, or variants like InstructPix2Pix) from the state dict structure and applies the correct configuration. Attention optimizations (xformers, scaled dot product, sub-quadratic) are applied via model hijacking. The VAE can be overridden separately from the checkpoint.
Key considerations:
- Model type is auto-detected from state dict keys and shapes
- Checkpoints are unpickled through a restricted loader with a class whitelist, so malicious pickle payloads cannot execute arbitrary code
- Model caching avoids reloading when switching back to previously used checkpoints
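The auto-detection idea can be sketched as a check on state-dict key prefixes. The real WebUI logic also inspects tensor shapes and covers more variants; the prefixes below are indicative assumptions, not an exhaustive rule set.

```python
def guess_architecture(state_dict_keys: set[str]) -> str:
    """Rough sketch of detecting the model family from state-dict keys."""
    if any(k.startswith("conditioner.embedders.") for k in state_dict_keys):
        return "SDXL"    # dual text encoders live under 'conditioner'
    if any(k.startswith("cond_stage_model.model.") for k in state_dict_keys):
        return "SD 2.x"  # OpenCLIP text encoder layout
    if any(k.startswith("cond_stage_model.transformer.") for k in state_dict_keys):
        return "SD 1.x"  # original CLIP ViT-L text encoder layout
    return "unknown"
```

Ordering matters: the SDXL check runs first because an SDXL state dict would otherwise never reach the more specific SD 1.x/2.x prefixes.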
Step 4: Prompt encoding
Encode the positive and negative prompts into conditioning tensors using the CLIP text encoder. Prompts are tokenized into 77-token chunks, with textual inversion embeddings injected for any referenced embedding tokens. Attention weights from the prompt syntax are applied to modify token emphasis. For SDXL, both CLIP-L and CLIP-G encoders produce separate conditioning vectors that are concatenated. For SD3, an additional T5 encoder is used.
Key considerations:
- Textual inversion embeddings are loaded from the embeddings directory and matched by token name
- Multiple CLIP chunks are generated when prompts exceed 77 tokens
- Emphasis modes include Original, A1111, and Normalize-and-rescale strategies
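The chunking step can be sketched as splitting the token stream into 75-token groups, each later padded with begin/end markers to fill the 77-token CLIP context. The real implementation also avoids splitting mid-word and honors BREAK; this is a simplified illustration.

```python
def chunk_tokens(token_ids: list[int], chunk_size: int = 75) -> list[list[int]]:
    """Split prompt tokens into 75-token chunks (77 after adding the
    begin/end markers that frame each CLIP context window)."""
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]
```

A 160-token prompt therefore produces three chunks of 75, 75, and 10 tokens, and the resulting conditioning tensors are concatenated downstream.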
Step 5: Latent initialization and sampling
Initialize the starting latent tensor from random noise using the configured seed and RNG source (GPU, CPU, or NV). Execute the sampling loop: the chosen sampler iteratively denoises the latent over the configured number of steps, guided by the encoded prompt conditioning via Classifier-Free Guidance. The CFG denoiser combines unconditional and conditional predictions at each step according to the CFG scale.
Key considerations:
- Seed/subseed blending (variation seeds) allows interpolation between two noise patterns
- RNG source affects reproducibility across different hardware
- Extra networks (LoRA, Hypernetworks) modify model weights before sampling begins
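The per-step CFG combination described above is a single linear formula: the unconditional prediction is pushed toward the prompt-conditioned one by the CFG scale. A minimal NumPy sketch:

```python
import numpy as np

def cfg_combine(uncond: np.ndarray, cond: np.ndarray,
                cfg_scale: float) -> np.ndarray:
    """Classifier-free guidance: denoised = uncond + scale * (cond - uncond).

    scale = 1.0 reproduces the conditional prediction exactly;
    larger scales extrapolate further toward the prompt.
    """
    return uncond + cfg_scale * (cond - uncond)
```

This is why extreme CFG values degrade quality: the output is extrapolated well outside the range of either model prediction.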
Step 6: High resolution fix (optional)
If enabled, the initial low-resolution output is upscaled using a selected upscaler (Latent, Lanczos, ESRGAN, etc.) to a higher target resolution. The upscaled image is then re-encoded to latent space and denoised again with a configurable denoising strength, separate sampler, and optional separate HR prompts. This two-pass approach produces sharper high-resolution images while avoiding composition artifacts from generating directly at high resolution.
Key considerations:
- Denoising strength for HR pass controls how much the upscaled image is modified
- A separate sampler and scheduler can be configured for the HR pass
- Latent upscale methods operate directly in latent space without VAE round-trip
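Computing the hires-fix target size can be sketched as scaling both dimensions by the upscale factor and rounding to a multiple of 8, since the latent space downsamples pixels by a factor of 8. The WebUI also supports specifying an explicit target width/height; this sketch covers only the scale-factor path.

```python
def hires_target(width: int, height: int, upscale_by: float) -> tuple[int, int]:
    """Scale both dimensions and snap to a multiple of 8 (latent grid)."""
    def round8(v: float) -> int:
        return int(round(v / 8) * 8)
    return round8(width * upscale_by), round8(height * upscale_by)
```

For example, a 832x480 first pass upscaled by 1.5 targets 1248x720 for the second denoising pass.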
Step 7: VAE decoding and post-processing
Decode the final denoised latent tensor through the VAE decoder to produce a pixel-space image. Apply optional post-processing: face restoration (CodeFormer or GFPGAN) to improve facial details, and color correction to keep the output's colors consistent with the initial pass when a later pass has shifted them. The generated image is saved with full generation-parameter metadata (infotext) embedded in the PNG, enabling exact reproduction of the generation settings.
Key considerations:
- VAE tiling is used for large images to avoid out-of-memory errors
- TAESD (Tiny AutoEncoder) provides fast approximate decoding for live previews during generation
- All generation parameters are serialized into the PNG metadata for reproducibility
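The infotext serialization can be sketched as a three-part text block: prompt, negative prompt, then a comma-separated settings line. The exact keys and ordering vary by WebUI version, so treat this layout as illustrative rather than the canonical format.

```python
def build_infotext(prompt: str, negative: str, params: dict) -> str:
    """Assemble a generation-parameters string in the style of the PNG
    'parameters' text chunk (illustrative; keys/order vary by version)."""
    lines = [prompt]
    if negative:
        lines.append(f"Negative prompt: {negative}")
    lines.append(", ".join(f"{k}: {v}" for k, v in params.items()))
    return "\n".join(lines)
```

Pasting such a string back into the WebUI's prompt box (via the paste/read-parameters button) restores the settings, which is what makes the embedded metadata sufficient for exact reproduction.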