
Implementation:Huggingface Diffusers AutoencoderKL Decode

From Leeroopedia
Knowledge Sources
Domains Diffusion_Models, Variational_Autoencoders, Latent_Space
Last Updated 2026-02-13 21:00 GMT

Overview

Concrete tool for decoding a batch of latent tensors into pixel-space images using the KL-regularized Variational Autoencoder provided by the Diffusers library.

Description

AutoencoderKL.decode takes a batch of latent tensors (output of the denoising loop, after unscaling) and passes them through the VAE decoder to reconstruct pixel-space images. The method supports sliced decoding: when use_slicing is enabled and the batch size is greater than 1, each latent in the batch is decoded individually and the results are concatenated. This reduces peak GPU memory usage at the cost of slightly slower throughput.
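The slicing behavior described above can be sketched in plain PyTorch. The `tiny_decoder` function below is a stand-in for the real VAE decoder (which is far larger), but the branch on `use_slicing` mirrors the logic: split the batch, decode each latent individually, and concatenate the results.

```python
import torch

def tiny_decoder(z: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real VAE decoder: upsample 8x spatially and keep 3 channels.
    z = torch.nn.functional.interpolate(z, scale_factor=8, mode="nearest")
    return z[:, :3]  # hypothetical channel reduction, for illustration only

def decode_sliced(z: torch.Tensor, use_slicing: bool) -> torch.Tensor:
    # Mirrors the sliced-decoding branch: one latent at a time, then concatenate.
    if use_slicing and z.shape[0] > 1:
        slices = [tiny_decoder(z_slice) for z_slice in z.split(1)]
        return torch.cat(slices)
    return tiny_decoder(z)

batch = torch.randn(4, 4, 16, 16)
out = decode_sliced(batch, use_slicing=True)
print(out.shape)  # torch.Size([4, 3, 128, 128])
```

Because the decoder is applied per-sample, the sliced path produces the same result as decoding the whole batch at once; only the peak memory footprint differs.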

Internally, the method delegates to _decode, which runs the latent through a post-quantization convolution layer and then the main decoder network. The decoder consists of residual blocks, self-attention layers, and upsampling operations that progressively increase the spatial resolution from latent dimensions to pixel dimensions.

The output is either a DecoderOutput object (a dataclass-style BaseOutput with a sample field) or a plain tuple containing just the decoded tensor, depending on the return_dict flag.
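The two return styles can be illustrated with a small stand-in; the `DecoderOutput` dataclass below is a minimal sketch of the diffusers class, and the decoder body is a placeholder:

```python
from dataclasses import dataclass
import torch

@dataclass
class DecoderOutput:
    # Minimal stand-in for diffusers' DecoderOutput (a BaseOutput subclass).
    sample: torch.Tensor

def decode(z: torch.Tensor, return_dict: bool = True):
    sample = z * 2.0  # placeholder for the real decoder network
    if not return_dict:
        return (sample,)  # plain tuple: access the tensor via [0]
    return DecoderOutput(sample=sample)  # structured: access via .sample

z = torch.ones(1, 4, 2, 2)
assert torch.equal(decode(z).sample, decode(z, return_dict=False)[0])
```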

Usage

This method is called automatically by the pipeline after the denoising loop completes (unless output_type="latent"). Call it directly when implementing custom pipelines, performing latent-space manipulations that need to be visualized, or when you need explicit control over the decoding step. The input latents must already be unscaled (divided by scaling_factor) before being passed to this method.
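The unscaling step looks like this in isolation. The literal 0.13025 below is the SDXL VAE's published scaling_factor, used here only as a stand-in; in real code always read vae.config.scaling_factor rather than hard-coding a value:

```python
import torch

scaling_factor = 0.13025  # stand-in for vae.config.scaling_factor (SDXL value)
raw_latents = torch.randn(1, 4, 128, 128)

# The denoising loop operates on scaled latents; divide by scaling_factor
# before calling vae.decode(...).
latents = raw_latents / scaling_factor
```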

Code Reference

Source Location

  • Repository: diffusers
  • File: src/diffusers/models/autoencoders/autoencoder_kl.py
  • Lines: 214-248

Signature

def decode(
    self,
    z: torch.FloatTensor,
    return_dict: bool = True,
    generator=None,
) -> DecoderOutput | torch.FloatTensor:

Import

from diffusers import AutoencoderKL

I/O Contract

Inputs

  • z (torch.FloatTensor, required): Input batch of latent vectors to decode. Shape: [batch_size, latent_channels, height, width] (e.g., [1, 4, 128, 128] for SDXL at 1024x1024). Must already be unscaled (divided by scaling_factor).
  • return_dict (bool, optional): Whether to return a DecoderOutput object or a plain tuple. Defaults to True.
  • generator (torch.Generator, optional): Random number generator for reproducibility. Currently unused in the standard decode path but available for subclasses.

Outputs

  • sample (torch.Tensor): The decoded pixel-space image tensor. Shape: [batch_size, 3, height * vae_scale_factor, width * vae_scale_factor] (e.g., [1, 3, 1024, 1024]). Values are in the range [-1, 1]. Wrapped in DecoderOutput if return_dict=True, otherwise returned as the first element of a plain tuple.

Usage Examples

Basic Usage

from diffusers import AutoencoderKL
import torch

# Load the SDXL VAE
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    subfolder="vae",
    torch_dtype=torch.float16,
).to("cuda")

# Stand-in latents; in practice these come from a denoising loop
latents = torch.randn(1, 4, 128, 128, device="cuda", dtype=torch.float16)

# Unscale the latents first
latents = latents / vae.config.scaling_factor

# Decode to pixel space
with torch.no_grad():
    decoded = vae.decode(latents, return_dict=False)[0]

# decoded shape: [1, 3, 1024, 1024], range [-1, 1]
# Normalize to [0, 1] for visualization
image_tensor = (decoded / 2 + 0.5).clamp(0, 1)

With Sliced Decoding for Memory Efficiency

from diffusers import AutoencoderKL
import torch

vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    subfolder="vae",
    torch_dtype=torch.float16,
).to("cuda")

# Enable sliced decoding for large batches
vae.enable_slicing()

# Decode a batch of latents one at a time to save memory
batch_latents = torch.randn(4, 4, 128, 128, device="cuda", dtype=torch.float16)
batch_latents = batch_latents / vae.config.scaling_factor

with torch.no_grad():
    decoded = vae.decode(batch_latents).sample
# Each latent is decoded individually, then concatenated

Inside a Custom Pipeline

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Generate latents without decoding
latent_output = pipe(
    "A photo of a mountain landscape",
    output_type="latent",
    num_inference_steps=30,
).images

# Manually unscale and decode
latents = latent_output / pipe.vae.config.scaling_factor
with torch.no_grad():
    image_tensor = pipe.vae.decode(latents, return_dict=False)[0]

# Post-process manually
image = pipe.image_processor.postprocess(image_tensor, output_type="pil")[0]
image.save("mountain.png")
