Implementation:Huggingface Diffusers Export To Video

Field	Value
Type	API Doc
Overview	VAE decoding of video latents via `AutoencoderKLWan.decode` and export to MP4 via `export_to_video`
Domains	Video Generation, VAE Decoding, Video Export
Workflow	Video_Generation
Related Principle	Huggingface_Diffusers_Video_Decoding_Export
Source	`src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1180-L1234`, `src/diffusers/utils/export_utils.py:L140-L208`
Last Updated	2026-02-13 00:00 GMT

Code Reference

AutoencoderKLWan.decode

Source: src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1211-L1234

@apply_forward_hook
def decode(self, z: torch.Tensor, return_dict: bool = True) -> DecoderOutput | torch.Tensor:
    """
    Decode a batch of images.

    Args:
        z: Input batch of latent vectors.
        return_dict: Whether to return a DecoderOutput instead of a plain tuple.
    """
    if self.use_slicing and z.shape[0] > 1:
        decoded_slices = [self._decode(z_slice).sample for z_slice in z.split(1)]
        decoded = torch.cat(decoded_slices)
    else:
        decoded = self._decode(z).sample

    if not return_dict:
        return (decoded,)
    return DecoderOutput(sample=decoded)
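
The slicing branch above trades speed for peak memory by decoding one batch element at a time and concatenating the results. A minimal sketch of that pattern follows, using NumPy in place of torch; `toy_decode` and `decode_with_slicing` are hypothetical stand-ins for illustration, not part of Diffusers.

```python
import numpy as np


def toy_decode(z):
    """Stand-in decoder (hypothetical): treats each batch element independently."""
    return z * 0.5 + 0.5


def decode_with_slicing(z, use_slicing=True):
    """Decode batch elements one at a time when slicing is on, then re-join them."""
    if use_slicing and z.shape[0] > 1:
        # One slice of batch size 1 per element, analogous to z.split(1) in torch.
        slices = [toy_decode(z_slice) for z_slice in np.split(z, z.shape[0], axis=0)]
        return np.concatenate(slices, axis=0)
    return toy_decode(z)


# Toy latent batch with layout (B, C, F, H, W); the sizes are arbitrary.
z = np.arange(2 * 3 * 2 * 2 * 2, dtype=np.float32).reshape(2, 3, 2, 2, 2)
sliced = decode_with_slicing(z, use_slicing=True)
unsliced = decode_with_slicing(z, use_slicing=False)
```

Because the stand-in decoder handles each batch element independently, both paths should produce the same values; only the amount of data held in flight at once differs.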

AutoencoderKLWan._decode (Frame-by-Frame)

Source: src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1180-L1209

def _decode(self, z: torch.Tensor, return_dict: bool = True):
    _, _, num_frame, height, width = z.shape
    tile_latent_min_height = self.tile_sample_min_height // self.spatial_compression_ratio
    tile_latent_min_width = self.tile_sample_min_width // self.spatial_compression_ratio

    if self.use_tiling and (width > tile_latent_min_width or height > tile_latent_min_height):
        return self.tiled_decode(z, return_dict=return_dict)

    self.clear_cache()
    x = self.post_quant_conv(z)
    for i in range(num_frame):
        self._conv_idx = [0]
        if i == 0:
            out = self.decoder(x[:, :, i:i+1, :, :], feat_cache=self._feat_map,
                             feat_idx=self._conv_idx, first_chunk=True)
        else:
            out_ = self.decoder(x[:, :, i:i+1, :, :], feat_cache=self._feat_map,
                              feat_idx=self._conv_idx)
            out = torch.cat([out, out_], 2)

    out = torch.clamp(out, min=-1.0, max=1.0)
    self.clear_cache()
    if not return_dict:
        return (out,)
    return DecoderOutput(sample=out)
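
The loop above feeds the decoder one latent frame at a time and carries state between calls through `feat_cache`. The idea can be illustrated with a toy causal 1D convolution: processing a sequence step by step with a small cache of past inputs gives the same result as processing it all at once. This is only an analogy under simplified assumptions, not the Wan decoder itself, and both function names are hypothetical.

```python
def causal_conv_full(frames, kernel):
    """Causal 1D convolution over the whole sequence at once (zero left-padding)."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(frames))
    ]


def causal_conv_streaming(frames, kernel):
    """Same convolution, one frame per step, carrying a cache of past inputs."""
    k = len(kernel)
    cache = [0.0] * (k - 1)  # plays the role of a feature cache; starts as zeros
    out = []
    for frame in frames:
        window = cache + [frame]
        out.append(sum(w * x for w, x in zip(kernel, window)))
        cache = window[1:]  # keep only the last k-1 inputs for the next step
    return out


frames = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.25, 0.5, 1.0]
full = causal_conv_full(frames, kernel)
streamed = causal_conv_streaming(frames, kernel)
```

The streaming version never holds more than `k` inputs at a time, which is the memory benefit that frame-by-frame decoding with a cache aims for.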

export_to_video

Source: src/diffusers/utils/export_utils.py:L140-L208

def export_to_video(
    video_frames: list[np.ndarray] | list[PIL.Image.Image],
    output_video_path: str = None,
    fps: int = 10,
    quality: float = 5.0,
    bitrate: int | None = None,
    macro_block_size: int | None = 16,
) -> str:
    """Export video frames to an MP4 file using imageio + FFmpeg."""
    if output_video_path is None:
        output_video_path = tempfile.NamedTemporaryFile(suffix=".mp4").name

    if isinstance(video_frames[0], np.ndarray):
        video_frames = [(frame * 255).astype(np.uint8) for frame in video_frames]
    elif isinstance(video_frames[0], PIL.Image.Image):
        video_frames = [np.array(frame) for frame in video_frames]

    with imageio.get_writer(
        output_video_path, fps=fps, quality=quality,
        bitrate=bitrate, macro_block_size=macro_block_size
    ) as writer:
        for frame in video_frames:
            writer.append_data(frame)

    return output_video_path

Import

from diffusers.utils import export_to_video
from diffusers import AutoencoderKLWan

Key Parameters

AutoencoderKLWan.decode

Parameter	Type	Description
`z`	`torch.Tensor (B, z_dim, F, H, W)`	Latent tensor to decode, e.g., `(1, 16, 21, 60, 104)`
`return_dict`	`bool`	If `True` (default), return a `DecoderOutput`; otherwise return a one-element tuple `(sample,)`

export_to_video

Parameter	Type	Description	Default
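`video_frames`	`list[np.ndarray] | list[PIL.Image.Image]`	Frames to export; NumPy frames are expected as floats in [0, 1]	(required)
`output_video_path`	`str | None`	Output file path	Auto-generated temp file
`fps`	`int`	Frames per second	`10`
`quality`	`float`	Variable bitrate quality (0-10)	`5.0`
`bitrate`	`int | None`	Fixed bitrate (overrides quality)	`None`
`macro_block_size`	`int | None`	Frame width and height should be divisible by this value; imageio asks FFmpeg to rescale frames that are not	`16`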
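
To check ahead of time whether a resolution will be rescaled because of `macro_block_size`, the rounding can be sketched as below. The helper `padded_size` is hypothetical, and the round-up-to-next-multiple behavior is an assumption based on imageio's documented description of this parameter.

```python
def padded_size(width, height, macro_block_size=16):
    """Return the (width, height) after rounding each up to a multiple of the block size."""
    if not macro_block_size or macro_block_size <= 1:
        # None (or 1) is treated here as "no size constraint".
        return width, height

    def round_up(value):
        # Ceiling division, then scale back up to the next multiple.
        return -(-value // macro_block_size) * macro_block_size

    return round_up(width), round_up(height)


size_720p = padded_size(1280, 720)    # already divisible by 16
size_1080p = padded_size(1920, 1080)  # 1080 is not divisible by 16
```

Resolutions such as 832x480 and 1280x720 are already multiples of 16, so under this assumption they would pass through unchanged.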

I/O Contract

decode

Inputs:

z: 5D latent tensor (B, 16, F_latent, H_latent, W_latent)

Outputs:

DecoderOutput with .sample: 5D pixel tensor (B, 3, F, H, W) clamped to [-1, 1]
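Where F = (F_latent - 1) * scale_factor_temporal + 1 (the first latent frame decodes to one frame and each later one to scale_factor_temporal frames; with the Wan 2.1 factor of 4, 21 latent frames give 81), H = H_latent * 8, W = W_latent * 8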
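
The shape relationship can be sketched as a small helper. It assumes the Wan 2.1 configuration (temporal factor 4, spatial factor 8, three output channels) and that the first latent frame maps to a single output frame; `decoded_shape` is a hypothetical name, not a Diffusers function.

```python
def decoded_shape(latent_shape, temporal_factor=4, spatial_factor=8, out_channels=3):
    """Map a latent shape (B, z_dim, F_latent, H_latent, W_latent) to a pixel shape."""
    batch, _z_dim, f_latent, h_latent, w_latent = latent_shape
    # Assumption: first latent frame -> 1 frame, every later one -> temporal_factor frames.
    frames = (f_latent - 1) * temporal_factor + 1
    return (
        batch,
        out_channels,
        frames,
        h_latent * spatial_factor,
        w_latent * spatial_factor,
    )


pixel_shape = decoded_shape((1, 16, 21, 60, 104))
```

For the latent shape used as the example in Key Parameters, this gives an 81-frame clip at 480x832.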

export_to_video

Inputs:

List of frames as numpy arrays (shape H, W, 3, values in [0, 1]) or PIL images

Outputs:

str: Path to the saved MP4 file
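
Since `decode` returns values in [-1, 1] with layout (B, 3, F, H, W) while `export_to_video` expects per-frame (H, W, 3) arrays in [0, 1], a conversion step sits between the two. In the pipeline this is handled by the video processor; the sketch below shows roughly what such a conversion involves for a single batch element, using NumPy only. `to_export_frames` is a hypothetical helper.

```python
import numpy as np


def to_export_frames(video):
    """Convert a (3, F, H, W) array in [-1, 1] to a list of F (H, W, 3) arrays in [0, 1]."""
    video = np.clip(video / 2.0 + 0.5, 0.0, 1.0)  # rescale [-1, 1] -> [0, 1]
    video = np.transpose(video, (1, 2, 3, 0))     # channels-first -> (F, H, W, 3)
    return [frame for frame in video]


# Random stand-in for one decoded batch element: 5 frames of 8x16 pixels.
rng = np.random.default_rng(0)
video = rng.uniform(-1.0, 1.0, size=(3, 5, 8, 16)).astype(np.float32)
frames = to_export_frames(video)
```

The resulting list has the layout that the NumPy branch of `export_to_video` multiplies by 255 and casts to `uint8`.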

External Dependencies

imageio + imageio-ffmpeg (preferred backend)
opencv-python (legacy fallback)

Usage Examples

Complete Pipeline with Export

import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

output = pipe(
    prompt="A cat and a dog baking a cake together in a kitchen.",
    negative_prompt="blurred, low quality",
    height=720, width=1280, num_frames=81,
    guidance_scale=5.0, num_inference_steps=50,
)

# Export the first batch element's frames to MP4
export_to_video(output.frames[0], "output.mp4", fps=16)
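
The `fps` passed to `export_to_video` determines playback length for a fixed number of frames. A quick arithmetic check, with `clip_duration_seconds` as a hypothetical helper:

```python
def clip_duration_seconds(num_frames, fps):
    """Playback duration of a clip with num_frames frames shown at fps frames per second."""
    return num_frames / fps


duration = clip_duration_seconds(81, 16)  # the settings used in the example above
```

With 81 frames at 16 fps the clip runs a little over five seconds; exporting the same frames at the default `fps=10` would stretch it to about eight.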

Decoding Latents Manually

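# Continues from the previous example: `pipe` is the WanPipeline built there and `torch` is imported.
# With output_type="latent", `.frames` holds the raw (still normalized) latents instead of decoded frames.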
latents = pipe(prompt="...", output_type="latent").frames

# Denormalize latents. Note that `latents_std` below holds the reciprocal 1/std,
# so dividing by it multiplies by std.
latents_mean = torch.tensor(pipe.vae.config.latents_mean).view(1, 16, 1, 1, 1).to(latents.device, latents.dtype)
latents_std = 1.0 / torch.tensor(pipe.vae.config.latents_std).view(1, 16, 1, 1, 1).to(latents.device, latents.dtype)
latents = latents / latents_std + latents_mean

# Decode
video = pipe.vae.decode(latents.to(pipe.vae.dtype), return_dict=False)[0]
frames = pipe.video_processor.postprocess_video(video, output_type="np")
export_to_video(frames[0], "manual_decode.mp4", fps=16)
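
The denormalization step is easy to get backwards because the variable named `latents_std` stores a reciprocal. The round trip can be checked on toy data with NumPy. The per-channel statistics below are made up for illustration, and the forward normalization `(raw - mean) / std` is an assumption about how the latents were scaled.

```python
import numpy as np

z_dim = 4  # hypothetical channel count for the toy example; the example above uses 16

# Made-up per-channel statistics, reshaped to broadcast over (B, C, F, H, W).
mean = np.array([0.1, -0.2, 0.3, 0.0]).reshape(1, z_dim, 1, 1, 1)
std = np.array([2.0, 0.5, 1.5, 1.0]).reshape(1, z_dim, 1, 1, 1)

rng = np.random.default_rng(0)
raw = rng.normal(size=(1, z_dim, 3, 2, 2))

# Assumed forward normalization applied before the latents reach the caller.
normalized = (raw - mean) / std

# Mirror the snippet above: keep the reciprocal, then divide by it.
inv_std = 1.0 / std
restored = normalized / inv_std + mean
```

Dividing by the reciprocal and adding the mean recovers the original values, which is what the VAE decoder expects as input.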

Related Pages

Huggingface_Diffusers_Video_Decoding_Export (principle for this implementation) - Theory of VAE decoding and video export
Huggingface_Diffusers_WanTransformer3DModel_Forward (produces latents) - Denoising produces the latents decoded here
Huggingface_Diffusers_Video_Memory_Setup (configures tiling) - Tiling must be enabled before decoding
Huggingface_Diffusers_VideoProcessor (postprocessing) - Handles format conversion after decoding
