Implementation:Huggingface Diffusers Export To Video

From Leeroopedia
Revision as of 13:03, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Diffusers_Export_To_Video.md)
Field | Value
Type | API Doc
Overview | VAE decoding of video latents via AutoencoderKLWan.decode and export to MP4 via export_to_video
Domains | Video Generation, VAE Decoding, Video Export
Workflow | Video_Generation
Related Principle | Huggingface_Diffusers_Video_Decoding_Export
Source | src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1180-L1234, src/diffusers/utils/export_utils.py:L140-L208
Last Updated | 2026-02-13 00:00 GMT

Code Reference

AutoencoderKLWan.decode

Source: src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1211-L1234

@apply_forward_hook
def decode(self, z: torch.Tensor, return_dict: bool = True) -> DecoderOutput | torch.Tensor:
    """
    Decode a batch of images.

    Args:
        z: Input batch of latent vectors.
        return_dict: Whether to return a DecoderOutput instead of a plain tuple.
    """
    if self.use_slicing and z.shape[0] > 1:
        decoded_slices = [self._decode(z_slice).sample for z_slice in z.split(1)]
        decoded = torch.cat(decoded_slices)
    else:
        decoded = self._decode(z).sample

    if not return_dict:
        return (decoded,)
    return DecoderOutput(sample=decoded)

AutoencoderKLWan._decode (Frame-by-Frame)

Source: src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1180-L1209

def _decode(self, z: torch.Tensor, return_dict: bool = True):
    _, _, num_frame, height, width = z.shape
    tile_latent_min_height = self.tile_sample_min_height // self.spatial_compression_ratio
    tile_latent_min_width = self.tile_sample_min_width // self.spatial_compression_ratio

    if self.use_tiling and (width > tile_latent_min_width or height > tile_latent_min_height):
        return self.tiled_decode(z, return_dict=return_dict)

    self.clear_cache()
    x = self.post_quant_conv(z)
    for i in range(num_frame):
        self._conv_idx = [0]
        if i == 0:
            out = self.decoder(x[:, :, i:i+1, :, :], feat_cache=self._feat_map,
                             feat_idx=self._conv_idx, first_chunk=True)
        else:
            out_ = self.decoder(x[:, :, i:i+1, :, :], feat_cache=self._feat_map,
                              feat_idx=self._conv_idx)
            out = torch.cat([out, out_], 2)

    out = torch.clamp(out, min=-1.0, max=1.0)
    self.clear_cache()
    if not return_dict:
        return (out,)
    return DecoderOutput(sample=out)
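
The loop above decodes one latent frame at a time while feat_cache carries state between decoder calls. As a toy analog only (not the Wan decoder), the sketch below suggests why caching the most recent inputs lets a causal temporal convolution process frames one by one and still match a full-sequence pass. The function names, the zero left-padding, and the scalar "frames" are all assumptions made for illustration.

```python
def causal_conv_full(frames, kernel):
    """Causal 1D convolution over the whole sequence (zero left-padding)."""
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(frames)
    return [
        sum(k * padded[t + j] for j, k in enumerate(kernel))
        for t in range(len(frames))
    ]


def causal_conv_streaming(frames, kernel):
    """Same convolution, one frame per step, with a cache of past inputs."""
    pad = len(kernel) - 1
    cache = [0.0] * pad  # hypothetical stand-in for a feature cache
    out = []
    for frame in frames:
        window = cache + [frame]
        out.append(sum(k * x for k, x in zip(kernel, window)))
        cache = window[1:]  # keep only the most recent `pad` inputs
    return out


frames = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.25]
full = causal_conv_full(frames, kernel)
streamed = causal_conv_streaming(frames, kernel)
```

In this toy setting both passes produce the same outputs; the streaming version only ever holds one frame plus the cache in memory at a time.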

export_to_video

Source: src/diffusers/utils/export_utils.py:L140-L208

def export_to_video(
    video_frames: list[np.ndarray] | list[PIL.Image.Image],
    output_video_path: str = None,
    fps: int = 10,
    quality: float = 5.0,
    bitrate: int | None = None,
    macro_block_size: int | None = 16,
) -> str:
    """Export video frames to an MP4 file using imageio + FFmpeg."""
    if output_video_path is None:
        output_video_path = tempfile.NamedTemporaryFile(suffix=".mp4").name

    if isinstance(video_frames[0], np.ndarray):
        video_frames = [(frame * 255).astype(np.uint8) for frame in video_frames]
    elif isinstance(video_frames[0], PIL.Image.Image):
        video_frames = [np.array(frame) for frame in video_frames]

    with imageio.get_writer(
        output_video_path, fps=fps, quality=quality,
        bitrate=bitrate, macro_block_size=macro_block_size
    ) as writer:
        for frame in video_frames:
            writer.append_data(frame)

    return output_video_path
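
The np.ndarray branch above multiplies by 255 and casts to uint8, so float frames are expected in [0, 1]. Decoder output is clamped to [-1, 1], so it would need rescaling before export. A minimal sketch of that conversion, assuming NumPy is installed; to_uint8 is a hypothetical helper and not part of diffusers:

```python
import numpy as np


def to_uint8(decoded):
    """Map values in [-1, 1] to uint8 pixels in [0, 255]."""
    unit = np.clip(decoded / 2.0 + 0.5, 0.0, 1.0)  # [-1, 1] -> [0, 1]
    # Same scaling as the listing above; astype truncates toward zero.
    return (unit * 255).astype(np.uint8)


decoded = np.array([[-1.0, 0.0, 1.0]], dtype=np.float32)
pixels = to_uint8(decoded)
```

In the usage examples on this page, pipe.video_processor.postprocess_video performs the rescaling step, so this helper would only matter when handling raw decoder output directly.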

Import

from diffusers.utils import export_to_video
from diffusers import AutoencoderKLWan

Key Parameters

AutoencoderKLWan.decode

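Parameter | Type | Description | Default
z | torch.Tensor, shape (B, z_dim, F, H, W) | Latent tensor to decode, e.g., (1, 16, 21, 60, 104) | (required)
return_dict | bool | If True, return a DecoderOutput; otherwise return a tuple whose first element is the decoded tensor | True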

export_to_video

Parameter | Type | Description | Default
video_frames | list[np.ndarray] or list[PIL.Image.Image] | Frames to export | (required)
output_video_path | str or None | Output file path | None (auto-generated temporary .mp4 path)
fps | int | Frames per second | 10
quality | float | Variable-bitrate quality (0-10) | 5.0
bitrate | int or None | Fixed bitrate (overrides quality) | None
macro_block_size | int or None | Codec macroblock size constraint: frame width and height should be divisible by this value | 16
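
To make the macro_block_size constraint concrete: when a frame dimension is not divisible by the block size, imageio's FFmpeg writer asks FFmpeg to scale the video up to the next multiple. The sketch below estimates the resulting size for integer block sizes; encoded_size is a hypothetical helper written for illustration, not an imageio or diffusers function.

```python
def encoded_size(width, height, macro_block_size=16):
    """Round each dimension up to the next multiple of macro_block_size."""
    if macro_block_size < 1:
        raise ValueError("macro_block_size must be a positive integer")

    def round_up(value):
        # Ceiling division, then scale back to a multiple of the block size.
        return -(-value // macro_block_size) * macro_block_size

    return round_up(width), round_up(height)


aligned = encoded_size(1280, 720)       # already divisible by 16
resized = encoded_size(854, 480)        # width is not divisible by 16
unconstrained = encoded_size(854, 480, macro_block_size=1)
```

Generating at a resolution that is already a multiple of the block size, such as 1280x720 or 832x480, avoids the extra resize.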

I/O Contract

decode

Inputs:

  • z: 5D latent tensor (B, 16, F_latent, H_latent, W_latent)

Outputs:

  • DecoderOutput with .sample: 5D pixel tensor (B, 3, F, H, W) clamped to [-1, 1]
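  • Where H = H_latent * 8 and W = W_latent * 8. F is approximately F_latent * scale_factor_temporal; for the Wan 2.1 VAE (temporal factor 4) the first latent frame decodes to a single frame, so F = (F_latent - 1) * 4 + 1 (e.g., 21 latent frames give 81 frames)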
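
The shape relationship can be sketched as a small helper. It assumes the Wan 2.1 convention (spatial factor 8, temporal factor 4, first latent frame mapping to a single output frame); decoded_shape and its default arguments are hypothetical and should be checked against the VAE config in use.

```python
def decoded_shape(latent_shape, spatial_factor=8, temporal_factor=4, out_channels=3):
    """Estimate the pixel-space shape produced by decoding a 5D latent."""
    batch, _z_dim, f_latent, h_latent, w_latent = latent_shape
    # Assumption: the first latent frame yields one frame, the rest yield
    # `temporal_factor` frames each.
    frames = (f_latent - 1) * temporal_factor + 1
    return (
        batch,
        out_channels,
        frames,
        h_latent * spatial_factor,
        w_latent * spatial_factor,
    )


shape = decoded_shape((1, 16, 21, 60, 104))
```

For the example latent shape (1, 16, 21, 60, 104) from the parameter table, this gives an 81-frame video at 480x832.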

export_to_video

Inputs:

  • List of frames as numpy arrays (shape H, W, 3, values in [0, 1]) or PIL images

Outputs:

  • str: Path to the saved MP4 file
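
When output_video_path is omitted, the path comes from a temporary file name, as shown in the listing earlier on this page. A caller who wants a predictable location could build the path explicitly. A minimal sketch using only the standard library; make_output_path and the "diffusers_export_" prefix are hypothetical names chosen for illustration:

```python
import os
import tempfile


def make_output_path(directory=None, stem="video"):
    """Return an .mp4 path inside an existing (or newly created) directory."""
    if directory is None:
        # mkdtemp creates a directory that persists until the caller removes it.
        directory = tempfile.mkdtemp(prefix="diffusers_export_")
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, f"{stem}.mp4")


path = make_output_path(stem="clip")
```

The returned string can then be passed as the output_video_path argument of export_to_video.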

External Dependencies

  • imageio + imageio-ffmpeg (preferred backend)
  • opencv-python (legacy fallback)

Usage Examples

Complete Pipeline with Export

import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

output = pipe(
    prompt="A cat and a dog baking a cake together in a kitchen.",
    negative_prompt="blurred, low quality",
    height=720, width=1280, num_frames=81,
    guidance_scale=5.0, num_inference_steps=50,
)

# Export the first batch element's frames to MP4
export_to_video(output.frames[0], "output.mp4", fps=16)

Decoding Latents Manually

# Reuses `pipe` from the example above.
# Request raw latents instead of decoded frames (output_type="latent")
latents = pipe(prompt="...", output_type="latent").frames

# Denormalize latents. Note that `latents_std` holds the reciprocal 1/std,
# so dividing by it is the same as latents * std + mean.
latents_mean = torch.tensor(pipe.vae.config.latents_mean).view(1, 16, 1, 1, 1).to(latents.device, latents.dtype)
latents_std = 1.0 / torch.tensor(pipe.vae.config.latents_std).view(1, 16, 1, 1, 1).to(latents.device, latents.dtype)
latents = latents / latents_std + latents_mean

# Decode to a pixel tensor in [-1, 1], then convert to frames in [0, 1]
video = pipe.vae.decode(latents.to(pipe.vae.dtype), return_dict=False)[0]
frames = pipe.video_processor.postprocess_video(video, output_type="np")
export_to_video(frames[0], "manual_decode.mp4", fps=16)
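
The denormalization step divides by a reciprocal standard deviation, which can be easy to misread. The pure-Python sketch below mirrors that arithmetic on per-channel scalars to show it is equivalent to z * std + mean; the function name and the sample numbers are made up for illustration and are not values from any VAE config.

```python
def denormalize(normalized, mean, std):
    """Undo per-channel normalization using the reciprocal-std form."""
    inv_std = [1.0 / s for s in std]  # mirrors `latents_std = 1.0 / ...`
    return [z / i + m for z, i, m in zip(normalized, inv_std, mean)]


# Hypothetical two-channel example
normalized = [0.5, 0.25]
mean = [1.0, -1.0]
std = [2.0, 4.0]

restored = denormalize(normalized, mean, std)
direct = [z * s + m for z, s, m in zip(normalized, std, mean)]
```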

Related Pages

Principle:Huggingface_Diffusers_Video_Decoding_Export
