Implementation:Huggingface Diffusers Export To Video
| Field | Value |
|---|---|
| Type | API Doc |
| Overview | VAE decoding of video latents via AutoencoderKLWan.decode and export to MP4 via export_to_video |
| Domains | Video Generation, VAE Decoding, Video Export |
| Workflow | Video_Generation |
| Related Principle | Huggingface_Diffusers_Video_Decoding_Export |
| Source | src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1180-L1234, src/diffusers/utils/export_utils.py:L140-L208 |
| Last Updated | 2026-02-13 00:00 GMT |
Code Reference
AutoencoderKLWan.decode
Source: src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1211-L1234
@apply_forward_hook
def decode(self, z: torch.Tensor, return_dict: bool = True) -> DecoderOutput | torch.Tensor:
    """
    Decode a batch of images.
    Args:
        z: Input batch of latent vectors.
        return_dict: Whether to return a DecoderOutput instead of a plain tuple.
    """
    if self.use_slicing and z.shape[0] > 1:
        decoded_slices = [self._decode(z_slice).sample for z_slice in z.split(1)]
        decoded = torch.cat(decoded_slices)
    else:
        decoded = self._decode(z).sample
    if not return_dict:
        return (decoded,)
    return DecoderOutput(sample=decoded)
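The slicing branch above can be illustrated without any tensors. The following is a minimal sketch, assuming a toy decoder that works on plain Python lists; `decode_batch` and `toy_decode` are hypothetical names used only for illustration and are not part of diffusers.

```python
from typing import Callable, List, Sequence


def decode_batch(
    latents: Sequence[float],
    decode_fn: Callable[[List[float]], List[float]],
    use_slicing: bool,
) -> List[float]:
    """Toy stand-in for decode(): each list element plays the role of one batch sample."""
    if use_slicing and len(latents) > 1:
        # One decode_fn call per sample, results concatenated in batch order.
        decoded: List[float] = []
        for sample in latents:
            decoded.extend(decode_fn([sample]))
        return decoded
    # Single call on the whole batch.
    return decode_fn(list(latents))


calls: List[int] = []


def toy_decode(batch: List[float]) -> List[float]:
    # Hypothetical decoder: records the batch size it saw and doubles each value.
    calls.append(len(batch))
    return [2.0 * v for v in batch]


full = decode_batch([1.0, 2.0, 3.0], toy_decode, use_slicing=False)
sliced = decode_batch([1.0, 2.0, 3.0], toy_decode, use_slicing=True)
```

Both paths return the same values; the sliced path only changes how many samples the decoder sees per call, which is what bounds peak memory in the real method.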
AutoencoderKLWan._decode (Frame-by-Frame)
Source: src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1180-L1209
def _decode(self, z: torch.Tensor, return_dict: bool = True):
    _, _, num_frame, height, width = z.shape
    tile_latent_min_height = self.tile_sample_min_height // self.spatial_compression_ratio
    tile_latent_min_width = self.tile_sample_min_width // self.spatial_compression_ratio
    if self.use_tiling and (width > tile_latent_min_width or height > tile_latent_min_height):
        return self.tiled_decode(z, return_dict=return_dict)
    self.clear_cache()
    x = self.post_quant_conv(z)
    for i in range(num_frame):
        self._conv_idx = [0]
        if i == 0:
            out = self.decoder(x[:, :, i:i+1, :, :], feat_cache=self._feat_map,
                               feat_idx=self._conv_idx, first_chunk=True)
        else:
            out_ = self.decoder(x[:, :, i:i+1, :, :], feat_cache=self._feat_map,
                                feat_idx=self._conv_idx)
            out = torch.cat([out, out_], 2)
    out = torch.clamp(out, min=-1.0, max=1.0)
    self.clear_cache()
    if not return_dict:
        return (out,)
    return DecoderOutput(sample=out)
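The loop above feeds one latent frame at a time while `feat_cache` carries state between calls. The idea can be sketched with a toy 1D causal convolution: processing frame by frame with a small cache gives the same result as processing the whole clip at once. This is an illustrative assumption about why a cache is needed, not the library's actual convolution code; `KERNEL` and both function names are hypothetical.

```python
from typing import List

KERNEL = [0.25, 0.5, 1.0]  # hypothetical weights for a causal temporal kernel of size 3


def causal_conv_full(frames: List[float]) -> List[float]:
    """Reference: process the whole clip at once with zero left-padding."""
    padded = [0.0, 0.0] + frames
    return [
        sum(w * padded[t + k] for k, w in enumerate(KERNEL))
        for t in range(len(frames))
    ]


def causal_conv_chunked(frames: List[float]) -> List[float]:
    """Process one frame per step, carrying the last two inputs in a cache."""
    cache = [0.0, 0.0]  # starts empty, loosely analogous to calling clear_cache()
    out: List[float] = []
    for frame in frames:
        window = cache + [frame]
        out.append(sum(w * x for w, x in zip(KERNEL, window)))
        cache = window[1:]  # keep only the inputs the next frame will need
    return out


clip = [1.0, 2.0, 3.0, 4.0]
full_out = causal_conv_full(clip)
chunked_out = causal_conv_chunked(clip)
```

Because the kernel only looks backward in time, the cache needs just the most recent inputs, so memory use stays flat regardless of clip length.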
export_to_video
Source: src/diffusers/utils/export_utils.py:L140-L208
def export_to_video(
    video_frames: list[np.ndarray] | list[PIL.Image.Image],
    output_video_path: str = None,
    fps: int = 10,
    quality: float = 5.0,
    bitrate: int | None = None,
    macro_block_size: int | None = 16,
) -> str:
    """Export video frames to an MP4 file using imageio + FFmpeg."""
    if output_video_path is None:
        output_video_path = tempfile.NamedTemporaryFile(suffix=".mp4").name
    if isinstance(video_frames[0], np.ndarray):
        video_frames = [(frame * 255).astype(np.uint8) for frame in video_frames]
    elif isinstance(video_frames[0], PIL.Image.Image):
        video_frames = [np.array(frame) for frame in video_frames]
    with imageio.get_writer(
        output_video_path, fps=fps, quality=quality,
        bitrate=bitrate, macro_block_size=macro_block_size
    ) as writer:
        for frame in video_frames:
            writer.append_data(frame)
    return output_video_path
Import
from diffusers.utils import export_to_video
from diffusers import AutoencoderKLWan
Key Parameters
AutoencoderKLWan.decode
| Parameter | Type | Description |
|---|---|---|
| z | torch.Tensor (B, z_dim, F, H, W) | Latent tensor to decode, e.g., (1, 16, 21, 60, 104) |
| return_dict | bool | Return DecoderOutput or raw tuple |
export_to_video
| Parameter | Type | Description | Default |
|---|---|---|---|
| video_frames | list[np.ndarray] or list[PIL.Image.Image] | Frames to export | (required) |
| output_video_path | str or None | Output file path | Auto-generated temp file |
| fps | int | Frames per second | 10 |
| quality | float | Variable bitrate quality (0-10) | 5.0 |
| bitrate | int or None | Fixed bitrate (overrides quality) | None |
| macro_block_size | int or None | Codec macroblock size constraint | 16 |
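To make the macroblock constraint concrete, here is a small sketch of the size adjustment a writer might apply. It assumes that frame dimensions not divisible by `macro_block_size` are rounded up to the next multiple, which is the behavior commonly reported for imageio's FFmpeg writer; treat that as an assumption rather than a guarantee. `padded_frame_size` is a hypothetical helper, not a diffusers or imageio function.

```python
from typing import Optional, Tuple


def padded_frame_size(
    width: int, height: int, macro_block_size: Optional[int] = 16
) -> Tuple[int, int]:
    """Round each dimension up to the next multiple of macro_block_size."""
    if not macro_block_size or macro_block_size <= 1:
        return width, height  # None or 1 means no size constraint

    def round_up(value: int) -> int:
        # Ceiling division, then scale back to a multiple of the block size.
        return -(-value // macro_block_size) * macro_block_size

    return round_up(width), round_up(height)
```

Under this assumption, resolutions such as 1280x720 or 832x480 pass through unchanged, while a width of 854 would be adjusted to 864.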
I/O Contract
decode
Inputs:
- z: 5D latent tensor (B, 16, F_latent, H_latent, W_latent)
Outputs:
- DecoderOutput with .sample: 5D pixel tensor (B, 3, F, H, W) clamped to [-1, 1]
- Where F = F_latent * scale_factor_temporal (approximately), H = H_latent * 8, W = W_latent * 8
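The shape relationship can be worked through numerically. The sketch below assumes a temporal factor of 4 and that the first latent frame maps to a single output frame, so F = (F_latent - 1) * 4 + 1; this is an assumption that matches the common 21-latent-frame to 81-frame case and explains the "approximately" above. `decoded_shape` is a hypothetical helper.

```python
from typing import Tuple

Shape5D = Tuple[int, int, int, int, int]


def decoded_shape(
    latent_shape: Shape5D,
    spatial_ratio: int = 8,    # from the contract above: H = H_latent * 8
    temporal_ratio: int = 4,   # assumed value of scale_factor_temporal
    out_channels: int = 3,
) -> Shape5D:
    """Estimate the pixel-space shape produced by decoding a latent tensor."""
    batch, _, f_latent, h_latent, w_latent = latent_shape
    # Assumption: the first latent frame is not temporally expanded.
    frames = (f_latent - 1) * temporal_ratio + 1
    return (batch, out_channels, frames, h_latent * spatial_ratio, w_latent * spatial_ratio)


example = decoded_shape((1, 16, 21, 60, 104))
```

For the example latent shape from the parameter table, this yields an 81-frame clip at 480x832.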
export_to_video
Inputs:
- List of frames as numpy arrays (shape H, W, 3, values in [0, 1]) or PIL images
Outputs:
- str: Path to the saved MP4 file
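The two value ranges in this contract differ: decode returns values in [-1, 1], while export_to_video expects numpy frames in [0, 1] and multiplies them by 255. The scalar sketch below shows the conversion chain, assuming postprocessing maps x to x / 2 + 0.5 with clamping (an assumption about the video processor, not confirmed by this page). Both function names are hypothetical.

```python
def denormalize(value: float) -> float:
    """Map a decoder output in [-1, 1] to [0, 1], clamping out-of-range values."""
    return min(max(value / 2 + 0.5, 0.0), 1.0)


def to_uint8(value: float) -> int:
    """Scalar analogue of (frame * 255).astype(np.uint8) for values in [0, 1]."""
    # int() truncates toward zero, like the integer cast on non-negative floats.
    return int(value * 255)


black = to_uint8(denormalize(-1.0))
mid_gray = to_uint8(denormalize(0.0))
white = to_uint8(denormalize(1.0))
```

Passing raw decoder output straight to export_to_video would skip the first step, and negative values do not map cleanly onto unsigned 8-bit pixels.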
External Dependencies
- imageio + imageio-ffmpeg (preferred backend)
- opencv-python (legacy fallback)
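Before exporting, it can help to check which of these packages is installed. The following sketch uses only the standard library; `available_video_backend` is a hypothetical helper, and the preference order simply mirrors the list above rather than any logic taken from diffusers.

```python
import importlib.util


def available_video_backend() -> str:
    """Report which export backend appears to be importable in this environment."""
    has_imageio = importlib.util.find_spec("imageio") is not None
    has_ffmpeg = importlib.util.find_spec("imageio_ffmpeg") is not None
    if has_imageio and has_ffmpeg:
        return "imageio"  # preferred backend
    if importlib.util.find_spec("cv2") is not None:
        return "opencv"   # legacy fallback (opencv-python installs the cv2 module)
    return "none"


backend = available_video_backend()
```

A result of "none" suggests installing imageio and imageio-ffmpeg before calling export_to_video.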
Usage Examples
Complete Pipeline with Export
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video
model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
output = pipe(
    prompt="A cat and a dog baking a cake together in a kitchen.",
    negative_prompt="blurred, low quality",
    height=720, width=1280, num_frames=81,
    guidance_scale=5.0, num_inference_steps=50,
)
# Export the first batch element's frames to MP4
export_to_video(output.frames[0], "output.mp4", fps=16)
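The `fps` argument determines playback length for a fixed number of frames, which is why the example passes `fps=16` instead of relying on the default of 10. A minimal sketch of the arithmetic, using a hypothetical helper:

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Playback duration of a clip with num_frames frames at the given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


at_16_fps = clip_duration_seconds(81, 16)
at_default_fps = clip_duration_seconds(81, 10)
```

The 81 generated frames play for about 5 seconds at 16 fps, but would stretch to about 8 seconds, and look slowed down, at the default rate.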
Decoding Latents Manually
# If you have raw latents (e.g., from output_type="latent")
latents = pipe(prompt="...", output_type="latent").frames
# Denormalize latents
latents_mean = torch.tensor(pipe.vae.config.latents_mean).view(1, 16, 1, 1, 1).to(latents.device, latents.dtype)
latents_std = 1.0 / torch.tensor(pipe.vae.config.latents_std).view(1, 16, 1, 1, 1).to(latents.device, latents.dtype)
latents = latents / latents_std + latents_mean
# Decode
video = pipe.vae.decode(latents.to(pipe.vae.dtype), return_dict=False)[0]
frames = pipe.video_processor.postprocess_video(video, output_type="np")
export_to_video(frames[0], "manual_decode.mp4", fps=16)
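Note that `latents_std` in the snippet above holds the reciprocal of the configured standard deviation, so dividing by it is the same as multiplying by the standard deviation. The scalar sketch below checks that identity. It assumes latents were normalized as (x - mean) / std, which is the inverse of the formula shown; the numeric values and function names are hypothetical.

```python
import math


def normalize(x: float, mean: float, std: float) -> float:
    """Assumed forward normalization applied to latents before denoising."""
    return (x - mean) / std


def denormalize_latent(z: float, mean: float, std: float) -> float:
    """Same arithmetic as the snippet: divide by 1/std, then add the mean."""
    inv_std = 1.0 / std
    return z / inv_std + mean


mean, std = -0.5, 2.0  # hypothetical per-channel statistics
restored = denormalize_latent(normalize(1.7, mean, std), mean, std)
```

In the real code the same operation is broadcast per channel through the (1, 16, 1, 1, 1) view.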
Related Pages
- Huggingface_Diffusers_Video_Decoding_Export (principle for this implementation) - Theory of VAE decoding and video export
- Huggingface_Diffusers_WanTransformer3DModel_Forward (produces latents) - Denoising produces the latents decoded here
- Huggingface_Diffusers_Video_Memory_Setup (configures tiling) - Tiling must be enabled before decoding
- Huggingface_Diffusers_VideoProcessor (postprocessing) - Handles format conversion after decoding