Principle:Huggingface Diffusers Video Input Preparation

From Leeroopedia
Principle Name: Video Input Preparation
Overview: Preprocessing video frames and reference images for video generation pipelines, including normalization, resizing, and format conversion
Domains: Video Generation, Image Processing
Related Implementation: Huggingface_Diffusers_VideoProcessor
Knowledge Sources: Repo (https://github.com/huggingface/diffusers), Source (src/diffusers/video_processor.py:L27-L160)
Last Updated: 2026-02-13 00:00 GMT

Description

Video generation pipelines require inputs in a specific tensor format that matches the VAE's expected input shape. The VideoProcessor class handles the bidirectional conversion between user-friendly formats (PIL images, numpy arrays) and the internal (B, C, F, H, W) tensor format used by the pipeline's VAE and transformer.

The two core operations are:

  1. Preprocessing (preprocess_video) - Converts diverse input formats to a normalized torch.Tensor with shape (batch, channels, frames, height, width) and values in [-1, 1].
  2. Postprocessing (postprocess_video) - Converts the decoded video tensor back to user-requested formats (numpy, PIL, or torch).

Theoretical Basis

Input Format Normalization

The VideoProcessor accepts a wide variety of input types:

  • List of PIL Images - Each image is one frame; the list is one video
  • List of lists of PIL Images - Multiple videos as batch
  • 4D Torch tensors - Shape (F, C, H, W), one video
  • 4D NumPy arrays - Shape (F, H, W, C), one video
  • 5D tensors/arrays - Shape (B, F, C, H, W) or (B, F, H, W, C), batched videos

All inputs are unified to 5D tensors with shape (B, C, F, H, W) through:

  1. Detecting and handling batch vs. single video inputs
  2. Processing each video's frames with the parent-class method VaeImageProcessor.preprocess(), which handles per-frame resizing, center cropping, and normalization to [-1, 1], then stacking the results into a (B, F, C, H, W) tensor
  3. Permuting dimensions: after stacking, the channels are moved before the frame dimension with video.permute(0, 2, 1, 3, 4)
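
The layout change can be illustrated without the library. The following is a minimal NumPy sketch, not the VideoProcessor implementation: the helper name to_bcfhw is hypothetical, it assumes uint8 channels-last input, and it skips resizing entirely.

```python
import numpy as np


def to_bcfhw(video: np.ndarray) -> np.ndarray:
    """Hypothetical helper: uint8 (F, H, W, C) or (B, F, H, W, C) frames
    -> float32 (B, C, F, H, W) with values in [-1, 1]."""
    if video.ndim == 4:
        video = video[None]  # single video: add a batch dimension
    if video.ndim != 5:
        raise ValueError(f"expected a 4D or 5D array, got {video.ndim}D")
    video = video.astype(np.float32) / 255.0  # [0, 255] -> [0, 1]
    video = 2.0 * video - 1.0                 # [0, 1]   -> [-1, 1]
    video = video.transpose(0, 1, 4, 2, 3)    # (B, F, H, W, C) -> (B, F, C, H, W)
    video = video.transpose(0, 2, 1, 3, 4)    # (B, F, C, H, W) -> (B, C, F, H, W)
    return video


# 16 frames of 64x96 RGB: red channel fully on, green and blue off.
frames = np.zeros((16, 64, 96, 3), dtype=np.uint8)
frames[..., 0] = 255
out = to_bcfhw(frames)
print(out.shape)
```

The final transpose uses the same axis order, (0, 2, 1, 3, 4), as the permute call described in step 3.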

VAE-Compatible Resizing

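The VideoProcessor is constructed with a vae_scale_factor that the pipeline derives from its VAE configuration; the class itself inherits from VaeImageProcessor. Spatial dimensions must be divisible by this scale factor (typically 8), and when no target size is given the processor rounds the input's height and width down to the nearest multiple. Frames are then resized to the target size with the processor's configured resampling filter (lanczos is the VaeImageProcessor default for PIL inputs). The pipeline itself enforces additional constraints: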

  • Wan: Height and width must be divisible by 16, and further by vae_scale_factor_spatial * patch_size
  • HunyuanVideo: Height and width must be divisible by 16
  • CogVideoX: Height and width must be divisible by 8
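
These divisibility rules amount to simple integer arithmetic. The sketch below is an illustration only: round_down_to_multiple and validate_resolution are hypothetical helpers, not diffusers functions, and the multiple of 16 is taken from the Wan and HunyuanVideo constraints listed above.

```python
def round_down_to_multiple(value: int, multiple: int) -> int:
    """Largest multiple of `multiple` that does not exceed `value`."""
    return (value // multiple) * multiple


def validate_resolution(height: int, width: int, multiple: int) -> None:
    """Raise if either dimension is not divisible by `multiple`."""
    if height % multiple != 0 or width % multiple != 0:
        raise ValueError(
            f"height and width must be divisible by {multiple}, "
            f"got {height}x{width}"
        )


# Snap an arbitrary request to a size that passes a divisible-by-16 check.
height, width = 721, 1283
safe_height = round_down_to_multiple(height, 16)
safe_width = round_down_to_multiple(width, 16)
validate_resolution(safe_height, safe_width, 16)
print(safe_height, safe_width)  # 720 1280
```

Rounding down rather than up is a choice made here so the result never exceeds the requested size; a caller could equally round to the nearest multiple.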

Temporal Consistency

Frame count must satisfy temporal compression requirements:

  • num_latent_frames = (num_frames - 1) // vae_scale_factor_temporal + 1
  • For Wan: num_frames must satisfy (num_frames - 1) % 4 == 0 (e.g., 81 frames -> 21 latent frames)
  • If the constraint is not met, the pipeline adjusts num_frames to a nearby valid frame count rather than failing
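
The frame-count arithmetic can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the function names are hypothetical, the temporal factor of 4 comes from the Wan example above, and nearest_valid_frame_count shows one possible adjustment policy (nearest valid count, ties rounding down), which is not necessarily the exact rule any given pipeline applies.

```python
def num_latent_frames(num_frames: int, temporal_factor: int = 4) -> int:
    """Latent frame count after temporal compression."""
    return (num_frames - 1) // temporal_factor + 1


def is_valid_frame_count(num_frames: int, temporal_factor: int = 4) -> bool:
    """True when (num_frames - 1) is a multiple of the temporal factor."""
    return num_frames >= 1 and (num_frames - 1) % temporal_factor == 0


def nearest_valid_frame_count(num_frames: int, temporal_factor: int = 4) -> int:
    """Illustrative policy: closest n >= 1 with (n - 1) % temporal_factor == 0."""
    num_frames = max(num_frames, 1)
    lower = (num_frames - 1) // temporal_factor * temporal_factor + 1
    upper = lower + temporal_factor
    return lower if num_frames - lower <= upper - num_frames else upper


print(num_latent_frames(81))          # 21
print(is_valid_frame_count(80))       # False
print(nearest_valid_frame_count(80))  # 81
```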

Postprocessing

After VAE decoding, the output tensor has shape (B, C, F, H, W) with values in [-1, 1]. Postprocessing:

  1. Permutes each batch element to (F, C, H, W)
  2. Delegates to VaeImageProcessor.postprocess() for per-frame denormalization and format conversion
  3. Returns numpy arrays (default), PIL images, or torch tensors based on output_type
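
The steps above can be sketched in NumPy as follows. This is an approximation for illustration rather than the library code: from_bcfhw is a hypothetical name, and it covers only the array output path, returning channels-last frames in [0, 1].

```python
import numpy as np


def from_bcfhw(video: np.ndarray) -> np.ndarray:
    """Hypothetical helper: float (B, C, F, H, W) in [-1, 1]
    -> float (B, F, H, W, C) in [0, 1]."""
    outputs = []
    for batch_idx in range(video.shape[0]):
        frames = video[batch_idx].transpose(1, 0, 2, 3)  # (C, F, H, W) -> (F, C, H, W)
        frames = np.clip(frames / 2 + 0.5, 0.0, 1.0)     # denormalize to [0, 1]
        frames = frames.transpose(0, 2, 3, 1)            # (F, C, H, W) -> (F, H, W, C)
        outputs.append(frames)
    return np.stack(outputs)


# Two decoded videos, 5 frames each: channel 0 at +1, the other channels at -1.
decoded = -np.ones((2, 3, 5, 8, 8), dtype=np.float32)
decoded[:, 0] = 1.0
result = from_bcfhw(decoded)
print(result.shape)
```

Clipping after denormalization guards against decoder outputs that fall slightly outside [-1, 1].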

Usage

Input preparation is typically handled automatically by the pipeline's __call__ method. Direct use of VideoProcessor is needed for:

  1. Image-to-video pipelines where a reference image must be preprocessed to match the VAE input format
  2. Custom pipelines where video frames are manipulated between steps
  3. Video-to-video workflows where an input video needs to be encoded to latent space

Related Pages

Implementation:Huggingface_Diffusers_VideoProcessor
