Principle:Huggingface Diffusers Video Input Preparation

Property	Value
Principle Name	Video Input Preparation
Overview	Preprocessing video frames and reference images for video generation pipelines, including normalization, resizing, and format conversion
Domains	Video Generation, Image Processing
Related Implementation	Huggingface_Diffusers_VideoProcessor
Knowledge Sources	Repo (https://github.com/huggingface/diffusers), Source (`src/diffusers/video_processor.py:L27-L160`)
Last Updated	2026-02-13 00:00 GMT

Description

Video generation pipelines require inputs in a specific tensor format that matches the VAE's expected input shape. The VideoProcessor class handles the bidirectional conversion between user-friendly formats (PIL images, numpy arrays) and the internal (B, C, F, H, W) tensor format used by the pipeline's VAE and transformer.

The two core operations are:

Preprocessing (preprocess_video) - Converts diverse input formats to a normalized torch.Tensor with shape (batch, channels, frames, height, width) and values in [-1, 1].
Postprocessing (postprocess_video) - Converts the decoded video tensor back to user-requested formats (numpy, PIL, or torch).

Theoretical Basis

Input Format Normalization

The VideoProcessor accepts a wide variety of input types:

List of PIL Images - Each image is one frame; the list is one video
List of lists of PIL Images - Each inner list is one video; the outer list is the batch
4D Torch tensors - Shape (F, C, H, W), one video
4D NumPy arrays - Shape (F, H, W, C), one video
5D tensors/arrays - Shape (B, F, C, H, W) for Torch tensors or (B, F, H, W, C) for NumPy arrays, batched videos

All inputs are unified to 5D tensors with shape (B, C, F, H, W) through:

Detecting and handling batch vs. single video inputs
Passing each video through the parent class VaeImageProcessor.preprocess(), which handles per-frame resizing (with cropping only if the resize mode requests it) and normalization to [-1, 1], then stacking the results into a (B, F, C, H, W) tensor
Permuting dimensions: after stacking, the channels are moved before the frame dimension with video.permute(0, 2, 1, 3, 4)
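
The layout changes in these steps can be illustrated with plain NumPy. The helper to_bcfhw below is hypothetical and deliberately simplified: it assumes channels-last float arrays already scaled to [0, 1] and skips resizing, so it is a stand-in for the idea rather than the library's implementation.

```python
import numpy as np

def to_bcfhw(video: np.ndarray) -> np.ndarray:
    """Hypothetical helper: unify a channels-last video to (B, C, F, H, W) in [-1, 1]."""
    if video.ndim == 4:
        # (F, H, W, C) is a single video, so add a batch axis.
        video = video[None]
    elif video.ndim != 5:
        raise ValueError(f"expected a 4D or 5D array, got {video.ndim}D")
    # (B, F, H, W, C) -> (B, F, C, H, W): channels-first frames.
    video = video.transpose(0, 1, 4, 2, 3)
    # Map values from [0, 1] to [-1, 1].
    video = 2.0 * video - 1.0
    # (B, F, C, H, W) -> (B, C, F, H, W): channels before frames.
    return video.transpose(0, 2, 1, 3, 4)

single = np.random.rand(5, 32, 48, 3).astype(np.float32)
batch = np.random.rand(2, 5, 32, 48, 3).astype(np.float32)
print(to_bcfhw(single).shape)  # (1, 3, 5, 32, 48)
print(to_bcfhw(batch).shape)   # (2, 3, 5, 32, 48)
```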

VAE-Compatible Resizing

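The VideoProcessor is constructed with the vae_scale_factor that the pipeline derives from its VAE configuration. Spatial dimensions must be divisible by this scale factor (typically 8); when height and width are not supplied, the processor rounds the input size down to the nearest multiple. PIL frames are resized with the resample filter configured on the processor (Lanczos by default in VaeImageProcessor), and center cropping applies only when a cropping resize mode is selected. The pipelines themselves enforce additional constraints: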

Wan: Height and width must be divisible by 16, and further by vae_scale_factor_spatial * patch_size
HunyuanVideo: Height and width must be divisible by 16
CogVideoX: Height and width must be divisible by 8
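
A divisibility check of this kind can be sketched in a few lines. The helper names check_resolution and round_down_to_multiple are hypothetical, and the multiples passed in simply mirror the values listed above; the actual pipelines perform their own validation.

```python
def check_resolution(height: int, width: int, multiple: int) -> None:
    """Hypothetical validator: raise if either dimension is not a multiple."""
    if height % multiple != 0 or width % multiple != 0:
        raise ValueError(
            f"height and width must be divisible by {multiple}, got {height}x{width}"
        )

def round_down_to_multiple(value: int, multiple: int) -> int:
    """Snap a dimension down to the nearest valid multiple."""
    return value - value % multiple

check_resolution(480, 832, 16)           # passes silently
print(round_down_to_multiple(723, 16))   # 720
```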

Temporal Consistency

Frame count must satisfy temporal compression requirements:

num_latent_frames = (num_frames - 1) // vae_scale_factor_temporal + 1
For Wan: num_frames must satisfy (num_frames - 1) % 4 == 0 (e.g., 81 frames -> 21 latent frames)
If the constraint is not met, the Wan pipelines log a warning and adjust num_frames to a nearby valid frame count instead of raising an error
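
The frame arithmetic can be worked through directly. In this sketch the temporal factor of 4 comes from the Wan example above; snap_down is a hypothetical helper showing one possible adjustment rule (rounding down to a valid count) and is not claimed to match any pipeline's exact behavior.

```python
def num_latent_frames(num_frames: int, temporal_factor: int) -> int:
    """Latent frame count after temporal compression."""
    return (num_frames - 1) // temporal_factor + 1

def is_valid_num_frames(num_frames: int, temporal_factor: int = 4) -> bool:
    """True when (num_frames - 1) is a multiple of the temporal factor."""
    return (num_frames - 1) % temporal_factor == 0

def snap_down(num_frames: int, temporal_factor: int = 4) -> int:
    """Hypothetical rule: largest valid frame count not exceeding num_frames."""
    return max(1, (num_frames - 1) // temporal_factor * temporal_factor + 1)

print(num_latent_frames(81, 4))   # 21
print(is_valid_num_frames(83))    # False
print(snap_down(83))              # 81
```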

Postprocessing

After VAE decoding, the output tensor has shape (B, C, F, H, W) with values in [-1, 1]. Postprocessing:

Permutes each batch element to (F, C, H, W)
Delegates to VaeImageProcessor.postprocess() for per-frame denormalization and format conversion
Returns a stacked numpy array ("np", the default), nested lists of PIL images ("pil"), or a stacked torch tensor ("pt"), depending on output_type
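
The postprocessing steps can be mimicked with NumPy for the "np" output case. The function to_frames_last is a hypothetical stand-in, assuming the usual x / 2 + 0.5 denormalization with clamping to [0, 1]; the random input merely simulates a decoded tensor that slightly overshoots the nominal range.

```python
import numpy as np

def to_frames_last(video: np.ndarray) -> np.ndarray:
    """Hypothetical helper: (B, C, F, H, W) in [-1, 1] -> (B, F, H, W, C) in [0, 1]."""
    outputs = []
    for clip in video:                        # clip: (C, F, H, W)
        clip = clip.transpose(1, 0, 2, 3)     # (F, C, H, W)
        clip = np.clip(clip / 2.0 + 0.5, 0.0, 1.0)   # denormalize and clamp
        outputs.append(clip.transpose(0, 2, 3, 1))   # (F, H, W, C)
    return np.stack(outputs)

# Simulated decoder output; values may fall slightly outside [-1, 1].
decoded = np.random.uniform(-1.2, 1.2, size=(2, 3, 5, 32, 48)).astype(np.float32)
out = to_frames_last(decoded)
print(out.shape)  # (2, 5, 32, 48, 3)
```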

Usage

Input preparation is typically handled automatically by the pipeline's __call__ method. Direct use of VideoProcessor is needed for:

Image-to-video pipelines where a reference image must be preprocessed to match the VAE input format
Custom pipelines where video frames are manipulated between steps
Video-to-video workflows where an input video needs to be encoded to latent space

Related Pages

Huggingface_Diffusers_VideoProcessor (implements this principle) - Concrete VideoProcessor API
Huggingface_Diffusers_Video_Pipeline_Selection (prerequisite) - Pipeline determines the VAE scale factor
Huggingface_Diffusers_Video_Denoising (next step) - Denoising operates on the prepared latent tensors
Huggingface_Diffusers_Video_Decoding_Export (uses postprocessing) - Postprocessing converts decoded tensors to export format
