Principle:Huggingface Diffusers Video Input Preparation
| Property | Value |
|---|---|
| Principle Name | Video Input Preparation |
| Overview | Preprocessing video frames and reference images for video generation pipelines, including normalization, resizing, and format conversion |
| Domains | Video Generation, Image Processing |
| Related Implementation | Huggingface_Diffusers_VideoProcessor |
| Knowledge Sources | Repo (https://github.com/huggingface/diffusers), Source (src/diffusers/video_processor.py:L27-L160) |
| Last Updated | 2026-02-13 00:00 GMT |
Description
Video generation pipelines require inputs in a specific tensor format that matches the VAE's expected input shape. The VideoProcessor class handles the bidirectional conversion between user-friendly formats (PIL images, numpy arrays) and the internal (B, C, F, H, W) tensor format used by the pipeline's VAE and transformer.
The two core operations are:
- Preprocessing (preprocess_video) - Converts diverse input formats to a normalized torch.Tensor with shape (batch, channels, frames, height, width) and values in [-1, 1].
- Postprocessing (postprocess_video) - Converts the decoded video tensor back to user-requested formats (numpy, PIL, or torch).
Theoretical Basis
Input Format Normalization
The VideoProcessor accepts a wide variety of input types:
- List of PIL Images - Each image is one frame; the list is one video
- List of lists of PIL Images - Multiple videos as batch
- 4D Torch tensors - Shape (F, C, H, W), one video
- 4D NumPy arrays - Shape (F, H, W, C), one video
- 5D tensors/arrays - Shape (B, F, C, H, W) or (B, F, H, W, C), batched videos
All inputs are unified to 5D tensors with shape (B, C, F, H, W) through:
- Detecting and handling batch vs. single video inputs
- Stacking frames via the parent class method VaeImageProcessor.preprocess(), which handles per-frame resizing, center cropping, and normalization to [-1, 1]
- Permuting dimensions: after stacking, the channels are moved before the frame dimension with video.permute(0, 2, 1, 3, 4)
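The unification steps above can be sketched with plain NumPy. This is a minimal illustration of the layout and value-range changes only, not the library's implementation: the helper name `frames_to_bcfhw` is hypothetical, resizing and cropping are omitted, and the input is assumed to be a single video of `uint8` frames in (F, H, W, C) layout.

```python
import numpy as np


def frames_to_bcfhw(frames: np.ndarray) -> np.ndarray:
    """Convert one video of uint8 frames (F, H, W, C) to float32 (1, C, F, H, W) in [-1, 1].

    Hypothetical helper for illustration; resizing and cropping are left out.
    """
    if frames.ndim != 4:
        raise ValueError(f"expected a 4D array (F, H, W, C), got {frames.ndim}D")
    # Scale pixel values to [0, 1], then shift them to [-1, 1].
    scaled = frames.astype(np.float32) / 255.0
    normalized = 2.0 * scaled - 1.0
    # Per-frame layout change: (F, H, W, C) -> (F, C, H, W).
    per_frame = normalized.transpose(0, 3, 1, 2)
    # Add the batch dimension: (1, F, C, H, W).
    batched = per_frame[np.newaxis, ...]
    # Move channels ahead of frames: (B, F, C, H, W) -> (B, C, F, H, W).
    return batched.transpose(0, 2, 1, 3, 4)


video = np.random.randint(0, 256, size=(5, 32, 48, 3), dtype=np.uint8)
tensor = frames_to_bcfhw(video)
print(tensor.shape)
```

The final transpose uses the same axis order as the permute call in the last step, so a five-frame RGB clip of size 32x48 ends up with channels in position 1 and frames in position 2.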
VAE-Compatible Resizing
The VideoProcessor is constructed with a vae_scale_factor taken from the pipeline's VAE configuration. All spatial dimensions must be divisible by this scale factor (typically 8). The processor resizes frames with the resampling filter set by its resample option (lanczos by default in the parent VaeImageProcessor), followed by center cropping to ensure exact dimension compliance. The pipeline itself enforces additional constraints:
- Wan: Height and width must be divisible by 16, and further by vae_scale_factor_spatial * patch_size
- HunyuanVideo: Height and width must be divisible by 16
- CogVideoX: Height and width must be divisible by 8
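The divisibility constraints above can be checked before calling a pipeline. The sketch below is illustrative only: the `REQUIRED_MULTIPLE` table and both helper functions are hypothetical names, the values are copied from the list above, and the additional Wan constraint involving vae_scale_factor_spatial * patch_size is not modeled.

```python
# Height/width multiples taken from the constraints listed above.
# The table and helper names are hypothetical, for illustration only.
REQUIRED_MULTIPLE = {"Wan": 16, "HunyuanVideo": 16, "CogVideoX": 8}


def is_valid_size(height: int, width: int, multiple: int) -> bool:
    """Return True when both dimensions are divisible by `multiple`."""
    return height % multiple == 0 and width % multiple == 0


def round_down_to_multiple(value: int, multiple: int) -> int:
    """Largest multiple of `multiple` that does not exceed `value`."""
    return value - value % multiple


height, width = 725, 1290
multiple = REQUIRED_MULTIPLE["Wan"]
print(is_valid_size(height, width, multiple))  # False
print(round_down_to_multiple(height, multiple),
      round_down_to_multiple(width, multiple))  # 720 1280
```

Rounding down rather than up keeps the adjusted size within the requested one, which is one reasonable choice when the source footage cannot be enlarged.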
Temporal Consistency
Frame count must satisfy temporal compression requirements:
- num_latent_frames = (num_frames - 1) // vae_scale_factor_temporal + 1
- For Wan: num_frames must satisfy (num_frames - 1) % 4 == 0 (e.g., 81 frames -> 21 latent frames)
- The pipeline automatically rounds to the nearest valid frame count if the constraint is not met
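The frame-count arithmetic above can be worked through in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, the temporal factor of 4 is the Wan value quoted above, and the rounding rule in `nearest_valid_frame_count` is one possible interpretation of "nearest" (ties go to the lower count). The rule a given pipeline applies may differ.

```python
def num_latent_frames(num_frames: int, temporal_factor: int) -> int:
    """Latent frame count from the formula above."""
    return (num_frames - 1) // temporal_factor + 1


def is_valid_frame_count(num_frames: int, temporal_factor: int) -> bool:
    """True when (num_frames - 1) is a multiple of the temporal factor."""
    return num_frames >= 1 and (num_frames - 1) % temporal_factor == 0


def nearest_valid_frame_count(num_frames: int, temporal_factor: int) -> int:
    """Closest n >= 1 with (n - 1) % temporal_factor == 0.

    Hypothetical helper; ties are resolved toward the lower count.
    """
    lower = (num_frames - 1) // temporal_factor * temporal_factor + 1
    upper = lower + temporal_factor
    nearest = lower if num_frames - lower <= upper - num_frames else upper
    return max(nearest, 1)


print(num_latent_frames(81, 4))          # 21
print(is_valid_frame_count(80, 4))       # False
print(nearest_valid_frame_count(80, 4))  # 81
```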
Postprocessing
After VAE decoding, the output tensor has shape (B, C, F, H, W) with values in [-1, 1]. Postprocessing:
- Permutes each batch element to (F, C, H, W)
- Delegates to VaeImageProcessor.postprocess() for per-frame denormalization and format conversion
- Returns numpy arrays (default), PIL images, or torch tensors based on output_type
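The postprocessing steps above can be mirrored in NumPy for the numpy output case. This is a minimal sketch rather than the library code: `decoded_to_frames` is a hypothetical name, the input stands in for a decoded tensor of shape (B, C, F, H, W), and denormalization is assumed to map [-1, 1] to [0, 1] with clipping.

```python
import numpy as np


def decoded_to_frames(video: np.ndarray) -> np.ndarray:
    """Convert (B, C, F, H, W) values in [-1, 1] to (B, F, H, W, C) values in [0, 1].

    Hypothetical helper that follows the listed steps for the numpy output type.
    """
    if video.ndim != 5:
        raise ValueError(f"expected a 5D array (B, C, F, H, W), got {video.ndim}D")
    outputs = []
    for batch_vid in video:  # each element has shape (C, F, H, W)
        # Step 1: put frames first, giving (F, C, H, W).
        frames = batch_vid.transpose(1, 0, 2, 3)
        # Step 2: denormalize from [-1, 1] to [0, 1], clipping any overshoot.
        frames = np.clip(frames * 0.5 + 0.5, 0.0, 1.0)
        # Step 3: channels-last layout for image-style output, (F, H, W, C).
        outputs.append(frames.transpose(0, 2, 3, 1))
    return np.stack(outputs)


decoded = np.random.uniform(-1.0, 1.0, size=(2, 3, 4, 8, 8)).astype(np.float32)
frames = decoded_to_frames(decoded)
print(frames.shape)
```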
Usage
Input preparation is typically handled automatically by the pipeline's __call__ method. Direct use of VideoProcessor is needed for:
- Image-to-video pipelines where a reference image must be preprocessed to match the VAE input format
- Custom pipelines where video frames are manipulated between steps
- Video-to-video workflows where an input video needs to be encoded to latent space
Related Pages
- Huggingface_Diffusers_VideoProcessor (implements this principle) - Concrete VideoProcessor API
- Huggingface_Diffusers_Video_Pipeline_Selection (prerequisite) - Pipeline determines the VAE scale factor
- Huggingface_Diffusers_Video_Denoising (next step) - Denoising operates on the prepared latent tensors
- Huggingface_Diffusers_Video_Decoding_Export (uses postprocessing) - Postprocessing converts decoded tensors to export format