Implementation:Huggingface Diffusers VideoProcessor

Field	Value
Type	API Doc
Overview	VideoProcessor API for preprocessing input video frames and postprocessing decoded video tensors
Domains	Video Generation, Image Processing
Workflow	Video_Generation
Related Principle	Huggingface_Diffusers_Video_Input_Preparation
Source	`src/diffusers/video_processor.py:L27-L176`
Last Updated	2026-02-13 00:00 GMT

Code Reference

VideoProcessor Class

Source: src/diffusers/video_processor.py:L25-L176

class VideoProcessor(VaeImageProcessor):
    """Simple video processor."""

    def preprocess_video(self, video, height: int | None = None, width: int | None = None) -> torch.Tensor:
        """Preprocesses input video(s)."""
        # Handle deprecated 5D list inputs
        if isinstance(video, list) and isinstance(video[0], np.ndarray) and video[0].ndim == 5:
            video = np.concatenate(video, axis=0)
        if isinstance(video, list) and isinstance(video[0], torch.Tensor) and video[0].ndim == 5:
            video = torch.cat(video, axis=0)

        # Normalize to list of videos
        if isinstance(video, (np.ndarray, torch.Tensor)) and video.ndim == 5:
            video = list(video)
        elif (isinstance(video, list) and is_valid_image(video[0])) or is_valid_image_imagelist(video):
            video = [video]  # single video: wrap it as a batch of one
        elif isinstance(video, list) and is_valid_image_imagelist(video[0]):
            video = video  # already a batch of videos; nothing to change

        # Preprocess each video and stack
        video = torch.stack([self.preprocess(img, height=height, width=width) for img in video], dim=0)
        video = video.permute(0, 2, 1, 3, 4)  # (B, C, F, H, W)
        return video

    def postprocess_video(self, video: torch.Tensor, output_type: str = "np"):
        """Converts a video tensor to a list of frames for export."""
        batch_size = video.shape[0]
        outputs = []
        for batch_idx in range(batch_size):
            batch_vid = video[batch_idx].permute(1, 0, 2, 3)  # (F, C, H, W)
            batch_output = self.postprocess(batch_vid, output_type)
            outputs.append(batch_output)

        if output_type == "np":
            outputs = np.stack(outputs)
        elif output_type == "pt":
            outputs = torch.stack(outputs)
        return outputs
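
The axis reordering in the two methods above can be traced without building any tensors. The following is a minimal sketch in plain Python; `permute_shape` is a hypothetical helper (not part of diffusers) that mimics how `torch.Tensor.permute` rearranges a shape tuple, and the sizes are taken from the usage examples on this page.

```python
def permute_shape(shape, order):
    """Return the shape produced by reordering axes, as Tensor.permute would."""
    if sorted(order) != list(range(len(shape))):
        raise ValueError("order must be a permutation of the axis indices")
    return tuple(shape[axis] for axis in order)


# preprocess_video: stacking per-video (F, C, H, W) tensors gives (B, F, C, H, W);
# permute(0, 2, 1, 3, 4) then moves channels ahead of frames.
stacked = (1, 81, 3, 480, 832)
model_input = permute_shape(stacked, (0, 2, 1, 3, 4))  # (B, C, F, H, W)

# postprocess_video: each video is (C, F, H, W); permute(1, 0, 2, 3) puts
# frames first so every frame can be handled like an ordinary image.
single_video = model_input[1:]
frames_first = permute_shape(single_video, (1, 0, 2, 3))  # (F, C, H, W)
```

The two permutations mirror each other: the first swaps the frame and channel axes of the batch, the second swaps them back for each video.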

resize_and_crop_tensor

Source: src/diffusers/video_processor.py:L134-L176

@staticmethod
def resize_and_crop_tensor(samples: torch.Tensor, new_width: int, new_height: int) -> torch.Tensor:
    """Resizes and crops a tensor of videos to the specified dimensions."""
    orig_height, orig_width = samples.shape[3], samples.shape[4]
    if orig_height != new_height or orig_width != new_width:
        ratio = max(new_height / orig_height, new_width / orig_width)
        resized_width = int(orig_width * ratio)
        resized_height = int(orig_height * ratio)
        n, c, t, h, w = samples.shape
        # Fold frames into the batch axis so F.interpolate (torch.nn.functional) receives 4D input
        samples = samples.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
        samples = F.interpolate(samples, size=(resized_height, resized_width), mode="bilinear", align_corners=False)
        # Center crop
        start_x = (resized_width - new_width) // 2
        start_y = (resized_height - new_height) // 2
        samples = samples[:, :, start_y:start_y + new_height, start_x:start_x + new_width]
        samples = samples.reshape(n, t, c, new_height, new_width).permute(0, 2, 1, 3, 4)
    return samples
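
The size arithmetic in `resize_and_crop_tensor` can be checked separately from the tensor operations. The sketch below reproduces only that arithmetic in plain Python; `resize_and_crop_geometry` is a hypothetical helper name used for illustration and is not part of diffusers.

```python
def resize_and_crop_geometry(orig_height, orig_width, new_height, new_width):
    """Return (resized_height, resized_width, start_y, start_x).

    Mirrors the excerpt above: scale by the larger of the two ratios so the
    resized frame covers the target, then center-crop the excess.
    """
    if orig_height == new_height and orig_width == new_width:
        # Nothing to do: the original method returns the input unchanged.
        return orig_height, orig_width, 0, 0
    ratio = max(new_height / orig_height, new_width / orig_width)
    resized_height = int(orig_height * ratio)
    resized_width = int(orig_width * ratio)
    start_y = (resized_height - new_height) // 2
    start_x = (resized_width - new_width) // 2
    return resized_height, resized_width, start_y, start_x


# A 100x200 (H x W) frame cropped to a 50x50 square.
geometry = resize_and_crop_geometry(100, 200, 50, 50)
unchanged = resize_and_crop_geometry(480, 832, 480, 832)
```

Choosing the larger ratio means the aspect ratio is preserved and the crop only trims the axis that overshoots the target.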

Import

from diffusers.video_processor import VideoProcessor

Key Parameters

Parameter	Description	Default
`vae_scale_factor`	Spatial scale factor from VAE config; inherited from `VaeImageProcessor`	`8`
`do_resize`	Whether to resize input frames	`True`
`do_normalize`	Whether to normalize pixel values to [-1, 1]	`True`
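
As an illustration of how `vae_scale_factor` can interact with resizing, the sketch below rounds a requested size down to a multiple of the scale factor. This is an assumption about the inherited `VaeImageProcessor` behavior rather than something stated on this page, and `round_down_to_multiple` is a hypothetical helper, so verify against the installed diffusers version before relying on it.

```python
def round_down_to_multiple(size, vae_scale_factor=8):
    """Round a pixel dimension down to the nearest multiple of vae_scale_factor.

    Assumption: the processor keeps spatial sizes divisible by the VAE's
    downsampling factor so the latent grid has integer dimensions.
    """
    if vae_scale_factor <= 0:
        raise ValueError("vae_scale_factor must be positive")
    return size - size % vae_scale_factor


requested = (481, 835)  # (height, width) that are not multiples of 8
effective = tuple(round_down_to_multiple(value) for value in requested)
```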

I/O Contract

preprocess_video

Inputs:

video: One of:
- list[PIL.Image] - Single video as list of frames
- list[list[PIL.Image]] - Batch of videos
- torch.Tensor (4D: F,C,H,W or 5D: B,F,C,H,W)
- np.ndarray (4D: F,H,W,C or 5D: B,F,H,W,C)
height (int | None): Target height
width (int | None): Target width

Outputs:

torch.Tensor of shape (B, C, F, H, W) with values in [-1, 1]
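
The list-based input forms above differ only in nesting depth. The following sketch shows that normalization step and the resulting output shape in plain Python. Both helpers are hypothetical, strings stand in for PIL frames, and `channels=3` assumes RGB input.

```python
def as_video_batch(video, is_frame=lambda item: not isinstance(item, list)):
    """Wrap a flat list of frames into a one-video batch; pass batches through."""
    if not isinstance(video, list) or not video:
        raise ValueError("expected a non-empty list of frames or of videos")
    if all(is_frame(item) for item in video):
        return [video]  # single video -> batch of one
    if all(isinstance(item, list) and item and all(is_frame(f) for f in item)
           for item in video):
        return video  # already a batch of videos
    raise ValueError("mixed or unsupported nesting")


def expected_output_shape(batch, height, width, channels=3):
    """Predict the (B, C, F, H, W) shape; stacking needs equal frame counts."""
    frame_counts = {len(frames) for frames in batch}
    if len(frame_counts) != 1:
        raise ValueError("all videos in a batch must have the same frame count")
    return (len(batch), channels, frame_counts.pop(), height, width)


single = as_video_batch(["frame0", "frame1", "frame2"])
batch = as_video_batch([["a0", "a1"], ["b0", "b1"]])
single_shape = expected_output_shape(single, height=480, width=832)
batch_shape = expected_output_shape(batch, height=480, width=832)
```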

postprocess_video

Inputs:

video: torch.Tensor of shape (B, C, F, H, W)
output_type: "np", "pt", or "pil"

Outputs:

If "np": np.ndarray of shape (B, F, H, W, C) with values in [0, 1]
If "pt": torch.Tensor of shape (B, F, C, H, W) with values in [0, 1]
If "pil": list[list[PIL.Image]]
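
The value ranges in the two contracts ([-1, 1] going into the model, [0, 1] coming out) are related by a simple affine map. The scalar sketch below assumes the inherited `VaeImageProcessor` applies the usual `2x - 1` and `x / 2 + 0.5` mappings with clamping on the way out; the function names are illustrative and operate on single floats rather than tensors.

```python
def normalize(value):
    """Map a pixel value from [0, 1] to [-1, 1]."""
    return 2.0 * value - 1.0


def denormalize(value):
    """Map a model-space value from [-1, 1] to [0, 1], clamping overshoot."""
    return min(max(value * 0.5 + 0.5, 0.0), 1.0)


mid_gray = normalize(0.5)          # center of the pixel range
black = denormalize(-1.0)          # lower bound of model space
clipped = denormalize(3.0)         # out-of-range values are clamped
round_trip = denormalize(normalize(0.25))
```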

Usage Examples

Preprocessing a Reference Image for Image-to-Video

from diffusers.video_processor import VideoProcessor
from PIL import Image

processor = VideoProcessor(vae_scale_factor=8)
image = Image.open("reference.png").convert("RGB")

# Wrap single image as single-frame video
video_tensor = processor.preprocess_video([image], height=480, width=832)
# Shape: (1, 3, 1, 480, 832)

Postprocessing Decoded Video for Export

# After pipeline decoding:
# decoded_video shape: (1, 3, 81, 480, 832), values in [-1, 1]
frames = processor.postprocess_video(decoded_video, output_type="np")
# frames shape: (1, 81, 480, 832, 3), values in [0, 1]

# For PIL output:
pil_frames = processor.postprocess_video(decoded_video, output_type="pil")
# pil_frames: list of list of PIL.Image

Resizing and Cropping Video Tensors

# Resize a (B, C, F, H, W) tensor to target dimensions
resized = VideoProcessor.resize_and_crop_tensor(video_tensor, new_width=1280, new_height=720)
# resized shape: (1, 3, 1, 720, 1280) for the single-frame video_tensor from the first example

Related Pages

Huggingface_Diffusers_Video_Input_Preparation (principle for this implementation) - Theory of video preprocessing
Huggingface_Diffusers_Video_Pipeline_From_Pretrained (creates VideoProcessor) - Pipeline initialization creates the processor
Huggingface_Diffusers_Export_To_Video (consumes output) - Export uses postprocessed frames
