# Implementation: Zai_org_CogVideo_Get_Video_Frames
| Attribute | Value |
|---|---|
| Implementation Name | Get Video Frames |
| Workflow | Video Editing DDIM Inversion |
| Step | 1 of 6 |
| Type | API Doc |
| Source File | inference/ddim_inversion.py:L263-300 |
| Repository | zai-org/CogVideo |
| External Dependencies | decord, torchvision.transforms |
| Last Updated | 2026-02-10 00:00 GMT |
## Overview
Implementation of video loading, frame sampling, resizing, and normalization for the DDIM inversion pipeline. The get_video_frames function produces a tensor of preprocessed frames ready for VAE encoding.
## Description

The `get_video_frames` function performs the following steps:

- Loads the video file using decord's `VideoReader`
- Applies start/end frame skipping
- Samples frames to the target count using uniform stepping or automatic stride calculation
- Resizes frames to the target resolution using torchvision transforms
- Normalizes pixel values from [0, 255] to [-1, 1]

The function enforces the VAE constraint that the frame count must satisfy (F mod 4) == 1.
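The sampling and constraint-enforcement steps above can be sketched in isolation. This is a hypothetical reconstruction, not the source function: the name `sample_frame_indices`, the floor-division stride rule, and the trailing-frame trimming are assumptions made for illustration.

```python
def sample_frame_indices(total_frames, skip_frames_start=0, skip_frames_end=0,
                         max_num_frames=81, frame_sample_step=None):
    # Hypothetical sketch of the index-selection logic described above;
    # the exact stride rounding in the source may differ.
    start = skip_frames_start
    end = total_frames - skip_frames_end
    if frame_sample_step is None:
        # Automatic stride: spread samples roughly evenly across the clip
        frame_sample_step = max((end - start) // max_num_frames, 1)
    indices = list(range(start, end, frame_sample_step))[:max_num_frames]
    # Enforce the VAE constraint (F mod 4) == 1 by dropping trailing frames
    while indices and len(indices) % 4 != 1:
        indices.pop()
    return indices
```

For a 300-frame clip with the defaults, this yields 81 evenly strided indices, satisfying 81 mod 4 == 1.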
## Usage

```python
from inference.ddim_inversion import get_video_frames

video_frames = get_video_frames(
    video_path="input_video.mp4",
    width=720,
    height=480,
    max_num_frames=81,
)
# video_frames shape: [F, C, H, W] in [-1, 1]
```
## Code Reference

### Source Location

| File | Lines | Description |
|---|---|---|
| inference/ddim_inversion.py | L263-300 | `get_video_frames` function |
### Signature

```python
def get_video_frames(
    video_path: str,
    width: int = 720,
    height: int = 480,
    skip_frames_start: int = 0,
    skip_frames_end: int = 0,
    max_num_frames: int = 81,
    frame_sample_step: Optional[int] = None,
) -> torch.FloatTensor:  # [F, C, H, W] in [-1, 1]
```
### Import

```python
from inference.ddim_inversion import get_video_frames
```
## I/O Contract

### Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| `video_path` | `str` | Required | Path to the input video file |
| `width` | `int` | `720` | Target width for resizing |
| `height` | `int` | `480` | Target height for resizing |
| `skip_frames_start` | `int` | `0` | Number of frames to skip at the beginning |
| `skip_frames_end` | `int` | `0` | Number of frames to skip at the end |
| `max_num_frames` | `int` | `81` | Maximum number of frames to sample (must satisfy F mod 4 == 1) |
| `frame_sample_step` | `Optional[int]` | `None` | Explicit frame sampling step; if `None`, computed automatically |
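The (F mod 4) == 1 constraint on `max_num_frames` can be validated before calling the function. The guard below is hypothetical (it does not exist in the source); only the constraint itself is documented:

```python
def check_frame_count(max_num_frames: int) -> None:
    # Hypothetical guard, not in the source: the VAE requires the sampled
    # frame count F to satisfy (F mod 4) == 1 (e.g. 49, 81).
    if max_num_frames % 4 != 1:
        nearest = max_num_frames - (max_num_frames - 1) % 4
        raise ValueError(
            f"max_num_frames={max_num_frames} violates (F mod 4) == 1; "
            f"nearest valid value below is {nearest}"
        )
```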
### Outputs

| Output | Type | Description |
|---|---|---|
| Return value | `torch.FloatTensor` | Video frames tensor of shape [F, C, H, W] with values in [-1, 1] |
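The [-1, 1] output range comes from the normalization step in the description. A minimal sketch of that mapping, assuming the common scale-and-shift form (the source may implement it via a transform instead):

```python
import torch

# uint8 frames in [0, 255], shaped [F, C, H, W] (small synthetic example)
frames_uint8 = torch.randint(0, 256, (5, 3, 8, 8), dtype=torch.uint8)

# Scale-and-shift normalization: 0 maps to -1.0, 255 maps to 1.0
frames = frames_uint8.float() / 127.5 - 1.0
```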
## Usage Examples

### Example 1: Default loading

```python
from inference.ddim_inversion import get_video_frames

frames = get_video_frames("input.mp4")
# frames.shape: [81, 3, 480, 720]
# frames.dtype: torch.float32
# frames.min() >= -1.0, frames.max() <= 1.0
```
### Example 2: Custom resolution and frame count

```python
frames = get_video_frames(
    "input.mp4",
    width=1360,
    height=768,
    max_num_frames=49,
    skip_frames_start=10,
    skip_frames_end=5,
)
# frames.shape: [49, 3, 768, 1360]
```
### Example 3: Explicit frame sampling step

```python
frames = get_video_frames(
    "input.mp4",
    frame_sample_step=3,  # Take every 3rd frame
    max_num_frames=25,
)
```
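The per-frame resize described earlier can be sketched with core torch ops. The source uses torchvision transforms; `torch.nn.functional.interpolate` is shown here as a stand-in, so the exact interpolation mode and output values may differ from the real function:

```python
import torch
import torch.nn.functional as F

# A small batch of frames [F, C, H, W] at a 1080p source resolution
frames = torch.rand(2, 3, 1080, 1920)

# Resize every frame to the default target (height, width) = (480, 720)
resized = F.interpolate(frames, size=(480, 720), mode="bilinear",
                        align_corners=False)
```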
## Related Pages
- Principle:Zai_org_CogVideo_Video_Loading_and_Preprocessing -- Principle governing video loading and preprocessing
- Environment:Zai_org_CogVideo_Diffusers_Inference_Environment
- Zai_org_CogVideo_Encode_Video_Frames -- Next step: encoding frames to latent space
- Zai_org_CogVideo_DDIM_CogVideoXPipeline_From_Pretrained -- Pipeline providing the VAE for subsequent encoding