Implementation:Zai org CogVideo Caption Load Video

Attribute	Value
Implementation Name	Caption Load Video
Workflow	Video Captioning
Step	3 of 5
Type	API Doc
Source File	`tools/caption/video_caption.py:L25-57`
Repository	zai-org/CogVideo
External Dependencies	decord, numpy
Last Updated	2026-02-10 00:00 GMT

Overview

This page documents `load_video`, the frame-extraction step of the video captioning pipeline. The function decodes raw video bytes and returns up to 24 frames, selected by one of two temporal sampling strategies (`"chat"` or `"base"`).

Description

The load_video function performs the following steps:

1. Creates a decord VideoReader from the raw video bytes.
2. Determines the total number of frames in the video.
3. Selects frame indices based on the chosen strategy:
   - Chat mode: Computes 1-FPS sampling indices, up to 24 frames
   - Base mode: Computes uniform sampling indices for exactly 24 frames
4. Extracts the selected frames using decord's batch frame access.
5. Returns the frames as a tensor in [C, T, H, W] format.
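The index-selection step above can be illustrated without decoding any video. The following is a minimal sketch in plain Python, not the code from `video_caption.py`: the helper names `chat_indices` and `base_indices` are hypothetical, and the exact rounding and end-of-video handling in the repository may differ.

```python
def chat_indices(timestamps, max_frames=24):
    """Pick the frame nearest to each whole second, up to max_frames.

    timestamps: start time in seconds of every frame, in ascending order
    (an assumed input; the real function derives timing from the reader).
    """
    indices = []
    for second in range(max_frames):
        # Stop once the requested second lies past the last frame.
        if not timestamps or second > timestamps[-1]:
            break
        nearest = min(range(len(timestamps)),
                      key=lambda i: abs(timestamps[i] - second))
        indices.append(nearest)
    return indices


def base_indices(total_frames, num_frames=24):
    """Spread num_frames indices evenly over [0, total_frames - 1]."""
    if total_frames <= 0:
        return []
    if num_frames == 1:
        return [0]
    # Integer arithmetic keeps the last index exactly at total_frames - 1.
    return [i * (total_frames - 1) // (num_frames - 1)
            for i in range(num_frames)]
```

For a 10-second clip at 2 frames per second, `chat_indices` yields 10 indices (one per second), while `base_indices` always yields 24 indices and repeats frames when the clip has fewer than 24.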

The function accepts raw video bytes rather than a file path, enabling in-memory video processing without intermediate file I/O.
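One way to read video from memory is to wrap the bytes in a file-like object, which decord's `VideoReader` accepts in place of a path. The sketch below assumes that approach; `as_stream` and `open_reader` are hypothetical helper names, and decord is imported lazily so that the byte-wrapping part can be used on its own.

```python
import io


def as_stream(video_data):
    """Wrap raw video bytes in a seekable in-memory stream."""
    if not isinstance(video_data, (bytes, bytearray)):
        raise TypeError("video_data must be bytes, not a file path")
    return io.BytesIO(video_data)


def open_reader(video_data):
    """Open a decord VideoReader on in-memory bytes (requires decord)."""
    # Imported here so the module loads even where decord is not installed.
    from decord import VideoReader, cpu
    return VideoReader(as_stream(video_data), ctx=cpu(0))
```

Because the input is plain bytes, the same call works for data read from disk, downloaded over the network, or received in a request body.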

Usage

from tools.caption.video_caption import load_video

with open("video.mp4", "rb") as f:
    video_data = f.read()

frames = load_video(video_data, strategy="chat")
# frames.shape: [3, T, H, W], with T <= 24 in chat mode (T == 24 only if the video is long enough)

Code Reference

Source Location

File	Lines	Description
`tools/caption/video_caption.py`	L25-57	`load_video` function

Signature

def load_video(
    video_data: bytes,
    strategy: str = "chat"  # "chat" or "base"
) -> torch.Tensor:  # [C, T, H, W]

Import

from tools.caption.video_caption import load_video

I/O Contract

Inputs

Parameter	Type	Default	Description
`video_data`	`bytes`	Required	Raw video file bytes (e.g., from `open("video.mp4", "rb").read()`)
`strategy`	`str`	`"chat"`	Sampling strategy: `"chat"` (1 FPS up to 24 frames) or `"base"` (uniform 24 frames)

Outputs

Output	Type	Description
Return value	`torch.Tensor`	Video frames tensor of shape `[C, T, H, W]` where `C=3` (RGB), `T<=24`, and `H, W` are the original video dimensions
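A caller may want to confirm the documented output contract before passing frames on. The helper below is a hypothetical sanity check, not part of the repository; it works on a plain shape tuple, so it can be called as `check_frames_shape(tuple(frames.shape))`.

```python
def check_frames_shape(shape, max_frames=24):
    """Validate a [C, T, H, W] shape and return T, the number of frames."""
    if len(shape) != 4:
        raise ValueError(f"expected 4 dimensions [C, T, H, W], got {len(shape)}")
    channels, num_frames, height, width = (int(d) for d in shape)
    if channels != 3:
        raise ValueError(f"expected 3 RGB channels first, got {channels}")
    if not 1 <= num_frames <= max_frames:
        raise ValueError(f"expected 1..{max_frames} frames, got {num_frames}")
    if height <= 0 or width <= 0:
        raise ValueError("frame height and width must be positive")
    return num_frames
```

A channels-last shape such as `(10, 480, 720, 3)` fails the check, which catches a missing permute early.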

Usage Examples

Example 1: Chat mode (1 FPS sampling)

from tools.caption.video_caption import load_video

with open("natural_scene.mp4", "rb") as f:
    video_data = f.read()

frames = load_video(video_data, strategy="chat")
print(f"Extracted {frames.shape[1]} frames")
# For a 30-second video: 24 frames (capped)
# For a 10-second video: 10 frames

Example 2: Base mode (uniform sampling)

# Continues from Example 1: video_data holds the raw bytes read there
frames = load_video(video_data, strategy="base")
print(f"Extracted {frames.shape[1]} frames")
# Always 24 frames regardless of video length

Example 3: Integration with prediction

from tools.caption.video_caption import load_video, predict

with open("input_video.mp4", "rb") as f:
    video_data = f.read()

# load_video is called internally by predict()
caption = predict(
    prompt="Please describe this video in detail.",
    video_data=video_data,
    temperature=0.1
)
print(caption)

Related Pages

Principle:Zai_org_CogVideo_Video_Frame_Extraction -- Principle governing video frame extraction
Environment:Zai_org_CogVideo_Video_Captioning_Environment
Heuristic:Zai_org_CogVideo_Decord_Import_Order_Bug
Zai_org_CogVideo_CogVLM2_Model_Loading -- Previous step: model loading
Zai_org_CogVideo_CogVLM2_Predict -- Next step: caption generation using extracted frames
