Implementation:Zai org CogVideo Caption Load Video

From Leeroopedia


| Attribute | Value |
|---|---|
| Implementation Name | Caption Load Video |
| Workflow | Video Captioning |
| Step | 3 of 5 |
| Type | API Doc |
| Source File | tools/caption/video_caption.py:L25-57 |
| Repository | zai-org/CogVideo |
| External Dependencies | decord, numpy |
| Last Updated | 2026-02-10 00:00 GMT |

Overview

Implementation of video frame extraction for the captioning pipeline. The load_video function extracts up to 24 representative frames from raw video bytes, using one of two temporal sampling strategies ("chat" or "base").

Description

The load_video function:

  1. Creates a decord VideoReader from the raw video bytes
  2. Determines the total number of frames in the video
  3. Selects frame indices based on the chosen strategy:
    • Chat mode: Computes 1-FPS sampling indices up to 24 frames
    • Base mode: Computes uniform sampling indices for exactly 24 frames
  4. Extracts the selected frames using decord's batch frame access
  5. Returns the frames as a tensor in [C, T, H, W] format
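
The index-selection step (step 3) can be sketched in plain Python. This is a minimal illustration, not the code from video_caption.py: the helper names uniform_indices and one_fps_indices are hypothetical, and the sketch assumes that base mode spreads its indices evenly from the first to the last frame and that chat mode picks the frame whose timestamp is closest to each whole second.

```python
from typing import List


def uniform_indices(total_frames: int, num_frames: int = 24) -> List[int]:
    """Base-style sampling (assumed): num_frames indices spread evenly over the clip."""
    if total_frames <= 0:
        raise ValueError("video has no frames")
    if num_frames == 1:
        return [0]
    # Integer arithmetic keeps the first index at 0 and the last at total_frames - 1.
    return [(i * (total_frames - 1)) // (num_frames - 1) for i in range(num_frames)]


def one_fps_indices(timestamps: List[float], max_frames: int = 24) -> List[int]:
    """Chat-style sampling (assumed): for each whole second, the frame closest to it."""
    if not timestamps:
        raise ValueError("video has no frames")
    last_second = int(timestamps[-1])  # last whole second covered by the clip
    indices: List[int] = []
    for second in range(last_second + 1):
        nearest = min(range(len(timestamps)), key=lambda i: abs(timestamps[i] - second))
        indices.append(nearest)
        if len(indices) >= max_frames:  # cap at max_frames, as in chat mode
            break
    return indices


# Hypothetical clips at 30 frames per second.
ten_second_clip = [i / 30 for i in range(300)]
thirty_second_clip = [i / 30 for i in range(900)]

chat_short = one_fps_indices(ten_second_clip)    # one index per second
chat_long = one_fps_indices(thirty_second_clip)  # capped at 24 indices
base_long = uniform_indices(900)                 # 24 indices from 0 to 899
```

Under these assumptions, a clip with fewer than 24 frames still yields 24 indices in base mode, because neighbouring indices repeat.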

The function accepts raw video bytes rather than a file path, enabling in-memory video processing without intermediate file I/O.
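
One way to hand raw bytes to a decoder without writing a temporary file is to wrap them in an in-memory stream. The sketch below uses only the standard library; as_stream is a hypothetical helper, and the placeholder bytes are not a decodable video. It assumes the decoder accepts a file-like object in place of a path.

```python
import io


def as_stream(video_data: bytes) -> io.BytesIO:
    """Wrap raw bytes in a seekable, file-like object (hypothetical helper)."""
    if not isinstance(video_data, (bytes, bytearray)):
        raise TypeError("video_data must be bytes, not a file path")
    return io.BytesIO(video_data)


# Placeholder bytes shaped like the start of an MP4 file; not a real video.
stream = as_stream(b"\x00\x00\x00\x18ftypmp42")
header = stream.read(8)  # peek at the first eight bytes
stream.seek(0)           # rewind so a decoder would start from the first byte
```

The same pattern works for bytes fetched over a network or read from an archive, since nothing touches the file system.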

Usage

from tools.caption.video_caption import load_video

with open("video.mp4", "rb") as f:
    video_data = f.read()

frames = load_video(video_data, strategy="chat")
# frames.shape: [3, T, H, W], where T <= 24 in chat mode

Code Reference

Source Location

| File | Lines | Description |
|---|---|---|
| tools/caption/video_caption.py | L25-57 | load_video function |

Signature

def load_video(
    video_data: bytes,
    strategy: str = "chat"  # "chat" or "base"
) -> torch.Tensor:  # [C, T, H, W]

Import

from tools.caption.video_caption import load_video

I/O Contract

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| video_data | bytes | Required | Raw video file bytes (e.g., from open("video.mp4", "rb").read()) |
| strategy | str | "chat" | Sampling strategy: "chat" (1 FPS up to 24 frames) or "base" (uniform 24 frames) |
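
Because strategy is a free-form string, a caller may want to reject unknown values before any decoding starts. The guard below is a hypothetical wrapper-side check, not part of load_video; how the real function reacts to other values is not documented here.

```python
VALID_STRATEGIES = ("chat", "base")  # the two values listed in the inputs table


def check_strategy(strategy: str) -> str:
    """Return strategy unchanged, or raise if it is not a documented value."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"unknown strategy {strategy!r}; expected one of {VALID_STRATEGIES}"
        )
    return strategy


chosen = check_strategy("base")
```

A caller could then pass the checked value on, for example load_video(video_data, strategy=check_strategy(user_choice)), where user_choice is whatever the caller received.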

Outputs

| Output | Type | Description |
|---|---|---|
| Return value | torch.Tensor | Video frames tensor of shape [C, T, H, W] where C=3 (RGB), T<=24, and H, W are the original video dimensions |
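
The [C, T, H, W] layout is easiest to see as an axis reordering. The sketch below works on shapes only and assumes the decoder delivers frames as [T, H, W, C] before the channel axis is moved to the front; permuted_shape is a hypothetical helper, and the 480x720 resolution is an arbitrary example.

```python
from typing import Sequence, Tuple


def permuted_shape(shape: Sequence[int], order: Sequence[int]) -> Tuple[int, ...]:
    """Shape after reordering axes: output axis i takes input axis order[i]."""
    if sorted(order) != list(range(len(shape))):
        raise ValueError("order must be a permutation of the axes")
    return tuple(shape[axis] for axis in order)


# Assumed decoder output: 24 frames of 480x720 RGB, frame-major [T, H, W, C].
decoded = (24, 480, 720, 3)
channel_first = permuted_shape(decoded, (3, 0, 1, 2))  # [C, T, H, W]
```

With this ordering, frames.shape[1] gives the number of sampled frames, as the usage examples rely on.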

Usage Examples

Example 1: Chat mode (1 FPS sampling)

from tools.caption.video_caption import load_video

with open("natural_scene.mp4", "rb") as f:
    video_data = f.read()

frames = load_video(video_data, strategy="chat")
print(f"Extracted {frames.shape[1]} frames")
# For a 30-second video: 24 frames (capped)
# For a 10-second video: 10 frames

Example 2: Base mode (uniform sampling)

# Reuses load_video and video_data from Example 1
frames = load_video(video_data, strategy="base")
print(f"Extracted {frames.shape[1]} frames")
# Always 24 frames regardless of video length

Example 3: Integration with prediction

from tools.caption.video_caption import load_video, predict

with open("input_video.mp4", "rb") as f:
    video_data = f.read()

# load_video is called internally by predict()
caption = predict(
    prompt="Please describe this video in detail.",
    video_data=video_data,
    temperature=0.1
)
print(caption)
