# Implementation: Zai org CogVideo Caption Load Video
| Attribute | Value |
|---|---|
| Implementation Name | Caption Load Video |
| Workflow | Video Captioning |
| Step | 3 of 5 |
| Type | API Doc |
| Source File | `tools/caption/video_caption.py:L25-57` |
| Repository | zai-org/CogVideo |
| External Dependencies | decord, numpy |
| Last Updated | 2026-02-10 00:00 GMT |
## Overview

Implementation of video frame extraction for the captioning pipeline. The `load_video` function extracts a fixed number of representative frames from raw video bytes using configurable temporal sampling strategies.
## Description

The `load_video` function:

- Creates a decord `VideoReader` from the raw video bytes
- Determines the total number of frames in the video
- Selects frame indices based on the chosen strategy:
  - Chat mode: computes 1-FPS sampling indices, capped at 24 frames
  - Base mode: computes uniform sampling indices for exactly 24 frames
- Extracts the selected frames using decord's batch frame access
- Returns the frames as a tensor in `[C, T, H, W]` format
The function accepts raw video bytes rather than a file path, enabling in-memory video processing without intermediate file I/O.
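The two sampling strategies described above reduce to simple index arithmetic. The following is a hypothetical sketch of that logic, not the repository's actual code; the function name `sample_frame_indices` and the `max_frames` parameter are illustrative:

```python
import numpy as np

def sample_frame_indices(
    total_frames: int,
    fps: float,
    strategy: str = "chat",
    max_frames: int = 24,
) -> np.ndarray:
    """Illustrative sketch of the two sampling strategies (not CogVideo's code)."""
    if strategy == "chat":
        # 1-FPS sampling: one frame per second of video, capped at max_frames.
        seconds = int(total_frames / fps)
        num = min(max(seconds, 1), max_frames)
        indices = np.round(np.arange(num) * fps).astype(int)
    elif strategy == "base":
        # Uniform sampling: exactly max_frames indices spread over the video.
        indices = np.linspace(0, total_frames - 1, max_frames).astype(int)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.clip(indices, 0, total_frames - 1)

# A 30-second clip at 30 FPS in chat mode hits the 24-frame cap.
indices = sample_frame_indices(total_frames=900, fps=30.0, strategy="chat")
print(len(indices))  # 24
```

In chat mode the frame count tracks the video's duration (shorter clips yield fewer frames), while base mode always returns the same count, which matches the per-strategy behavior documented in the examples below.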
## Usage

```python
from tools.caption.video_caption import load_video

with open("video.mp4", "rb") as f:
    video_data = f.read()

frames = load_video(video_data, strategy="chat")
# frames.shape: [3, 24, H, W]
```
## Code Reference

### Source Location

| File | Lines | Description |
|---|---|---|
| `tools/caption/video_caption.py` | L25-57 | `load_video` function |
Signature
def load_video(
video_data: bytes,
strategy: str = "chat" # "chat" or "base"
) -> torch.Tensor: # [C, T, H, W]
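Note the output layout: decord's batch frame access yields frames in `[T, H, W, C]` order, so producing the documented `[C, T, H, W]` tensor requires an axis permutation. A minimal sketch of that rearrangement, using a zero-filled stand-in array rather than real decoded frames:

```python
import numpy as np

# Stand-in for decord's batch output: 24 frames of 480x640 RGB, [T, H, W, C].
frames_thwc = np.zeros((24, 480, 640, 3), dtype=np.uint8)

# Rearrange to the [C, T, H, W] layout that load_video returns.
frames_cthw = np.transpose(frames_thwc, (3, 0, 1, 2))
print(frames_cthw.shape)  # (3, 24, 480, 640)
```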
### Import

```python
from tools.caption.video_caption import load_video
```
## I/O Contract

### Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| `video_data` | `bytes` | Required | Raw video file bytes (e.g., from `open("video.mp4", "rb").read()`) |
| `strategy` | `str` | `"chat"` | Sampling strategy: `"chat"` (1 FPS, up to 24 frames) or `"base"` (uniform 24 frames) |
### Outputs

| Output | Type | Description |
|---|---|---|
| Return value | `torch.Tensor` | Video frames tensor of shape `[C, T, H, W]`, where C=3 (RGB), T <= 24, and H, W are the original video dimensions |
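The output contract above can be checked with a few assertions. This sketch uses a zero-filled stand-in array in place of a real `load_video` return value, since decoding requires decord and an actual video file:

```python
import numpy as np

# Stand-in for load_video's return value (shape contract only).
frames = np.zeros((3, 24, 480, 640), dtype=np.uint8)

C, T, H, W = frames.shape
assert C == 3    # RGB channels
assert T <= 24   # at most 24 sampled frames in either strategy
print((C, T, H, W))  # (3, 24, 480, 640)
```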
## Usage Examples

### Example 1: Chat mode (1 FPS sampling)

```python
from tools.caption.video_caption import load_video

with open("natural_scene.mp4", "rb") as f:
    video_data = f.read()

frames = load_video(video_data, strategy="chat")
print(f"Extracted {frames.shape[1]} frames")
# For a 30-second video: 24 frames (capped)
# For a 10-second video: 10 frames
```

### Example 2: Base mode (uniform sampling)

```python
frames = load_video(video_data, strategy="base")
print(f"Extracted {frames.shape[1]} frames")
# Always 24 frames regardless of video length
```
### Example 3: Integration with prediction

```python
from tools.caption.video_caption import load_video, predict

with open("input_video.mp4", "rb") as f:
    video_data = f.read()

# load_video is called internally by predict()
caption = predict(
    prompt="Please describe this video in detail.",
    video_data=video_data,
    temperature=0.1
)
print(caption)
```
## Related Pages

- Principle:Zai_org_CogVideo_Video_Frame_Extraction -- principle governing video frame extraction
- Environment:Zai_org_CogVideo_Video_Captioning_Environment
- Heuristic:Zai_org_CogVideo_Decord_Import_Order_Bug
- Zai_org_CogVideo_CogVLM2_Model_Loading -- previous step: model loading
- Zai_org_CogVideo_CogVLM2_Predict -- next step: caption generation using extracted frames