Principle:Zai org CogVideo Video Frame Extraction
| Attribute | Value |
|---|---|
| Principle Name | Video Frame Extraction |
| Workflow | Video Captioning |
| Step | 3 of 5 |
| Type | Data Input |
| Repository | zai-org/CogVideo |
| Paper | CogVLM2 |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for extracting representative frames from a video for input to a vision-language captioning model. Frame extraction samples a fixed number of frames from a video using configurable temporal sampling strategies.
Description
Frame extraction samples a fixed number of frames (24) from a video using one of two strategies:
- Chat mode (`strategy="chat"`): samples 1 frame per second, up to 24 frames. This strategy is better for natural scenes where temporal events occur at human-perceptible timescales.
- Base mode (`strategy="base"`): uniformly samples 24 frames across the entire video duration. This strategy provides even temporal coverage regardless of video length.
The frames are loaded from raw video bytes via decord's VideoReader and returned as a tensor in [C, T, H, W] format, suitable for input to the CogVLM2 vision encoder.
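The two sampling strategies can be sketched as index selection over the decoded frame count. This is a minimal illustration under stated assumptions, not the repository's actual code: the function name `sample_frame_indices` and its exact rounding behavior are hypothetical, though the 1-FPS and uniform strategies follow the description above.

```python
def sample_frame_indices(total_frames: int, fps: float,
                         strategy: str = "chat", num_frames: int = 24) -> list[int]:
    """Pick which frame indices to decode from a video.

    Illustrative sketch of the two strategies described above,
    not the repository's exact implementation.
    """
    if strategy == "chat":
        # 1 frame per second, capped at num_frames frames.
        seconds = int(total_frames / fps)
        indices = [min(int(round(s * fps)), total_frames - 1)
                   for s in range(min(seconds, num_frames))]
    elif strategy == "base":
        # num_frames indices spread uniformly over the whole video,
        # each taken from the middle of its segment.
        step = total_frames / num_frames
        indices = [min(int(step * i + step / 2), total_frames - 1)
                   for i in range(num_frames)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return indices

# Example: a 60-second clip at 30 fps (1800 frames).
chat_idx = sample_frame_indices(1800, 30.0, strategy="chat")  # 0, 30, 60, ...
base_idx = sample_frame_indices(1800, 30.0, strategy="base")  # 24 evenly spaced
```

In the actual pipeline, indices like these are passed to decord's `VideoReader.get_batch`, which returns a [T, H, W, C] batch of frames that is then permuted to the [C, T, H, W] layout expected by the CogVLM2 vision encoder.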
Usage
Use Video Frame Extraction after model loading and before caption generation. The extracted frames are passed to the model's predict function as part of the video input.
Theoretical Basis
Temporal sampling reduces the video to a fixed number of representative frames that capture the essential visual content:
- 1-frame-per-second sampling (chat mode): Captures temporal events at human-perceptible granularity. Since most meaningful visual changes in natural videos occur at timescales of seconds rather than milliseconds, 1 FPS sampling preserves nearly all semantically relevant information while dramatically reducing computational cost.
- Uniform sampling (base mode): Ensures equal representation of all temporal segments regardless of content density. This is more robust for videos with non-uniform temporal distributions of events.
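The coverage trade-off between the two strategies can be made concrete by computing the timestamps each one samples. This is a hedged sketch (the helper `sampled_timestamps` is hypothetical, assuming the 24-frame cap and 1-FPS rate described above): for a long video, chat mode's cap restricts coverage to the opening seconds, while base mode still spans the whole duration.

```python
def sampled_timestamps(duration_s: float, fps: float, strategy: str,
                       num_frames: int = 24) -> list[float]:
    """Return the video timestamps (seconds) of the sampled frames.

    Illustrative only: assumes the 1-FPS chat strategy and uniform base
    strategy with a 24-frame cap, as described in the text.
    """
    total = int(duration_s * fps)
    if strategy == "chat":
        # One frame per second, capped at num_frames.
        n = min(int(duration_s), num_frames)
        idx = [int(s * fps) for s in range(n)]
    else:
        # Uniform: num_frames mid-segment samples across the clip.
        step = total / num_frames
        idx = [min(int(step * i + step / 2), total - 1) for i in range(num_frames)]
    return [i / fps for i in idx]

# A 10-minute (600 s) video at 30 fps:
chat_ts = sampled_timestamps(600, 30.0, "chat")  # covers only seconds 0-23
base_ts = sampled_timestamps(600, 30.0, "base")  # spans nearly the full 600 s
```

This illustrates why uniform sampling is the more robust default for long or unevenly paced videos, while 1-FPS sampling gives denser coverage of short clips.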
The fixed frame count of 24 balances between:
- Sufficient temporal context: 24 frames provide enough temporal information for the model to understand actions, transitions, and temporal relationships.
- Computational feasibility: The vision encoder processes each frame independently, so 24 frames represent a manageable computational load.
Related Pages
- Implementation:Zai_org_CogVideo_Caption_Load_Video -- Implementation of frame extraction
- Zai_org_CogVideo_Caption_Model_Loading -- Previous step: loading the CogVLM2 model
- Zai_org_CogVideo_Caption_Generation -- Next step: generating captions from extracted frames