Principle:Zai org CogVideo Video Frame Extraction
| Attribute | Value |
|---|---|
| Principle Name | Video Frame Extraction |
| Workflow | Video Captioning |
| Step | 3 of 5 |
| Type | Data Input |
| Repository | zai-org/CogVideo |
| Paper | CogVLM2 |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for extracting representative frames from a video for input to a vision-language captioning model. Frame extraction samples a fixed number of frames from a video using configurable temporal sampling strategies.
Description
Frame extraction samples a fixed number of frames (24) from a video using one of two strategies:
- Chat mode (`strategy="chat"`): samples 1 frame per second, up to 24 frames. This strategy is better for natural scenes where temporal events occur at human-perceptible timescales.
- Base mode (`strategy="base"`): uniformly samples 24 frames across the entire video duration. This strategy provides even temporal coverage regardless of video length.
The frames are loaded from raw video bytes via decord's VideoReader and returned as a tensor in [C, T, H, W] format, suitable for input to the CogVLM2 vision encoder.
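The two sampling strategies can be sketched as index selection over the decoded frame count. This is a minimal illustration under stated assumptions, not the repository's actual code: the function name `sample_frame_indices` and its exact rounding behavior are hypothetical, though the 1-FPS and uniform strategies follow the description above.

```python
def sample_frame_indices(total_frames: int, fps: float,
                         strategy: str = "chat", num_frames: int = 24) -> list[int]:
    """Pick which frame indices to decode from a video.

    Illustrative sketch of the two strategies described above,
    not the repository's exact implementation.
    """
    if strategy == "chat":
        # 1 frame per second, capped at num_frames frames.
        seconds = int(total_frames / fps)
        indices = [min(int(round(s * fps)), total_frames - 1)
                   for s in range(min(seconds, num_frames))]
    elif strategy == "base":
        # num_frames indices spread uniformly over the whole video,
        # each taken from the middle of its segment.
        step = total_frames / num_frames
        indices = [min(int(step * i + step / 2), total_frames - 1)
                   for i in range(num_frames)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return indices

# Example: a 60-second clip at 30 fps (1800 frames).
chat_idx = sample_frame_indices(1800, 30.0, strategy="chat")  # 0, 30, 60, ...
base_idx = sample_frame_indices(1800, 30.0, strategy="base")  # 24 evenly spaced
```

In the actual pipeline, indices like these are passed to decord's `VideoReader.get_batch`, which returns a [T, H, W, C] batch of frames that is then permuted to the [C, T, H, W] layout expected by the CogVLM2 vision encoder.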
Usage
Use Video Frame Extraction after model loading and before caption generation. The extracted frames are passed to the model's predict function as part of the video input.
Theoretical Basis
Temporal sampling reduces the video to a fixed number of representative frames that capture the essential visual content:
- 1-frame-per-second sampling (chat mode): Captures temporal events at human-perceptible granularity. Since most meaningful visual changes in natural videos occur at timescales of seconds rather than milliseconds, 1 FPS sampling preserves nearly all semantically relevant information while dramatically reducing computational cost.
- Uniform sampling (base mode): Ensures equal representation of all temporal segments regardless of content density. This is more robust for videos with non-uniform temporal distributions of events.
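The coverage trade-off between the two strategies can be made concrete by computing the timestamps each one samples. This is a hedged sketch (the helper `sampled_timestamps` is hypothetical, assuming the 24-frame cap and 1-FPS rate described above): for a long video, chat mode's cap restricts coverage to the opening seconds, while base mode still spans the whole duration.

```python
def sampled_timestamps(duration_s: float, fps: float, strategy: str,
                       num_frames: int = 24) -> list[float]:
    """Return the video timestamps (seconds) of the sampled frames.

    Illustrative only: assumes the 1-FPS chat strategy and uniform base
    strategy with a 24-frame cap, as described in the text.
    """
    total = int(duration_s * fps)
    if strategy == "chat":
        # One frame per second, capped at num_frames.
        n = min(int(duration_s), num_frames)
        idx = [int(s * fps) for s in range(n)]
    else:
        # Uniform: num_frames mid-segment samples across the clip.
        step = total / num_frames
        idx = [min(int(step * i + step / 2), total - 1) for i in range(num_frames)]
    return [i / fps for i in idx]

# A 10-minute (600 s) video at 30 fps:
chat_ts = sampled_timestamps(600, 30.0, "chat")  # covers only seconds 0-23
base_ts = sampled_timestamps(600, 30.0, "base")  # spans nearly the full 600 s
```

This illustrates why uniform sampling is the more robust default for long or unevenly paced videos, while 1-FPS sampling gives denser coverage of short clips.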
The fixed frame count of 24 balances between:
- Sufficient temporal context: 24 frames provide enough temporal information for the model to understand actions, transitions, and temporal relationships.
- Computational feasibility: The vision encoder processes each frame independently, so 24 frames represent a manageable computational load.
Related Pages
- Implementation:Zai_org_CogVideo_Caption_Load_Video -- Implementation of frame extraction
- Zai_org_CogVideo_Caption_Model_Loading -- Previous step: loading the CogVLM2 model
- Zai_org_CogVideo_Caption_Generation -- Next step: generating captions from extracted frames