
Principle:Zai org CogVideo Video Frame Extraction

From Leeroopedia


Principle Name: Video Frame Extraction
Workflow: Video Captioning
Step: 3 of 5
Type: Data Input
Repository: zai-org/CogVideo
Paper: CogVLM2
Last Updated: 2026-02-10 00:00 GMT

Overview

Technique for extracting representative frames from a video for input to a vision-language captioning model. Frame extraction samples a fixed number of frames from a video using configurable temporal sampling strategies.

Description

Frame extraction samples a fixed number of frames (24) from a video using one of two strategies:

  1. Chat mode (strategy="chat"): Samples 1 frame per second up to 24 frames. This strategy is better for natural scenes where temporal events occur at human-perceptible timescales.
  2. Base mode (strategy="base"): Uniformly samples 24 frames across the entire video duration. This strategy provides even temporal coverage regardless of video length.
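The two sampling strategies above can be sketched as index computations. This is a minimal sketch: the function name `sample_frame_indices` and the exact capping behavior in chat mode are assumptions for illustration, not the repo's actual code.

```python
def sample_frame_indices(total_frames, fps, strategy="chat", num_frames=24):
    """Compute which frame indices to extract from a video.

    Illustrative sketch of the two strategies; names and exact
    capping behavior are assumptions, not zai-org/CogVideo's code.
    """
    if strategy == "chat":
        # 1 frame per second, capped at num_frames
        # (assumed here to mean the first num_frames seconds)
        n_seconds = int(total_frames / fps)
        n = min(n_seconds, num_frames)
        return [min(round(s * fps), total_frames - 1) for s in range(n)]
    if strategy == "base":
        # num_frames indices spread uniformly over the whole duration
        step = total_frames / num_frames
        return [min(int(i * step), total_frames - 1) for i in range(num_frames)]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

For a 10-second clip at 30 fps, chat mode yields 10 indices (one per second), while base mode always yields 24 evenly spaced indices regardless of duration.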

The frames are loaded from raw video bytes via decord's VideoReader and returned as a tensor in [C, T, H, W] format, suitable for input to the CogVLM2 vision encoder.
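The [C, T, H, W] layout can be illustrated with a small NumPy sketch. Note this is a stand-in: the repo itself decodes frames with decord's VideoReader and works with torch tensors; NumPy is used here only to show the axis permutation.

```python
import numpy as np

def frames_to_cthw(frames):
    """Stack decoded frames (each H x W x C) into a [C, T, H, W] array.

    Illustrative stand-in for the repo's decord + torch pipeline;
    only the axis ordering is the point here.
    """
    thwc = np.stack(frames)            # [T, H, W, C]
    return thwc.transpose(3, 0, 1, 2)  # [C, T, H, W]
```

For 24 RGB frames of size 224 x 224, this yields an array of shape (3, 24, 224, 224): channels first, then time, then spatial dimensions.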

Usage

Use Video Frame Extraction after model loading and before caption generation. The extracted frames are passed to the predict function as part of the video data.

Theoretical Basis

Temporal sampling reduces the video to a fixed number of representative frames that capture the essential visual content:

  • 1-frame-per-second sampling (chat mode): Captures temporal events at human-perceptible granularity. Since most meaningful visual changes in natural videos occur at timescales of seconds rather than milliseconds, 1 FPS sampling preserves nearly all semantically relevant information while dramatically reducing computational cost.
  • Uniform sampling (base mode): Ensures equal representation of all temporal segments regardless of content density. This is more robust for videos with non-uniform temporal distributions of events.
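The trade-off between the two samplings can be made concrete with a little arithmetic. The clip length and frame rate below are illustrative values, not figures from the repo.

```python
# A 2-minute clip at 30 fps has 3600 frames; both strategies pick 24 indices.
total_frames, fps, num_frames = 3600, 30, 24

# chat: 1 FPS, so only the first 24 seconds are represented
chat = [s * fps for s in range(num_frames)]

# base: uniform spacing, so the full 2 minutes are represented
base = [int(i * total_frames / num_frames) for i in range(num_frames)]

print(max(chat) / fps)  # 23.0 -> chat's last sampled frame is at second 23
print(max(base) / fps)  # 115.0 -> base's last sampled frame is at second 115
```

On long videos, chat mode's cap means later content is never seen, while base mode trades per-second granularity for full-duration coverage.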

The fixed frame count of 24 balances two considerations:

  • Sufficient temporal context: 24 frames provide enough temporal information for the model to understand actions, transitions, and temporal relationships.
  • Computational feasibility: The vision encoder processes each frame independently, so 24 frames represent a manageable computational load.
