Implementation:Zai org CogVideo CogVLM2 Predict
| Attribute | Value |
|---|---|
| Implementation Name | CogVLM2 Predict |
| Workflow | Video Captioning |
| Step | 4 of 5 |
| Type | API Doc |
| Source File | tools/caption/video_caption.py:L72-100 |
| Repository | zai-org/CogVideo |
| External Dependencies | transformers, torch |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implementation of the caption prediction function for the CogVLM2 video captioning pipeline. The predict function orchestrates video frame loading, input construction, and autoregressive text generation to produce a natural language description of the video content.
Description
The predict function:

- Calls `load_video(video_data)` to extract representative frames
- Uses the model's `build_conversation_input_ids` to construct multimodal input
- Moves all input tensors to the target device with appropriate dtypes
- Calls `model.generate()` with controlled generation parameters
- Decodes the generated token IDs to text using the tokenizer
- Returns the caption string

Key generation parameters are hardcoded for deterministic, high-quality captions: `max_new_tokens=2048`, `pad_token_id=128002` (Llama3 EOS token), `top_k=1`, `do_sample=False`, `top_p=0.1`.
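The flow of those steps can be sketched with stand-in components. This is a sketch of the control flow only: `load_video` and the "generation" and "decoding" steps below are toy stubs, not the real CogVLM2 objects from the source file.

```python
# Hardcoded generation kwargs as listed above.
GEN_KWARGS = {
    "max_new_tokens": 2048,
    "pad_token_id": 128002,  # Llama3 EOS token ID, reused as the pad token
    "top_k": 1,
    "do_sample": False,
    "top_p": 0.1,
}

def load_video(video_data: bytes) -> list:
    """Toy stand-in: treat every 4-byte chunk as one sampled frame."""
    return [video_data[i:i + 4] for i in range(0, len(video_data), 4)]

def predict_sketch(prompt: str, video_data: bytes, temperature: float) -> str:
    frames = load_video(video_data)                 # 1. frame extraction
    inputs = {"prompt": prompt, "frames": frames}   # 2. multimodal input (stub)
    # 3. device/dtype placement is omitted in this stub
    token_ids = list(range(len(inputs["frames"])))  # 4. "generate" (stub)
    return f"{len(token_ids)} frames described"     # 5. "decode" (stub)

caption = predict_sketch("Describe this video.", b"\x00" * 16, temperature=0.1)
```

The real function follows the same five-step shape, with `model.generate(**inputs, **GEN_KWARGS)` and `tokenizer.decode(...)` in place of the stubs.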
Usage
```python
from tools.caption.video_caption import predict

with open("video.mp4", "rb") as f:
    video_data = f.read()

caption = predict(
    prompt="Please describe this video in detail.",
    video_data=video_data,
    temperature=0.1
)
print(caption)
```
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| tools/caption/video_caption.py | L72-100 | predict function |
Signature
```python
def predict(
    prompt: str,        # e.g. "Please describe this video in detail."
    video_data: bytes,  # Raw video file bytes
    temperature: float  # e.g. 0.1
) -> str:               # Generated caption
```
Import
```python
from tools.caption.video_caption import predict
```
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | `str` | Required | Instruction prompt for the model (e.g., "Please describe this video in detail.") |
| `video_data` | `bytes` | Required | Raw video file bytes |
| `temperature` | `float` | Required | Temperature for generation (typically 0.1 for deterministic captions) |
Internal generation kwargs (hardcoded)
| Parameter | Value | Description |
|---|---|---|
| `max_new_tokens` | `2048` | Maximum number of tokens to generate |
| `pad_token_id` | `128002` | Llama3 EOS token ID used for padding |
| `top_k` | `1` | Greedy decoding (select most probable token) |
| `do_sample` | `False` | Disable stochastic sampling |
| `top_p` | `0.1` | Nucleus sampling threshold |
Outputs
| Output | Type | Description |
|---|---|---|
| Return value | `str` | Generated caption text describing the video content |
Usage Examples
Example 1: Basic caption generation
```python
from tools.caption.video_caption import predict

with open("cooking_video.mp4", "rb") as f:
    video_data = f.read()

caption = predict(
    prompt="Please describe this video in detail.",
    video_data=video_data,
    temperature=0.1
)
print(caption)
# Output: "The video shows a person in a kitchen preparing a meal.
# They begin by chopping vegetables on a wooden cutting board..."
```
Example 2: Specific aspect captioning
```python
caption = predict(
    prompt="Describe the main actions happening in this video.",
    video_data=video_data,
    temperature=0.1
)
```
Example 3: Batch captioning multiple videos
```python
import os

video_dir = "/data/videos/"
captions = {}
for filename in os.listdir(video_dir):
    if filename.endswith(".mp4"):
        with open(os.path.join(video_dir, filename), "rb") as f:
            video_data = f.read()
        caption = predict(
            prompt="Please describe this video in detail.",
            video_data=video_data,
            temperature=0.1
        )
        captions[filename] = caption
        print(f"{filename}: {caption[:100]}...")
```
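For reproducible batch runs, the loop above can be factored into a helper that iterates in sorted order and takes the caption function as an argument. `caption_directory` is a hypothetical refactoring sketched here, not part of the source tool; injecting `predict_fn` also lets the loop be exercised without loading the real model.

```python
import os

def caption_directory(video_dir, predict_fn,
                      prompt="Please describe this video in detail."):
    """Caption every .mp4 in video_dir in deterministic (sorted) order.

    predict_fn is injected so the loop can run without the real model.
    """
    captions = {}
    for filename in sorted(os.listdir(video_dir)):
        if not filename.endswith(".mp4"):
            continue
        with open(os.path.join(video_dir, filename), "rb") as f:
            video_data = f.read()
        captions[filename] = predict_fn(
            prompt=prompt, video_data=video_data, temperature=0.1
        )
    return captions
```

In production this would be called as `caption_directory("/data/videos/", predict)`.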
Related Pages
- Principle:Zai_org_CogVideo_Caption_Generation -- Principle governing caption generation
- Environment:Zai_org_CogVideo_Video_Captioning_Environment
- Zai_org_CogVideo_Caption_Load_Video -- Frame extraction called internally by predict
- Zai_org_CogVideo_Caption_File_Output -- Next step: saving the generated caption
- Zai_org_CogVideo_CogVLM2_Model_Loading -- Model loading that provides the model and tokenizer used by predict