
Implementation:Zai org CogVideo CogVLM2 Predict

From Leeroopedia


Attribute Value
Implementation Name CogVLM2 Predict
Workflow Video Captioning
Step 4 of 5
Type API Doc
Source File tools/caption/video_caption.py:L72-100
Repository zai-org/CogVideo
External Dependencies transformers, torch
Last Updated 2026-02-10 00:00 GMT

Overview

Implementation of the caption prediction function for the CogVLM2 video captioning pipeline. The predict function orchestrates video frame loading, input construction, and autoregressive text generation to produce a natural language description of the video content.

Description

The predict function:

  1. Calls load_video(video_data) to extract representative frames
  2. Uses the model's build_conversation_input_ids to construct multimodal input
  3. Moves all input tensors to the target device with appropriate dtypes
  4. Calls model.generate() with controlled generation parameters
  5. Decodes the generated token IDs to text using the tokenizer
  6. Returns the caption string

Key generation parameters are hardcoded for deterministic, high-quality captions:

  • max_new_tokens=2048
  • pad_token_id=128002 (a reserved special token ID from the Llama 3 vocabulary, used for padding)
  • top_k=1
  • do_sample=False (greedy decoding; with sampling disabled, top_k and top_p have no effect)
  • top_p=0.1
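The six steps and the hardcoded parameters above can be sketched as follows. This is a hedged reconstruction, not the repository's exact code: in video_caption.py, model, tokenizer, and load_video are module-level objects, but they are taken as parameters here so the sketch is self-contained, and the bfloat16 dtype is an assumption.

```python
def predict_sketch(prompt, video_data, temperature, model, tokenizer, load_video,
                   device="cuda"):
    """Illustrative reconstruction of the predict flow (names are assumptions)."""
    import torch  # local import so the sketch can be defined without torch installed

    # Step 1: sample representative frames from the raw video bytes
    video = load_video(video_data)

    # Step 2: build multimodal input IDs via the CogVLM2 conversation template
    inputs = model.build_conversation_input_ids(
        tokenizer=tokenizer,
        query=prompt,
        images=[video],
        template_version="chat",
    )

    # Step 3: move tensors to the target device; image features get the model dtype
    inputs = {
        "input_ids": inputs["input_ids"].unsqueeze(0).to(device),
        "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to(device),
        "attention_mask": inputs["attention_mask"].unsqueeze(0).to(device),
        "images": [[inputs["images"][0].to(device).to(torch.bfloat16)]],  # dtype assumed
    }

    # Step 4: generate with the hardcoded, effectively greedy settings
    gen_kwargs = {
        "max_new_tokens": 2048,
        "pad_token_id": 128002,
        "top_k": 1,
        "do_sample": False,   # greedy search: top_k/top_p/temperature are inert
        "top_p": 0.1,
        "temperature": temperature,
    }
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
        outputs = outputs[:, inputs["input_ids"].shape[1]:]  # strip prompt tokens

    # Steps 5-6: decode the newly generated tokens and return the caption
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Because generation is greedy, repeated calls on the same video and prompt produce the same caption.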

Usage

from tools.caption.video_caption import predict

with open("video.mp4", "rb") as f:
    video_data = f.read()

caption = predict(
    prompt="Please describe this video in detail.",
    video_data=video_data,
    temperature=0.1
)
print(caption)

Code Reference

Source Location

File Lines Description
tools/caption/video_caption.py L72-100 predict function

Signature

def predict(
    prompt: str,          # e.g. "Please describe this video in detail."
    video_data: bytes,    # Raw video file bytes
    temperature: float    # e.g. 0.1
) -> str:                 # Generated caption

Import

from tools.caption.video_caption import predict

I/O Contract

Inputs

Parameter Type Default Description
prompt str Required Instruction prompt for the model (e.g., "Please describe this video in detail.")
video_data bytes Required Raw video file bytes
temperature float Required Sampling temperature passed through to generation (typically 0.1; has no practical effect while do_sample=False)

Internal generation kwargs (hardcoded)

Parameter Value Description
max_new_tokens 2048 Maximum number of tokens to generate
pad_token_id 128002 Reserved Llama 3 special token ID used for padding
top_k 1 Keep only the single most probable token
do_sample False Disable stochastic sampling (greedy decoding)
top_p 0.1 Nucleus sampling threshold (inert while do_sample=False)
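For quick reference, the table above corresponds to a kwargs dict along these lines (a reconstruction from the documented values, not a verbatim copy of the source):

```python
# Hardcoded generation kwargs, reconstructed from the table above
gen_kwargs = {
    "max_new_tokens": 2048,   # cap on generated caption length
    "pad_token_id": 128002,   # reserved Llama 3 special token used for padding
    "top_k": 1,
    "do_sample": False,       # greedy search: top_k and top_p are carried but ignored
    "top_p": 0.1,
}
```

In Hugging Face transformers, do_sample=False selects greedy decoding, so the top_k and top_p entries are accepted by model.generate() but do not change the output.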

Outputs

Output Type Description
Return value str Generated caption text describing the video content

Usage Examples

Example 1: Basic caption generation

from tools.caption.video_caption import predict

with open("cooking_video.mp4", "rb") as f:
    video_data = f.read()

caption = predict(
    prompt="Please describe this video in detail.",
    video_data=video_data,
    temperature=0.1
)
print(caption)
# Output: "The video shows a person in a kitchen preparing a meal.
#          They begin by chopping vegetables on a wooden cutting board..."

Example 2: Specific aspect captioning

caption = predict(
    prompt="Describe the main actions happening in this video.",
    video_data=video_data,
    temperature=0.1
)

Example 3: Batch captioning multiple videos

import os

video_dir = "/data/videos/"
captions = {}

for filename in os.listdir(video_dir):
    if filename.endswith(".mp4"):
        with open(os.path.join(video_dir, filename), "rb") as f:
            video_data = f.read()
        caption = predict(
            prompt="Please describe this video in detail.",
            video_data=video_data,
            temperature=0.1
        )
        captions[filename] = caption
        print(f"{filename}: {caption[:100]}...")
