
Implementation:Zai org CogVideo CogVLM2 Predict

From Leeroopedia


Attribute Value
Implementation Name CogVLM2 Predict
Workflow Video Captioning
Step 4 of 5
Type API Doc
Source File tools/caption/video_caption.py:L72-100
Repository zai-org/CogVideo
External Dependencies transformers, torch
Last Updated 2026-02-10 00:00 GMT

Overview

Implementation of the caption prediction function for the CogVLM2 video captioning pipeline. The predict function orchestrates video frame loading, input construction, and autoregressive text generation to produce a natural language description of the video content.

Description

The predict function:

  1. Calls load_video(video_data) to extract representative frames
  2. Uses the model's build_conversation_input_ids to construct multimodal input
  3. Moves all input tensors to the target device with appropriate dtypes
  4. Calls model.generate() with controlled generation parameters
  5. Decodes the generated token IDs to text using the tokenizer
  6. Returns the caption string

Key generation parameters are hardcoded for deterministic, high-quality captions:

  • max_new_tokens=2048
  • pad_token_id=128002 (a reserved special token ID from the Llama 3 vocabulary, used for padding)
  • top_k=1
  • do_sample=False (greedy decoding; with sampling disabled, top_k and top_p have no effect)
  • top_p=0.1
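The six steps and the hardcoded parameters above can be sketched as follows. This is a hedged reconstruction, not the repository's exact code: in video_caption.py, model, tokenizer, and load_video are module-level objects, but they are taken as parameters here so the sketch is self-contained, and the bfloat16 dtype is an assumption.

```python
def predict_sketch(prompt, video_data, temperature, model, tokenizer, load_video,
                   device="cuda"):
    """Illustrative reconstruction of the predict flow (names are assumptions)."""
    import torch  # local import so the sketch can be defined without torch installed

    # Step 1: sample representative frames from the raw video bytes
    video = load_video(video_data)

    # Step 2: build multimodal input IDs via the CogVLM2 conversation template
    inputs = model.build_conversation_input_ids(
        tokenizer=tokenizer,
        query=prompt,
        images=[video],
        template_version="chat",
    )

    # Step 3: move tensors to the target device; image features get the model dtype
    inputs = {
        "input_ids": inputs["input_ids"].unsqueeze(0).to(device),
        "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to(device),
        "attention_mask": inputs["attention_mask"].unsqueeze(0).to(device),
        "images": [[inputs["images"][0].to(device).to(torch.bfloat16)]],  # dtype assumed
    }

    # Step 4: generate with the hardcoded, effectively greedy settings
    gen_kwargs = {
        "max_new_tokens": 2048,
        "pad_token_id": 128002,
        "top_k": 1,
        "do_sample": False,   # greedy search: top_k/top_p/temperature are inert
        "top_p": 0.1,
        "temperature": temperature,
    }
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
        outputs = outputs[:, inputs["input_ids"].shape[1]:]  # strip prompt tokens

    # Steps 5-6: decode the newly generated tokens and return the caption
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Because generation is greedy, repeated calls on the same video and prompt produce the same caption.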

Usage

from tools.caption.video_caption import predict

with open("video.mp4", "rb") as f:
    video_data = f.read()

caption = predict(
    prompt="Please describe this video in detail.",
    video_data=video_data,
    temperature=0.1
)
print(caption)

Code Reference

Source Location

File Lines Description
tools/caption/video_caption.py L72-100 predict function

Signature

def predict(
    prompt: str,          # e.g. "Please describe this video in detail."
    video_data: bytes,    # Raw video file bytes
    temperature: float    # e.g. 0.1
) -> str:                 # Generated caption

Import

from tools.caption.video_caption import predict

I/O Contract

Inputs

Parameter Type Default Description
prompt str Required Instruction prompt for the model (e.g., "Please describe this video in detail.")
video_data bytes Required Raw video file bytes
temperature float Required Sampling temperature passed through to generation (typically 0.1; has no practical effect while do_sample=False)

Internal generation kwargs (hardcoded)

Parameter Value Description
max_new_tokens 2048 Maximum number of tokens to generate
pad_token_id 128002 Reserved Llama 3 special token ID used for padding
top_k 1 Keep only the single most probable token
do_sample False Disable stochastic sampling (greedy decoding)
top_p 0.1 Nucleus sampling threshold (inert while do_sample=False)
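For quick reference, the table above corresponds to a kwargs dict along these lines (a reconstruction from the documented values, not a verbatim copy of the source):

```python
# Hardcoded generation kwargs, reconstructed from the table above
gen_kwargs = {
    "max_new_tokens": 2048,   # cap on generated caption length
    "pad_token_id": 128002,   # reserved Llama 3 special token used for padding
    "top_k": 1,
    "do_sample": False,       # greedy search: top_k and top_p are carried but ignored
    "top_p": 0.1,
}
```

In Hugging Face transformers, do_sample=False selects greedy decoding, so the top_k and top_p entries are accepted by model.generate() but do not change the output.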

Outputs

Output Type Description
Return value str Generated caption text describing the video content

Usage Examples

Example 1: Basic caption generation

from tools.caption.video_caption import predict

with open("cooking_video.mp4", "rb") as f:
    video_data = f.read()

caption = predict(
    prompt="Please describe this video in detail.",
    video_data=video_data,
    temperature=0.1
)
print(caption)
# Output: "The video shows a person in a kitchen preparing a meal.
#          They begin by chopping vegetables on a wooden cutting board..."

Example 2: Specific aspect captioning

caption = predict(
    prompt="Describe the main actions happening in this video.",
    video_data=video_data,
    temperature=0.1
)

Example 3: Batch captioning multiple videos

import os

video_dir = "/data/videos/"
captions = {}

for filename in os.listdir(video_dir):
    if filename.endswith(".mp4"):
        with open(os.path.join(video_dir, filename), "rb") as f:
            video_data = f.read()
        caption = predict(
            prompt="Please describe this video in detail.",
            video_data=video_data,
            temperature=0.1
        )
        captions[filename] = caption
        print(f"{filename}: {caption[:100]}...")
