Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Zai org CogVideo Video Captioning

From Leeroopedia
Revision as of 11:05, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Zai_org_CogVideo_Video_Captioning.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Video_Understanding, Data_Preparation, Captioning
Last Updated 2026-02-10 12:00 GMT

Overview

End-to-end process for generating detailed text descriptions from video files using the CogVLM2 vision-language model, producing captions suitable for training CogVideoX.

Description

This workflow automates the creation of text captions from video files using the CogVLM2-LLaMA3 vision-language model. It extracts representative frames from each video, feeds them to the multimodal model with a captioning prompt, and generates detailed natural language descriptions. The resulting captions can be used directly as training data for CogVideoX fine-tuning workflows. This is a critical data preparation step for building custom fine-tuning datasets where manual captioning is impractical.

Usage

Execute this workflow when you have a collection of video files that need text captions for fine-tuning CogVideoX. This is typically the first step in a dataset preparation pipeline, before organizing the data for either Diffusers-based or SAT-based fine-tuning. The output text files can be directly used as the caption_column input for the training workflows.

Execution Steps

Step 1: Environment Setup

Install the captioning dependencies specified in the tools/caption requirements file. The key dependency is the CogVLM2-LLaMA3-Caption model from THUDM, which requires the transformers library with trust_remote_code enabled. Ensure sufficient GPU memory for the vision-language model (bfloat16 on Ampere+ GPUs, float16 on older GPUs).

Key considerations:

  • Dependencies are in `tools/caption/requirements.txt`
  • CogVLM2 model requires significant VRAM (approximately 20-30GB)
  • Supports optional 4-bit or 8-bit quantization to reduce memory
  • Requires CUDA-capable GPU with compute capability 8.0+ for bf16

Step 2: Model Loading

Load the CogVLM2-LLaMA3-Caption model and its tokenizer from HuggingFace Hub. The model is loaded with `trust_remote_code=True` to enable the custom architecture. Precision is automatically selected based on GPU compute capability (bfloat16 for Ampere+, float16 otherwise).

Key considerations:

  • Model path: `THUDM/cogvlm2-llama3-caption`
  • Optional quantization (4-bit or 8-bit) reduces memory requirements
  • The model is set to evaluation mode after loading
  • Tokenizer is loaded from the same model path

Step 3: Video Frame Extraction

For each input video, extract representative frames using a configurable sampling strategy. The "chat" strategy samples one frame per second up to the maximum frame count. The "base" strategy uniformly samples frames from a specified time range. Frames are extracted using decord for efficient video decoding and assembled into a tensor.

Key considerations:

  • Default strategy is "chat" (one frame per second)
  • Maximum 24 frames are extracted per video
  • Frames are arranged in CTHW format (channels, time, height, width)
  • decord library handles efficient video decoding

Step 4: Caption Generation

Feed the extracted video frames and a captioning prompt to the CogVLM2 model. The model processes the visual input alongside the text prompt to generate a detailed natural language description of the video content. Generation parameters control output quality (temperature, top-k, max tokens).

Key considerations:

  • Default prompt: "Please describe this video in detail."
  • Generation uses greedy decoding (top_k=1) for deterministic output
  • Maximum output length is 2048 tokens
  • Temperature controls caption diversity (default 0.1 for consistency)
  • Inference runs with torch.no_grad() for memory efficiency

Step 5: Caption Output

Collect the generated captions and write them to text files in the format expected by the CogVideoX training pipelines. Each caption corresponds to one video file and is stored as a single text entry.

Key considerations:

  • Output format should match the caption_column format for fine-tuning
  • One caption per line in the output text file
  • Captions should be reviewed for quality before using in training
  • Special tokens are stripped from the model output

Execution Diagram

GitHub URL

Workflow Repository