Workflow:Zai org CogVideo Video Captioning
| Knowledge Sources | |
|---|---|
| Domains | Video_Understanding, Data_Preparation, Captioning |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
End-to-end process for generating detailed text descriptions from video files using the CogVLM2 vision-language model, producing captions suitable for training CogVideoX.
Description
This workflow automates the creation of text captions from video files using the CogVLM2-LLaMA3 vision-language model. It extracts representative frames from each video, feeds them to the multimodal model with a captioning prompt, and generates detailed natural language descriptions. The resulting captions can be used directly as training data for CogVideoX fine-tuning workflows. This is a critical data preparation step for building custom fine-tuning datasets where manual captioning is impractical.
Usage
Run this workflow when you have a collection of video files that need text captions for fine-tuning CogVideoX. It is typically the first step in a dataset preparation pipeline, before the data is organized for either Diffusers-based or SAT-based fine-tuning. The output text files can be used directly as the `caption_column` input for the training workflows.
Execution Steps
Step 1: Environment Setup
Install the captioning dependencies listed in `tools/caption/requirements.txt`. The key dependency is the CogVLM2-LLaMA3-Caption model from THUDM, which is loaded through the transformers library with `trust_remote_code` enabled. Make sure the GPU has enough memory for the vision-language model, which runs in bfloat16 on Ampere or newer GPUs and in float16 on older GPUs.
Key considerations:
- Dependencies are in `tools/caption/requirements.txt`
- The CogVLM2 model requires significant VRAM (approximately 20-30 GB)
- Optional 4-bit or 8-bit quantization reduces memory use
- bfloat16 requires a CUDA-capable GPU with compute capability 8.0 or higher
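The precision rule described in this step can be sketched as a small helper. This is a minimal illustration rather than the captioning script's actual code: `pick_dtype_name` and `detect_dtype_name` are hypothetical names, and the only library calls assumed are `torch.cuda.is_available()` and `torch.cuda.get_device_capability()`.

```python
def pick_dtype_name(compute_capability_major: int) -> str:
    """Return the precision name for a GPU with the given major compute capability."""
    # Ampere and newer GPUs (compute capability 8.0+) get bfloat16;
    # older GPUs fall back to float16.
    return "bfloat16" if compute_capability_major >= 8 else "float16"


def detect_dtype_name() -> str:
    """Query the current CUDA device and pick a precision (requires PyTorch)."""
    import torch  # imported lazily so pick_dtype_name works without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device found; this sketch assumes a GPU.")
    major, _minor = torch.cuda.get_device_capability()
    return pick_dtype_name(major)
```

Keeping the decision in a pure function makes it easy to check the rule on a machine without a GPU before running the full pipeline.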
Step 2: Model Loading
Load the CogVLM2-LLaMA3-Caption model and its tokenizer from the Hugging Face Hub. The model is loaded with `trust_remote_code=True` so that transformers can run the custom architecture code shipped with the checkpoint. Precision is selected automatically from the GPU compute capability (bfloat16 for Ampere or newer, float16 otherwise).
Key considerations:
- Model path: `THUDM/cogvlm2-llama3-caption`
- Optional quantization (4-bit or 8-bit) reduces memory requirements
- The model is set to evaluation mode after loading
- Tokenizer is loaded from the same model path
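A loading routine along these lines might look as follows. It is a sketch under stated assumptions: `quantization_flags` and `load_captioner` are hypothetical helper names, quantization is assumed to go through `BitsAndBytesConfig` (which needs the bitsandbytes package), and the actual script may structure this differently.

```python
from typing import Optional

MODEL_PATH = "THUDM/cogvlm2-llama3-caption"


def quantization_flags(bits: Optional[int]) -> dict:
    """Map a requested bit width to BitsAndBytesConfig keyword arguments."""
    if bits is None:
        return {}
    if bits == 4:
        return {"load_in_4bit": True}
    if bits == 8:
        return {"load_in_8bit": True}
    raise ValueError("bits must be 4, 8, or None")


def load_captioner(dtype_name: str = "bfloat16", bits: Optional[int] = None):
    """Load tokenizer and model (requires torch and transformers)."""
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    # The tokenizer comes from the same model path as the weights.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    kwargs = {"torch_dtype": getattr(torch, dtype_name), "trust_remote_code": True}
    flags = quantization_flags(bits)
    if flags:
        kwargs["quantization_config"] = BitsAndBytesConfig(**flags)
    model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, **kwargs)
    if not flags:
        # Assumption: quantized loads are placed on the GPU by the loader,
        # so only the full-precision model is moved explicitly.
        model = model.to("cuda")
    return tokenizer, model.eval()  # evaluation mode, as described above
```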
Step 3: Video Frame Extraction
For each input video, extract representative frames using a configurable sampling strategy. The "chat" strategy samples one frame per second up to the maximum frame count. The "base" strategy uniformly samples frames from a specified time range. Frames are extracted using decord for efficient video decoding and assembled into a tensor.
Key considerations:
- Default strategy is "chat" (one frame per second)
- Maximum 24 frames are extracted per video
- Frames are arranged in CTHW format (channels, time, height, width)
- decord library handles efficient video decoding
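The two sampling strategies can be illustrated with index arithmetic alone. The sketch below is an approximation of the described behavior, not the script's exact logic: the function names are hypothetical, and rounding to the nearest frame is an assumption. The decord calls used (`VideoReader`, `get_batch`, `bridge.set_bridge`) are imported lazily so the index helpers run without decord installed.

```python
def chat_frame_indices(total_frames: int, fps: float, max_frames: int = 24) -> list:
    """"chat" strategy: one frame per second of video, capped at max_frames."""
    if total_frames <= 0 or fps <= 0:
        return []
    seconds = int(total_frames / fps)
    count = min(max(seconds, 1), max_frames)
    # Pick the frame nearest to each whole-second timestamp.
    return [min(int(round(s * fps)), total_frames - 1) for s in range(count)]


def base_frame_indices(total_frames: int, fps: float, start_s: float,
                       end_s: float, num_frames: int = 24) -> list:
    """"base" strategy: uniformly spaced frames inside [start_s, end_s]."""
    first = int(start_s * fps)
    last = min(int(end_s * fps), total_frames) - 1
    if num_frames < 1 or last < first:
        return []
    if num_frames == 1:
        return [first]
    step = (last - first) / (num_frames - 1)
    return [first + int(round(i * step)) for i in range(num_frames)]


def load_frames_cthw(video_path: str, indices: list):
    """Decode the selected frames and return a (C, T, H, W) tensor."""
    from decord import VideoReader, bridge, cpu

    bridge.set_bridge("torch")           # make get_batch return torch tensors
    reader = VideoReader(video_path, ctx=cpu(0))
    frames = reader.get_batch(indices)   # shape (T, H, W, C)
    return frames.permute(3, 0, 1, 2)    # shape (C, T, H, W)
```

In practice the frame count and frame rate would come from the reader itself, via `len(reader)` and `reader.get_avg_fps()`, before the indices are computed.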
Step 4: Caption Generation
Feed the extracted video frames and a captioning prompt to the CogVLM2 model. The model processes the visual input alongside the text prompt to generate a detailed natural language description of the video content. Generation parameters control output quality (temperature, top-k, max tokens).
Key considerations:
- Default prompt: "Please describe this video in detail."
- Generation uses `top_k=1`, which makes decoding effectively greedy and the output deterministic
- Maximum output length is 2048 tokens
- Temperature defaults to 0.1; because `top_k=1` always selects the most likely token, temperature has little practical effect on the result
- Inference runs with torch.no_grad() for memory efficiency
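The generation settings listed above can be collected into a keyword dictionary and passed to `generate`. The following is a hedged sketch: `build_generation_kwargs` and `generate_caption` are hypothetical names, mapping the 2048-token limit to `max_new_tokens` is an assumption, and `inputs` is assumed to be a dictionary of model-ready tensors (including `input_ids`) produced by the model's own preprocessing, which is not shown here.

```python
DEFAULT_PROMPT = "Please describe this video in detail."


def build_generation_kwargs(temperature: float = 0.1,
                            max_new_tokens: int = 2048) -> dict:
    """Collect the generation parameters described in this step."""
    return {
        "max_new_tokens": max_new_tokens,  # assumption: limit applies to new tokens
        "top_k": 1,                        # only the most likely token is kept
        "temperature": temperature,
    }


def generate_caption(model, tokenizer, inputs: dict, gen_kwargs: dict) -> str:
    """Run generation without gradients and decode only the new tokens."""
    import torch  # imported lazily so the kwargs helper works without PyTorch

    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
    # Drop the prompt tokens so only the generated caption is decoded.
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens[0], skip_special_tokens=True).strip()
```

Passing `skip_special_tokens=True` to `decode` is one common way to strip special tokens from the output text.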
Step 5: Caption Output
Collect the generated captions and write them to text files in the format expected by the CogVideoX training pipelines. Each caption corresponds to one video file and is stored as a single text entry.
Key considerations:
- Output format should match the `caption_column` format expected by the fine-tuning workflows
- One caption per line in the output text file
- Captions should be reviewed for quality before using in training
- Special tokens are stripped from the model output
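Writing one caption per line requires that no caption contains a line break. The sketch below shows one way to enforce that. The file names `videos.txt` and `prompts.txt`, and the helper names, are assumptions chosen for illustration; use whatever names the target training workflow expects.

```python
from pathlib import Path


def normalize_caption(text: str) -> str:
    """Collapse all whitespace, including newlines, so the caption fits on one line."""
    return " ".join(text.split())


def write_caption_files(captions: dict, out_dir: str) -> None:
    """Write parallel files: line N of each file refers to the same video.

    `captions` maps a video path to its generated caption.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    videos = sorted(captions)  # fixed order keeps the two files aligned
    (out / "videos.txt").write_text(
        "".join(v + "\n" for v in videos), encoding="utf-8"
    )
    (out / "prompts.txt").write_text(
        "".join(normalize_caption(captions[v]) + "\n" for v in videos),
        encoding="utf-8",
    )
```

Writing the video list alongside the captions makes it possible to spot-check any caption against its source video during quality review.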