
Environment:Zai org CogVideo Video Captioning Environment

From Leeroopedia


Knowledge Sources
Domains: Video_Captioning, Vision_Language_Model, Data_Preparation
Last Updated: 2026-02-10 02:00 GMT

Overview

A Linux GPU environment with Python 3.8-3.11, pinned PyTorch 2.1.0, CUDA, xformers, and CogVLM2, used to generate text captions from video files when preparing training datasets.

Description

This environment provides the software stack for the CogVLM2-based video captioning pipeline. It uses the `THUDM/cogvlm2-llama3-caption` vision-language model to generate descriptive captions from video files. The environment is significantly more constrained than the other CogVideo environments, pinning PyTorch to 2.1.0 and transformers to 4.42.4. It dynamically selects bf16 or fp16 precision based on GPU compute capability (bf16 on Ampere and newer, fp16 on older GPUs). Optional 4-bit or 8-bit quantization is supported for memory-constrained GPUs.
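The precision choice described above can be sketched as a pure function. The name `pick_dtype` is illustrative; the actual selection logic lives in `tools/caption/video_caption.py` and is quoted under Code Evidence below.

```python
def pick_dtype(cuda_available: bool, cc_major: int) -> str:
    """Mirror the pipeline's precision choice.

    bf16 is used only on CUDA GPUs with compute capability >= 8 (Ampere+);
    everything else, including CPU-only hosts, falls back to fp16.
    """
    return "bfloat16" if cuda_available and cc_major >= 8 else "float16"
```

For example, an A100 (compute capability 8.0) gets bf16, while a V100 (7.0) or T4 (7.5) gets fp16. bf16 is preferred where available because its wider exponent range avoids the overflow issues fp16 can hit during inference.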

Usage

Use this environment for the video captioning workflow that generates training labels. It is the prerequisite for running `tools/caption/video_caption.py`. The captioning environment is kept separate from the finetuning and inference environments because their pinned dependency versions are incompatible.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | CUDA-compatible OS required |
| Hardware | NVIDIA GPU | Required; CPU fallback exists but is impractical |
| Hardware (full precision) | NVIDIA GPU with >= 16 GB VRAM | For loading CogVLM2 in fp16/bf16 |
| Hardware (quantized) | NVIDIA GPU with >= 8 GB VRAM | With 4-bit or 8-bit quantization |
| GPU | Compute capability >= 8 for bf16 (Ampere+) | Falls back to fp16 on older GPUs (V100, etc.) |
| Python | >= 3.8, <= 3.11 | Explicitly constrained in requirements.txt |
| CUDA | Compatible with PyTorch 2.1.0 | Typically CUDA 11.8 or 12.1 |
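A minimal preflight check for the Python constraint can be written without any third-party imports. The range comes from `tools/caption/requirements.txt`; the function name is illustrative.

```python
import sys

def python_version_ok(version_info=sys.version_info) -> bool:
    """requirements.txt constrains the interpreter to Python [3.8, 3.11]."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and 8 <= minor <= 11

if __name__ == "__main__":
    print("python ok:", python_version_ok())
```

Running this before `pip install` saves a long dependency-resolution failure on Python 3.12+, where pytorchvideo 0.1.5 cannot be installed.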

Dependencies

System Packages

  • NVIDIA CUDA Toolkit (compatible with PyTorch 2.1.0)

Python Packages (from tools/caption/requirements.txt)

  • `torch` == 2.1.0 (pinned)
  • `torchvision` == 0.16.0 (pinned)
  • `transformers` == 4.42.4 (pinned)
  • `pytorchvideo` == 0.1.5 (pinned)
  • `xformers` (no version specified)
  • `huggingface-hub` >= 0.23.0
  • `decord` >= 0.6.0
  • `pillow` (no version specified)
  • `timm` >= 0.9.16
  • `einops` (no version specified)
  • `pydantic` >= 2.7.1
  • `openai` >= 1.30.1
  • `loguru` >= 0.7.2
  • `chainlit` >= 1.0
  • `sse-starlette` >= 2.1.0
  • `flask` (no version specified)
  • `gunicorn` (no version specified)
  • `gevent` (no version specified)
  • `requests` (no version specified)
  • `gradio` (no version specified)
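The four pinned packages above are the ones most likely to drift when an environment is reused. A sketch of a version check using the standard-library `importlib.metadata` (the pin table is copied from the requirements list; function names are illustrative):

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional

# Pins copied from tools/caption/requirements.txt (see the list above).
PINS = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "transformers": "4.42.4",
    "pytorchvideo": "0.1.5",
}

def check_pins(installed: Dict[str, Optional[str]]) -> List[str]:
    """Return a human-readable line for every pin that is missing or mismatched."""
    problems = []
    for pkg, pin in PINS.items():
        got = installed.get(pkg)
        if got is None:
            problems.append(f"{pkg}: not installed (need =={pin})")
        elif got != pin:
            problems.append(f"{pkg}: {got} installed, need =={pin}")
    return problems

def installed_versions() -> Dict[str, Optional[str]]:
    """Look up the actually-installed version of each pinned package."""
    found = {}
    for pkg in PINS:
        try:
            found[pkg] = version(pkg)
        except PackageNotFoundError:
            found[pkg] = None
    return found
```

Calling `check_pins(installed_versions())` in an activated environment returns an empty list when the pinned stack is intact.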

Credentials

No API tokens are required. The CogVLM2 model is loaded from the public HuggingFace repository `THUDM/cogvlm2-llama3-caption`.

Quick Install

# Install from requirements file (recommended)
pip install -r tools/caption/requirements.txt

# Or install manually (quote the >= specifiers so the shell does not treat '>' as a redirect)
pip install torch==2.1.0 torchvision==0.16.0 transformers==4.42.4 \
    pytorchvideo==0.1.5 xformers 'huggingface-hub>=0.23.0' 'decord>=0.6.0' \
    pillow 'timm>=0.9.16' einops 'pydantic>=2.7.1' 'loguru>=0.7.2'

Code Evidence

Dynamic dtype selection based on GPU capability from `tools/caption/video_caption.py:12-16`:

TORCH_TYPE = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
    else torch.float16
)

CUDA device detection from `tools/caption/video_caption.py:11`:

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

Python version constraint from `tools/caption/requirements.txt:2`:

# python version [3.8,3.11]

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: xformers` | xformers not installed | `pip install xformers` (version must match torch 2.1.0) |
| CUDA OOM when loading CogVLM2 | GPU VRAM insufficient for full precision | Use `--quant 4` or `--quant 8` for quantized loading |
| `RuntimeError: bfloat16 not supported` | GPU compute capability < 8 | The code auto-detects this and falls back to fp16; ensure torch is properly installed |
| Version conflict with the finetuning environment | Pinned torch 2.1.0 conflicts with finetuning's torch >= 2.5.1 | Use a separate virtual environment for captioning |

Compatibility Notes

  • Separate virtual environment required: The captioning environment pins `torch==2.1.0` and `transformers==4.42.4`, which conflict with the finetuning/inference environments. Always use a separate venv or conda environment.
  • GPU Compute Capability >= 8 (Ampere+): Uses bf16 for faster processing. Older GPUs (V100, Tesla T4) automatically fall back to fp16.
  • Quantization: Supports optional 4-bit or 8-bit quantization via `--quant` argument for memory-constrained GPUs.
  • xformers: Required but unpinned. Must install a version compatible with torch 2.1.0.
  • Python 3.12+: Not supported due to pytorchvideo and other pinned dependency constraints.
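The separate-environment requirement above can be satisfied with a plain venv; the directory name `.venv-caption` is illustrative.

```shell
# Keep the pinned captioning stack (torch==2.1.0) out of the
# finetuning environment (torch >= 2.5.1) by giving it its own venv.
python3 -m venv .venv-caption
. .venv-caption/bin/activate
python -c 'import sys; print(sys.prefix)'  # confirms the venv is active
# Inside this venv, install the pinned captioning stack:
# pip install -r tools/caption/requirements.txt
deactivate
```

A named conda environment works equally well; the only requirement is that the captioning pins never land in the same site-packages as the finetuning stack.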
