Environment: Zai org CogVideo Video Captioning Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Video_Captioning, Vision_Language_Model, Data_Preparation |
| Last Updated | 2026-02-10 02:00 GMT |
Overview
Linux GPU environment with Python 3.8-3.11, PyTorch 2.1.0, CUDA, xformers, and CogVLM2 for generating text captions from video files to prepare training datasets.
Description
This environment provides the stack for running the CogVLM2-based video captioning pipeline. It uses the `THUDM/cogvlm2-llama3-caption` vision-language model to generate descriptive captions from video files. The environment is significantly more constrained than the other CogVideo environments, with `torch` pinned to 2.1.0 and `transformers` pinned to 4.42.4. It dynamically selects bf16 or fp16 precision based on GPU compute capability (Ampere and newer use bf16; older GPUs fall back to fp16). Optional 4-bit or 8-bit quantization is supported for memory-constrained GPUs.
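The precision rule can be mirrored as a small pure function. This is an illustrative sketch (the function name is not from the repo); the real pipeline queries `torch.cuda.get_device_capability()` directly, as shown under Code Evidence below.

```python
def choose_precision(compute_capability_major: int, cuda_available: bool = True) -> str:
    """Mirror of the pipeline's dtype rule: Ampere and newer (compute
    capability major version >= 8) get bf16; everything else gets fp16."""
    if cuda_available and compute_capability_major >= 8:
        return "bfloat16"
    return "float16"
```

For example, an A100 (capability 8.0) selects bf16, while a V100 (7.0) or T4 (7.5) selects fp16.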
Usage
Use this environment for the video captioning workflow to generate training labels. This is the prerequisite for running `tools/caption/video_caption.py`. The captioning environment is separate from the finetuning and inference environments due to incompatible pinned dependency versions.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | CUDA-compatible OS required |
| Hardware | NVIDIA GPU | Required; CPU fallback exists but is impractical |
| Hardware (full precision) | NVIDIA GPU >= 16GB VRAM | For loading CogVLM2 in fp16/bf16 |
| Hardware (quantized) | NVIDIA GPU >= 8GB VRAM | With 4-bit or 8-bit quantization |
| GPU Compute Capability | >= 8 for bf16 (Ampere+) | Falls back to fp16 on older GPUs (V100, etc.) |
| Python | >= 3.8, <= 3.11 | Explicitly constrained in requirements.txt |
| CUDA | Compatible with PyTorch 2.1.0 | Typically CUDA 11.8 or 12.1 |
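The Python window from the table can be checked up front before installing anything. A minimal sketch (the helper name is illustrative, not part of the repo):

```python
import sys

def python_supported(version=None) -> bool:
    """True if the interpreter falls inside the [3.8, 3.11] window
    declared in tools/caption/requirements.txt."""
    major, minor = (version or sys.version_info)[:2]
    return (3, 8) <= (major, minor) <= (3, 11)
```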
Dependencies
System Packages
- NVIDIA CUDA Toolkit (compatible with PyTorch 2.1.0)
Python Packages (from tools/caption/requirements.txt)
- `torch` == 2.1.0 (pinned)
- `torchvision` == 0.16.0 (pinned)
- `transformers` == 4.42.4 (pinned)
- `pytorchvideo` == 0.1.5 (pinned)
- `xformers` (no version specified)
- `huggingface-hub` >= 0.23.0
- `decord` >= 0.6.0
- `pillow` (no version specified)
- `timm` >= 0.9.16
- `einops` (no version specified)
- `pydantic` >= 2.7.1
- `openai` >= 1.30.1
- `loguru` >= 0.7.2
- `chainlit` >= 1.0
- `sse-starlette` >= 2.1.0
- `flask` (no version specified)
- `gunicorn` (no version specified)
- `gevent` (no version specified)
- `requests` (no version specified)
- `gradio` (no version specified)
Credentials
No API tokens are required. The CogVLM2 model is loaded from the public HuggingFace repository `THUDM/cogvlm2-llama3-caption`.
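Loading the model then looks roughly like the sketch below. Mapping `--quant` onto `load_in_4bit`/`load_in_8bit` is an assumption based on the quantization kwargs accepted by transformers 4.42, not a verified excerpt from `video_caption.py`; assembling the kwargs is split into a helper so the logic can be shown without downloading the model.

```python
def build_load_kwargs(quant: int = 0, dtype: str = "bfloat16") -> dict:
    """Assemble from_pretrained kwargs for THUDM/cogvlm2-llama3-caption.

    quant=4 or quant=8 requests bitsandbytes quantization (assumption:
    the load_in_4bit/load_in_8bit flags still accepted by transformers
    4.42.4); quant=0 loads full bf16/fp16 weights. trust_remote_code is
    needed because the model repo ships custom modeling code.
    """
    kwargs = {"trust_remote_code": True, "torch_dtype": dtype}
    if quant == 4:
        kwargs["load_in_4bit"] = True
    elif quant == 8:
        kwargs["load_in_8bit"] = True
    return kwargs

# Actual load (needs a GPU; ~16 GB VRAM unquantized, ~8 GB with quant=4):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "THUDM/cogvlm2-llama3-caption", **build_load_kwargs(quant=4)
# )
```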
Quick Install
# Install from requirements file (recommended)
pip install -r tools/caption/requirements.txt
# Or install manually
pip install torch==2.1.0 torchvision==0.16.0 transformers==4.42.4 \
    pytorchvideo==0.1.5 xformers "huggingface-hub>=0.23.0" "decord>=0.6.0" \
    pillow "timm>=0.9.16" einops "pydantic>=2.7.1" "loguru>=0.7.2"
Code Evidence
Dynamic dtype selection based on GPU capability from `tools/caption/video_caption.py:12-16`:
TORCH_TYPE = (
torch.bfloat16
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
else torch.float16
)
CUDA device detection from `tools/caption/video_caption.py:11`:
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
Python version constraint from `tools/caption/requirements.txt:2`:
# python version [3.8,3.11]
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'xformers'` | xformers not installed | `pip install xformers` (choose a build compatible with torch 2.1.0) |
| CUDA OOM loading CogVLM2 | GPU VRAM insufficient for full precision | Use `--quant 4` or `--quant 8` for quantized loading |
| `RuntimeError: bfloat16 not supported` | GPU compute capability < 8 | Code auto-detects and falls back to fp16; ensure torch is properly installed |
| Version conflict with finetuning env | Pinned torch 2.1.0 conflicts with finetuning torch >= 2.5.1 | Use a separate virtual environment for captioning |
Compatibility Notes
- Separate virtual environment required: The captioning environment pins `torch==2.1.0` and `transformers==4.42.4`, which conflict with the finetuning/inference environments. Always use a separate venv or conda environment.
- GPU Compute Capability >= 8 (Ampere+): Uses bf16 for faster processing. Older GPUs (V100, Tesla T4) automatically fall back to fp16.
- Quantization: Supports optional 4-bit or 8-bit quantization via `--quant` argument for memory-constrained GPUs.
- xformers: Required but unpinned. Must install a version compatible with torch 2.1.0.
- Python 3.12+: Not supported due to pytorchvideo and other pinned dependency constraints.
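The isolation advice above can be followed with a plain venv; a minimal sketch (the `.venv-caption` name is arbitrary):

```shell
# Create an isolated env so the pinned torch 2.1.0 stack cannot clash
# with the finetuning env's torch >= 2.5.1.
python3 -m venv .venv-caption
./.venv-caption/bin/python -m pip --version   # pip resolves inside the venv
```

Activate it with `. .venv-caption/bin/activate` before running `pip install -r tools/caption/requirements.txt`.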