Environment: Zai org CogVideo Video Captioning Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Video_Captioning, Vision_Language_Model, Data_Preparation |
| Last Updated | 2026-02-10 02:00 GMT |
Overview
Linux GPU environment with Python 3.8-3.11, PyTorch 2.1.0, CUDA, xformers, and CogVLM2 for generating text captions from video files to prepare training datasets.
Description
This environment provides the stack for running the CogVLM2-based video captioning pipeline. It uses the `THUDM/cogvlm2-llama3-caption` vision-language model to generate descriptive captions from video files. The environment is significantly more constrained than the other CogVideo environments, with `torch` pinned to 2.1.0 and `transformers` pinned to 4.42.4. It dynamically selects bf16 or fp16 precision based on GPU compute capability (Ampere and newer use bf16; older GPUs fall back to fp16). Optional 4-bit or 8-bit quantization is supported for memory-constrained GPUs.
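The precision rule can be mirrored as a small pure function. This is an illustrative sketch (the function name is not from the repo); the real pipeline queries `torch.cuda.get_device_capability()` directly, as shown under Code Evidence below.

```python
def choose_precision(compute_capability_major: int, cuda_available: bool = True) -> str:
    """Mirror of the pipeline's dtype rule: Ampere and newer (compute
    capability major version >= 8) get bf16; everything else gets fp16."""
    if cuda_available and compute_capability_major >= 8:
        return "bfloat16"
    return "float16"
```

For example, an A100 (capability 8.0) selects bf16, while a V100 (7.0) or T4 (7.5) selects fp16.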
Usage
Use this environment for the video captioning workflow to generate training labels. This is the prerequisite for running `tools/caption/video_caption.py`. The captioning environment is separate from the finetuning and inference environments due to incompatible pinned dependency versions.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | CUDA-compatible OS required |
| Hardware | NVIDIA GPU | Required; CPU fallback exists but is impractical |
| Hardware (full precision) | NVIDIA GPU >= 16GB VRAM | For loading CogVLM2 in fp16/bf16 |
| Hardware (quantized) | NVIDIA GPU >= 8GB VRAM | With 4-bit or 8-bit quantization |
| GPU Compute Capability | >= 8 for bf16 (Ampere+) | Falls back to fp16 on older GPUs (V100, etc.) |
| Python | >= 3.8, <= 3.11 | Explicitly constrained in requirements.txt |
| CUDA | Compatible with PyTorch 2.1.0 | Typically CUDA 11.8 or 12.1 |
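The Python window from the table can be checked up front before installing anything. A minimal sketch (the helper name is illustrative, not part of the repo):

```python
import sys

def python_supported(version=None) -> bool:
    """True if the interpreter falls inside the [3.8, 3.11] window
    declared in tools/caption/requirements.txt."""
    major, minor = (version or sys.version_info)[:2]
    return (3, 8) <= (major, minor) <= (3, 11)
```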
Dependencies
System Packages
- NVIDIA CUDA Toolkit (compatible with PyTorch 2.1.0)
Python Packages (from tools/caption/requirements.txt)
- `torch` == 2.1.0 (pinned)
- `torchvision` == 0.16.0 (pinned)
- `transformers` == 4.42.4 (pinned)
- `pytorchvideo` == 0.1.5 (pinned)
- `xformers` (no version specified)
- `huggingface-hub` >= 0.23.0
- `decord` >= 0.6.0
- `pillow` (no version specified)
- `timm` >= 0.9.16
- `einops` (no version specified)
- `pydantic` >= 2.7.1
- `openai` >= 1.30.1
- `loguru` >= 0.7.2
- `chainlit` >= 1.0
- `sse-starlette` >= 2.1.0
- `flask` (no version specified)
- `gunicorn` (no version specified)
- `gevent` (no version specified)
- `requests` (no version specified)
- `gradio` (no version specified)
Credentials
No API tokens are required. The CogVLM2 model is loaded from the public HuggingFace repository `THUDM/cogvlm2-llama3-caption`.
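Loading the model then looks roughly like the sketch below. Mapping `--quant` onto `load_in_4bit`/`load_in_8bit` is an assumption based on the quantization kwargs accepted by transformers 4.42, not a verified excerpt from `video_caption.py`; assembling the kwargs is split into a helper so the logic can be shown without downloading the model.

```python
def build_load_kwargs(quant: int = 0, dtype: str = "bfloat16") -> dict:
    """Assemble from_pretrained kwargs for THUDM/cogvlm2-llama3-caption.

    quant=4 or quant=8 requests bitsandbytes quantization (assumption:
    the load_in_4bit/load_in_8bit flags still accepted by transformers
    4.42.4); quant=0 loads full bf16/fp16 weights. trust_remote_code is
    needed because the model repo ships custom modeling code.
    """
    kwargs = {"trust_remote_code": True, "torch_dtype": dtype}
    if quant == 4:
        kwargs["load_in_4bit"] = True
    elif quant == 8:
        kwargs["load_in_8bit"] = True
    return kwargs

# Actual load (needs a GPU; ~16 GB VRAM unquantized, ~8 GB with quant=4):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "THUDM/cogvlm2-llama3-caption", **build_load_kwargs(quant=4)
# )
```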
Quick Install
# Install from requirements file (recommended)
pip install -r tools/caption/requirements.txt
# Or install manually
pip install torch==2.1.0 torchvision==0.16.0 transformers==4.42.4 \
    pytorchvideo==0.1.5 xformers "huggingface-hub>=0.23.0" "decord>=0.6.0" \
    pillow "timm>=0.9.16" einops "pydantic>=2.7.1" "loguru>=0.7.2"
Code Evidence
Dynamic dtype selection based on GPU capability from `tools/caption/video_caption.py:12-16`:
TORCH_TYPE = (
torch.bfloat16
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
else torch.float16
)
CUDA device detection from `tools/caption/video_caption.py:11`:
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
Python version constraint from `tools/caption/requirements.txt:2`:
# python version [3.8,3.11]
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'xformers'` | xformers not installed | `pip install xformers` (choose a build compatible with torch 2.1.0) |
| CUDA OOM loading CogVLM2 | GPU VRAM insufficient for full precision | Use `--quant 4` or `--quant 8` for quantized loading |
| `RuntimeError: bfloat16 not supported` | GPU compute capability < 8 | Code auto-detects and falls back to fp16; ensure torch is properly installed |
| Version conflict with finetuning env | Pinned torch 2.1.0 conflicts with finetuning torch >= 2.5.1 | Use a separate virtual environment for captioning |
Compatibility Notes
- Separate virtual environment required: The captioning environment pins `torch==2.1.0` and `transformers==4.42.4`, which conflict with the finetuning/inference environments. Always use a separate venv or conda environment.
- GPU Compute Capability >= 8 (Ampere+): Uses bf16 for faster processing. Older GPUs (V100, Tesla T4) automatically fall back to fp16.
- Quantization: Supports optional 4-bit or 8-bit quantization via `--quant` argument for memory-constrained GPUs.
- xformers: Required but unpinned. Must install a version compatible with torch 2.1.0.
- Python 3.12+: Not supported due to pytorchvideo and other pinned dependency constraints.
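The isolation advice above can be followed with a plain venv; a minimal sketch (the `.venv-caption` name is arbitrary):

```shell
# Create an isolated env so the pinned torch 2.1.0 stack cannot clash
# with the finetuning env's torch >= 2.5.1.
python3 -m venv .venv-caption
./.venv-caption/bin/python -m pip --version   # pip resolves inside the venv
```

Activate it with `. .venv-caption/bin/activate` before running `pip install -r tools/caption/requirements.txt`.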