# Workflow: OpenCompass VLMEvalKit Video Benchmark Evaluation
| Knowledge Sources | Details |
|---|---|
| Domains | VLM_Evaluation, Video_Understanding, Benchmarking |
| Last Updated | 2026-02-14 00:00 GMT |
## Overview
End-to-end process for evaluating Vision-Language Models on video understanding benchmarks using VLMEvalKit's video inference pipeline.
## Description
This workflow covers evaluating VLMs on video benchmarks such as MVBench, Video-MME, MMBench-Video, MLVU, LongVideoBench, and TempCompass. Video evaluation differs from image evaluation in that it requires frame sampling configuration (number of frames or frames-per-second), video data downloading, and video-specific prompt construction. The toolkit supports both local video-capable models and API-based models, with options for packing multiple questions about the same video into a single query.
## Usage
Execute this workflow when you need to evaluate a VLM's video understanding capabilities. You should have a model that supports video or multi-frame image inputs, sufficient storage for video datasets, and GPU resources appropriate for the model size. Video datasets are automatically downloaded from HuggingFace (or ModelScope if configured).
## Execution Steps
### Step 1: Installation and Video Dataset Configuration
Install VLMEvalKit and configure the environment. Video benchmarks call for extra storage planning, since video datasets are significantly larger than image datasets. Set VLMEVALKIT_USE_MODELSCOPE=1 to download datasets from ModelScope instead of HuggingFace.
Key considerations:
- Video datasets require substantially more disk space than image benchmarks
- Some video benchmarks (e.g., EgoExoBench) require additional preprocessing steps
- Ensure sufficient storage at the $LMUData path (defaults to $HOME/LMUData)
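Before kicking off a download, a quick free-space check at the data root can save a failed multi-hour run. A minimal sketch (the fallback path mirrors the $HOME/LMUData default noted above):

```python
import os
import shutil

def free_space_gb(path: str) -> float:
    """Return free disk space in GB at the nearest existing ancestor of `path`."""
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:
            break
        path = parent
    return shutil.disk_usage(path).free / 1e9

# $LMUData defaults to $HOME/LMUData when the variable is unset
lmu_root = os.environ.get("LMUData", os.path.join(os.path.expanduser("~"), "LMUData"))
print(f"{free_space_gb(lmu_root):.1f} GB free at {lmu_root}")
```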
### Step 2: Select Video-Capable Model
Choose a VLM that supports video input from the model registry. Video-capable models are listed in the video_models section of vlmeval/config.py. These include dedicated video models (VideoChat2, LLaVA-Video, PLLaVA) and multi-modal models with video support (InternVL, Qwen2-VL, GPT-4o).
Key considerations:
- Not all VLMs support video input; verify the model has video capabilities
- Video models are found in vlmeval/vlm/video_llm/ and some standard VLM adapters
- Use vlmutil check to validate the model before running
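Two quick checks before committing GPU time; the model name below is only an example, and the exact registry and CLI surface may vary across VLMEvalKit versions:

```shell
# List registered model names (video models included), assuming a recent VLMEvalKit
python -c "from vlmeval.config import supported_VLM; print('\n'.join(sorted(supported_VLM)))"

# Validate one model's environment before a long run (model name is an example)
vlmutil check Qwen2-VL-7B-Instruct
```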
### Step 3: Configure Video Benchmark Settings
Select and configure video benchmarks using pre-defined dataset settings from vlmeval/dataset/video_dataset_config.py. Key configuration parameters include the number of frames to sample (nframe) or frames-per-second (fps), and whether to pack multiple questions per video into one query (pack mode).
What happens:
- Pre-configured dataset settings combine benchmark name, frame count, and pack mode
- Example: MMBench_Video_8frame_nopack samples 8 frames without packing
- Example: Video-MME_1fps_subs samples at 1 fps with subtitles enabled
- Users can define custom configurations via JSON config file for advanced settings
- Only one of nframe or fps should be set (not both)
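For the custom-JSON route, here is a sketch of what such a config might look like, plus a check for the nframe/fps exclusivity rule. The class and field names are assumptions modeled on the pre-defined setting names above, not verbatim from video_dataset_config.py:

```python
import json

# Hypothetical custom dataset configs mirroring the pre-defined settings;
# "class" and per-dataset field names are assumptions for illustration.
custom_cfg = {
    "data": {
        "MMBench_Video_8frame_nopack": {
            "class": "MMBenchVideo",
            "dataset": "MMBench-Video",
            "nframe": 8,
            "pack": False,
        },
        "Video-MME_1fps_subs": {
            "class": "VideoMME",
            "dataset": "Video-MME",
            "fps": 1.0,
            "use_subtitle": True,
        },
    }
}

def validate(cfg: dict) -> None:
    for name, d in cfg["data"].items():
        # nframe and fps are mutually exclusive sampling controls
        assert not ("nframe" in d and "fps" in d), f"{name}: set only one of nframe/fps"

validate(custom_cfg)
print(json.dumps(custom_cfg, indent=2))
```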
### Step 4: Run Video Inference
Launch the video inference pipeline via run.py. The video inference engine (infer_data_job_video) handles video loading, frame extraction at the configured sampling rate, prompt construction with video frames, and distributed prediction generation.
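For example, a minimal launch combining the dataset settings from Step 3 (the model name is illustrative, and flags may differ between toolkit versions):

```shell
# Single-node, multi-GPU launch via torchrun (a plain `python run.py ...` also works)
torchrun --nproc-per-node=8 run.py \
    --data MMBench_Video_8frame_nopack Video-MME_1fps_subs \
    --model Qwen2-VL-7B-Instruct \
    --verbose
```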
What happens:
- Videos are loaded and frames are extracted based on nframe or fps settings
- Frames are passed to the model along with text prompts
- For pack mode, all questions about a video are combined into a single model query
- Results are saved as checkpoint files and merged across ranks
- The pipeline supports both python and torchrun launch modes
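The frame-extraction step above can be sketched as a standalone function; this is illustrative only, since VLMEvalKit's actual sampling lives in its dataset classes and may round differently:

```python
def sample_frame_indices(total_frames: int, nframe: int = 0, fps: float = 0.0,
                         video_fps: float = 30.0) -> list[int]:
    """Pick frame indices uniformly (nframe mode) or at a fixed rate (fps mode)."""
    # Exactly one of the two sampling controls may be set
    assert (nframe > 0) ^ (fps > 0), "set exactly one of nframe / fps"
    if nframe:
        # nframe indices evenly spaced across the clip, centered in each window
        step = total_frames / nframe
        return [int(step * (i + 0.5)) for i in range(nframe)]
    # fps mode: keep one frame every video_fps / fps source frames
    stride = max(int(video_fps / fps), 1)
    return list(range(0, total_frames, stride))

print(sample_frame_indices(300, nframe=8))      # 8 evenly spaced indices
print(len(sample_frame_indices(300, fps=1.0)))  # 300 frames at 30 fps = 10 s clip
```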
### Step 5: Run Video Evaluation
After inference, the evaluation pipeline runs video-specific metrics. Each video benchmark implements its own evaluate() method with specialized scoring, often including per-dimension analysis (temporal reasoning, spatial understanding, etc.).
What happens:
- Video benchmark evaluation may score across multiple dimensions (e.g., action, temporal, spatial)
- Some benchmarks (e.g., MMBench-Video) use GPT-4-turbo as a judge for multi-dimensional scoring
- Duration-based and category-based breakdowns are computed
- Results include both overall and per-category/per-dimension metrics
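The per-dimension and per-duration breakdowns amount to grouped accuracy over the judged records. A toy sketch (the record fields here are hypothetical, not a benchmark's actual schema):

```python
from collections import defaultdict

# Toy records in the shape video evaluators typically reduce over:
# one row per question, tagged with its dimension/duration and a hit flag.
records = [
    {"dimension": "temporal", "duration": "short",  "hit": 1},
    {"dimension": "temporal", "duration": "long",   "hit": 0},
    {"dimension": "spatial",  "duration": "short",  "hit": 1},
    {"dimension": "action",   "duration": "medium", "hit": 1},
]

def breakdown(rows, key):
    """Accuracy grouped by the given record field."""
    agg = defaultdict(lambda: [0, 0])  # key -> [hits, total]
    for r in rows:
        agg[r[key]][0] += r["hit"]
        agg[r[key]][1] += 1
    return {k: hits / total for k, (hits, total) in agg.items()}

print(breakdown(records, "dimension"))  # per-dimension accuracy
print(breakdown(records, "duration"))   # duration-based split
```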
### Step 6: Analyze Video Evaluation Results
Review results in the working directory. Video evaluation results typically include richer breakdowns than image benchmarks, with per-dimension, per-duration, and per-category analyses.
Key considerations:
- Check per-dimension scores for insights into model strengths and weaknesses
- Video benchmarks often have duration-based splits (short, medium, long)
- Use scripts/summarize.py to aggregate video benchmark scores