# Workflow: OpenCompass VLMEvalKit Video Benchmark Evaluation
| Knowledge Sources | Details |
|---|---|
| Domains | VLM_Evaluation, Video_Understanding, Benchmarking |
| Last Updated | 2026-02-14 00:00 GMT |
## Overview
End-to-end process for evaluating Vision-Language Models on video understanding benchmarks using VLMEvalKit's video inference pipeline.
## Description
This workflow covers evaluating VLMs on video benchmarks such as MVBench, Video-MME, MMBench-Video, MLVU, LongVideoBench, and TempCompass. Video evaluation differs from image evaluation in that it requires frame sampling configuration (number of frames or frames-per-second), video data downloading, and video-specific prompt construction. The toolkit supports both local video-capable models and API-based models, with options for packing multiple questions about the same video into a single query.
## Usage
Execute this workflow when you need to evaluate a VLM's video understanding capabilities. You should have a model that supports video or multi-frame image inputs, sufficient storage for video datasets, and GPU resources appropriate for the model size. Video datasets are automatically downloaded from HuggingFace (or ModelScope if configured).
## Execution Steps
### Step 1: Installation and Video Dataset Configuration
Install VLMEvalKit and configure the environment. Video benchmarks call for extra storage planning, since video datasets are significantly larger than image datasets. Set VLMEVALKIT_USE_MODELSCOPE=1 to download datasets from ModelScope instead of HuggingFace.
Key considerations:
- Video datasets require substantially more disk space than image benchmarks
- Some video benchmarks (e.g., EgoExoBench) require additional preprocessing steps
- Ensure sufficient storage at the $LMUData path (defaults to $HOME/LMUData)
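Before kicking off a download, a quick free-space check at the data root can save a failed multi-hour run. A minimal sketch (the fallback path mirrors the $HOME/LMUData default noted above):

```python
import os
import shutil

def free_space_gb(path: str) -> float:
    """Return free disk space in GB at the nearest existing ancestor of `path`."""
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:
            break
        path = parent
    return shutil.disk_usage(path).free / 1e9

# $LMUData defaults to $HOME/LMUData when the variable is unset
lmu_root = os.environ.get("LMUData", os.path.join(os.path.expanduser("~"), "LMUData"))
print(f"{free_space_gb(lmu_root):.1f} GB free at {lmu_root}")
```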
### Step 2: Select Video-Capable Model
Choose a VLM that supports video input from the model registry. Video-capable models are listed in the video_models section of vlmeval/config.py. These include dedicated video models (VideoChat2, LLaVA-Video, PLLaVA) and multi-modal models with video support (InternVL, Qwen2-VL, GPT-4o).
Key considerations:
- Not all VLMs support video input; verify the model has video capabilities
- Video models are found in vlmeval/vlm/video_llm/ and some standard VLM adapters
- Use vlmutil check to validate the model before running
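Two quick checks before committing GPU time; the model name below is only an example, and the exact registry and CLI surface may vary across VLMEvalKit versions:

```shell
# List registered model names (video models included), assuming a recent VLMEvalKit
python -c "from vlmeval.config import supported_VLM; print('\n'.join(sorted(supported_VLM)))"

# Validate one model's environment before a long run (model name is an example)
vlmutil check Qwen2-VL-7B-Instruct
```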
### Step 3: Configure Video Benchmark Settings
Select and configure video benchmarks using pre-defined dataset settings from vlmeval/dataset/video_dataset_config.py. Key configuration parameters include the number of frames to sample (nframe) or frames-per-second (fps), and whether to pack multiple questions per video into one query (pack mode).
What happens:
- Pre-configured dataset settings combine benchmark name, frame count, and pack mode
- Example: MMBench_Video_8frame_nopack samples 8 frames without packing
- Example: Video-MME_1fps_subs samples at 1 fps with subtitles enabled
- Users can define custom configurations via JSON config file for advanced settings
- Only one of nframe or fps should be set (not both)
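For the custom-JSON route, here is a sketch of what such a config might look like, plus a check for the nframe/fps exclusivity rule. The class and field names are assumptions modeled on the pre-defined setting names above, not verbatim from video_dataset_config.py:

```python
import json

# Hypothetical custom dataset configs mirroring the pre-defined settings;
# "class" and per-dataset field names are assumptions for illustration.
custom_cfg = {
    "data": {
        "MMBench_Video_8frame_nopack": {
            "class": "MMBenchVideo",
            "dataset": "MMBench-Video",
            "nframe": 8,
            "pack": False,
        },
        "Video-MME_1fps_subs": {
            "class": "VideoMME",
            "dataset": "Video-MME",
            "fps": 1.0,
            "use_subtitle": True,
        },
    }
}

def validate(cfg: dict) -> None:
    for name, d in cfg["data"].items():
        # nframe and fps are mutually exclusive sampling controls
        assert not ("nframe" in d and "fps" in d), f"{name}: set only one of nframe/fps"

validate(custom_cfg)
print(json.dumps(custom_cfg, indent=2))
```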
### Step 4: Run Video Inference
Launch the video inference pipeline via run.py. The video inference engine (infer_data_job_video) handles video loading, frame extraction at the configured sampling rate, prompt construction with video frames, and distributed prediction generation.
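For example, a minimal launch combining the dataset settings from Step 3 (the model name is illustrative, and flags may differ between toolkit versions):

```shell
# Single-node, multi-GPU launch via torchrun (a plain `python run.py ...` also works)
torchrun --nproc-per-node=8 run.py \
    --data MMBench_Video_8frame_nopack Video-MME_1fps_subs \
    --model Qwen2-VL-7B-Instruct \
    --verbose
```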
What happens:
- Videos are loaded and frames are extracted based on nframe or fps settings
- Frames are passed to the model along with text prompts
- For pack mode, all questions about a video are combined into a single model query
- Results are saved as checkpoint files and merged across ranks
- The pipeline supports both python and torchrun launch modes
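The frame-extraction step above can be sketched as a standalone function; this is illustrative only, since VLMEvalKit's actual sampling lives in its dataset classes and may round differently:

```python
def sample_frame_indices(total_frames: int, nframe: int = 0, fps: float = 0.0,
                         video_fps: float = 30.0) -> list[int]:
    """Pick frame indices uniformly (nframe mode) or at a fixed rate (fps mode)."""
    # Exactly one of the two sampling controls may be set
    assert (nframe > 0) ^ (fps > 0), "set exactly one of nframe / fps"
    if nframe:
        # nframe indices evenly spaced across the clip, centered in each window
        step = total_frames / nframe
        return [int(step * (i + 0.5)) for i in range(nframe)]
    # fps mode: keep one frame every video_fps / fps source frames
    stride = max(int(video_fps / fps), 1)
    return list(range(0, total_frames, stride))

print(sample_frame_indices(300, nframe=8))      # 8 evenly spaced indices
print(len(sample_frame_indices(300, fps=1.0)))  # 300 frames at 30 fps = 10 s clip
```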
### Step 5: Run Video Evaluation
After inference, the evaluation pipeline runs video-specific metrics. Each video benchmark implements its own evaluate() method with specialized scoring, often including per-dimension analysis (temporal reasoning, spatial understanding, etc.).
What happens:
- Video benchmark evaluation may score across multiple dimensions (e.g., action, temporal, spatial)
- Some benchmarks (e.g., MMBench-Video) use GPT-4-turbo as a judge for multi-dimensional scoring
- Duration-based and category-based breakdowns are computed
- Results include both overall and per-category/per-dimension metrics
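The per-dimension and per-duration breakdowns amount to grouped accuracy over the judged records. A toy sketch (the record fields here are hypothetical, not a benchmark's actual schema):

```python
from collections import defaultdict

# Toy records in the shape video evaluators typically reduce over:
# one row per question, tagged with its dimension/duration and a hit flag.
records = [
    {"dimension": "temporal", "duration": "short",  "hit": 1},
    {"dimension": "temporal", "duration": "long",   "hit": 0},
    {"dimension": "spatial",  "duration": "short",  "hit": 1},
    {"dimension": "action",   "duration": "medium", "hit": 1},
]

def breakdown(rows, key):
    """Accuracy grouped by the given record field."""
    agg = defaultdict(lambda: [0, 0])  # key -> [hits, total]
    for r in rows:
        agg[r[key]][0] += r["hit"]
        agg[r[key]][1] += 1
    return {k: hits / total for k, (hits, total) in agg.items()}

print(breakdown(records, "dimension"))  # per-dimension accuracy
print(breakdown(records, "duration"))   # duration-based split
```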
### Step 6: Analyze Video Evaluation Results
Review results in the working directory. Video evaluation results typically include richer breakdowns than image benchmarks, with per-dimension, per-duration, and per-category analyses.
Key considerations:
- Check per-dimension scores for insights into model strengths and weaknesses
- Video benchmarks often have duration-based splits (short, medium, long)
- Use scripts/summarize.py to aggregate video benchmark scores