Heuristic:Open compass VLMEvalKit Video Frame Sampling Configuration
| Knowledge Sources | |
|---|---|
| Domains | Video_Understanding, Optimization |
| Last Updated | 2026-02-14 01:30 GMT |
Overview
Configuration guidelines for video frame sampling in VLMEvalKit: choosing between `nframe` (fixed count) and `fps` (frames per second), handling pack mode, and resolving model-dataset parameter conflicts.
Description
VLMEvalKit supports two mutually exclusive video frame sampling strategies: `nframe` (extract a fixed number of frames) and `fps` (extract frames at a given rate). The dataset configuration specifies the default, but the model may have its own preferences. The video inference pipeline includes negotiation logic that overrides model defaults with dataset settings when they conflict, with special handling for specific model families: the Qwen2-VL family (Qwen2-VL, Qwen2.5-VL, Qwen2.5-Omni) and Gemini.
Usage
Apply this heuristic when configuring video benchmark evaluations. Understanding the nframe/fps interaction prevents `ValueError` exceptions and ensures correct frame sampling for each model-dataset combination.
The Insight (Rule of Thumb)
- Rule 1: `nframe` and `fps` are mutually exclusive. Setting both to positive values raises `ValueError` (`run.py:90`).
- Rule 2: At least one of `nframe` or `fps` must be positive. If both are zero, negative, or unset, a `ValueError` is raised (`run.py:91-92`).
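- Rule 3: Dataset settings override model defaults. If the dataset specifies `nframe=16` but the model defaults to `nframe=8`, the model's `nframe` is set to 16 and a message is printed. If the dataset does not set `nframe`, a model whose `fps` attribute is missing or 0 raises `ValueError`; otherwise the model's `nframe` is cleared to `None`.
- Rule 4: Qwen2-VL, Qwen2.5-VL, and Qwen2.5-Omni use their own internal frame sampling; when the model's own `nframe` or `fps` is unset, the dataset's value is ignored and a notice is printed.
- Rule 5: Gemini with the `genai` backend does not support `nframe`; when the dataset sets `nframe > 0`, the model's `VIDEO_LLM` flag is set to `False` so the video is passed as multi-image input. Otherwise the model's `fps` is reset to match the dataset's `fps`.
- Rule 6: Gemini with the `vertex` backend does not support video input at all; it always uses the multi-image fallback.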
- Trade-off: Higher `nframe`/`fps` improves temporal coverage but increases memory usage and API costs.
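Rules 1 and 2 can be checked before launching a run. The following is a minimal sketch that mirrors the two checks quoted under Code Evidence; the helper name `check_video_sampling` and the plain `dict` input are assumptions made for illustration and are not part of VLMEvalKit.

```python
def check_video_sampling(params: dict) -> str:
    """Return which strategy a config selects; raise ValueError if it is invalid.

    Hypothetical helper: it reproduces the two conditions from the run.py
    excerpt, treating a missing key as 0.
    """
    fps = params.get('fps', 0)
    nframe = params.get('nframe', 0)
    # Rule 1: the two strategies are mutually exclusive.
    if fps > 0 and nframe > 0:
        raise ValueError('fps and nframe should not be set at the same time')
    # Rule 2: at least one of them must be positive.
    if fps <= 0 and nframe <= 0:
        raise ValueError('fps and nframe should be set at least one valid value')
    return 'fps' if fps > 0 else 'nframe'


if __name__ == '__main__':
    print(check_video_sampling({'nframe': 8}))  # nframe
    print(check_video_sampling({'fps': 1.0}))   # fps
```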
Reasoning
Different video models handle temporal information differently. Some models natively process video streams (e.g., Gemini with fps), while others treat videos as sequences of images (e.g., most local VLMs with nframe). The negotiation logic ensures each model receives frames in its preferred format while respecting the dataset's evaluation protocol. The Qwen2-VL family has its own optimized frame extraction that should not be overridden.
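To make the difference between the two strategies concrete, the sketch below computes which frame indices each one would select. It is a generic illustration of uniform sampling, not VLMEvalKit's actual extraction code; the function name `sample_indices` and its parameters are assumptions.

```python
def sample_indices(total_frames: int, video_fps: float,
                   nframe: int = 0, fps: float = 0) -> list[int]:
    """Pick frame indices by fixed count (nframe) or by rate (fps).

    Illustrative only: real datasets may place frames differently.
    """
    # Exactly one strategy may be active, matching Rules 1 and 2.
    if (nframe > 0) == (fps > 0):
        raise ValueError('set exactly one of nframe or fps')
    if nframe > 0:
        # Fixed count: spread nframe samples evenly, skipping both endpoints.
        step = total_frames / (nframe + 1)
        idx = [int(i * step) for i in range(1, nframe + 1)]
    else:
        # Fixed rate: the number of samples grows with the video's duration.
        duration = total_frames / video_fps
        count = int(duration * fps)
        step = video_fps / fps
        idx = [int(i * step) for i in range(count)]
    # Guard against rounding past the last frame.
    return [min(i, total_frames - 1) for i in idx]


if __name__ == '__main__':
    # A 60-second clip recorded at 30 frames per second (1800 frames).
    print(len(sample_indices(1800, 30.0, nframe=8)))  # 8 frames
    print(len(sample_indices(1800, 30.0, fps=1.0)))   # 60 frames
```

With `nframe` the cost per video is constant regardless of length, while with `fps` it scales with duration, which is where the memory and API cost trade-off comes from.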
Code Evidence
Mutual exclusivity check from `run.py:89-92`:
    if cls.MODALITY == 'VIDEO':
        if valid_params.get('fps', 0) > 0 and valid_params.get('nframe', 0) > 0:
            raise ValueError('fps and nframe should not be set at the same time')
        if valid_params.get('fps', 0) <= 0 and valid_params.get('nframe', 0) <= 0:
            raise ValueError('fps and nframe should be set at least one valid value')
Model-dataset nframe negotiation from `vlmeval/inference_video.py:139-156`:
    if getattr(model, 'nframe', None) is not None and getattr(model, 'nframe', 0) > 0:
        if dataset.nframe > 0:
            if getattr(model, 'nframe', 0) != dataset.nframe:
                print(f'{model_name} is a video-llm model, nframe is set to {dataset.nframe}, not using default')
                setattr(model, 'nframe', dataset.nframe)
        elif getattr(model, 'fps', 0) == 0:
            raise ValueError(f'fps is not suitable for {model_name}')
        else:
            setattr(model, 'nframe', None)
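The three outcomes of this negotiation can be exercised in isolation. The sketch below wraps the same branching in a function and drives it with `SimpleNamespace` stand-ins; the function name `negotiate_nframe`, the stand-in objects, and the model name `demo-model` are hypothetical, and the indentation of the branches is reconstructed from the excerpt.

```python
from types import SimpleNamespace


def negotiate_nframe(model, dataset, model_name):
    """Apply the nframe negotiation from the excerpt to stand-in objects."""
    if getattr(model, 'nframe', None) is not None and getattr(model, 'nframe', 0) > 0:
        if dataset.nframe > 0:
            # Dataset wins: overwrite the model default when they differ.
            if getattr(model, 'nframe', 0) != dataset.nframe:
                print(f'{model_name} is a video-llm model, '
                      f'nframe is set to {dataset.nframe}, not using default')
                setattr(model, 'nframe', dataset.nframe)
        elif getattr(model, 'fps', 0) == 0:
            # Dataset does not use nframe, and the model cannot sample by fps.
            raise ValueError(f'fps is not suitable for {model_name}')
        else:
            # Dataset does not use nframe and the model supports fps: drop nframe.
            setattr(model, 'nframe', None)
    return model


if __name__ == '__main__':
    fps_dataset = SimpleNamespace(nframe=0, fps=1.0)
    nframe_dataset = SimpleNamespace(nframe=16, fps=0)

    m = negotiate_nframe(SimpleNamespace(nframe=8, fps=0), nframe_dataset, 'demo-model')
    print(m.nframe)  # 16

    m = negotiate_nframe(SimpleNamespace(nframe=8, fps=2.0), fps_dataset, 'demo-model')
    print(m.nframe)  # None
```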
Gemini genai backend handling from `vlmeval/inference_video.py:29-40`:
    if getattr(model, 'backend', None) == 'genai':
        if dataset.nframe > 0:
            print(
                'Gemini model (with genai backend) does not support nframe, '
                'will set its VIDEO_LLM to False to enable multi-image input for video.'
            )
            setattr(model, 'VIDEO_LLM', False)
        else:
            print('Gemini model (with genai backend) is a video-llm, '
                  'will reset fps setting in model to match the dataset.')
            setattr(model, 'fps', dataset.fps)
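As a rough illustration of Rule 5, the sketch below applies the same two branches to stand-in objects. It takes the excerpt at face value: the function name `configure_gemini_genai` and the `SimpleNamespace` objects are assumptions, and any surrounding check that identifies the model as Gemini is not shown in the excerpt and is therefore omitted here.

```python
from types import SimpleNamespace


def configure_gemini_genai(model, dataset):
    """Mirror the genai-backend branch: multi-image mode for nframe, else sync fps."""
    if getattr(model, 'backend', None) == 'genai':
        if dataset.nframe > 0:
            # nframe is unsupported: fall back to passing frames as images.
            setattr(model, 'VIDEO_LLM', False)
        else:
            # Native video input: adopt the dataset's sampling rate.
            setattr(model, 'fps', dataset.fps)
    return model


if __name__ == '__main__':
    model = SimpleNamespace(backend='genai', VIDEO_LLM=True, fps=None)
    configure_gemini_genai(model, SimpleNamespace(nframe=8, fps=0))
    print(model.VIDEO_LLM)  # False
```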
Qwen2-VL special handling from `vlmeval/inference_video.py:157-165`:
    if (
        'Qwen2-VL' in model_name
        or 'Qwen2.5-VL' in model_name
        or 'Qwen2.5-Omni' in model_name
    ):
        if getattr(model, 'nframe', None) is None and dataset.nframe > 0:
            print(f'using {model_name} default setting for video, dataset.nframe is ommitted')
        if getattr(model, 'fps', None) is None and dataset.fps > 0:
            print(f'using {model_name} default setting for video, dataset.fps is ommitted')
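Because the Qwen check is a plain substring match on `model_name`, it can help to verify which names fall under Rule 4. The sketch below is a hypothetical helper (`uses_internal_sampling` is not a VLMEvalKit function) and the model names passed to it are only examples.

```python
# Substrings taken from the excerpt above.
QWEN_INTERNAL_SAMPLING_MARKERS = ('Qwen2-VL', 'Qwen2.5-VL', 'Qwen2.5-Omni')


def uses_internal_sampling(model_name: str) -> bool:
    """Return True if the name matches one of the substrings (case-sensitive)."""
    return any(marker in model_name for marker in QWEN_INTERNAL_SAMPLING_MARKERS)


if __name__ == '__main__':
    print(uses_internal_sampling('Qwen2-VL-7B-Instruct'))  # True
    print(uses_internal_sampling('Qwen-VL-Chat'))          # False
    print(uses_internal_sampling('qwen2-vl-7b'))           # False (match is case-sensitive)
```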