Implementation:Open compass VLMEvalKit MVBench Utils
| Field | Value |
|---|---|
| source | VLMEvalKit |
| domain | Vision, Evaluation, Video Understanding, Multiple Choice |
Overview
Provides evaluation utilities for the MVBench video understanding benchmark: per-task-type (dimension) scoring and answer checking, with an optional LLM-as-judge step.
Description
This module implements get_dimension_rating, which computes per-task-type accuracy breakdowns from a scored data file, and check_ans, which compares a predicted multiple-choice answer against the ground-truth option. check_ans matches options flexibly: it extracts the first word of the prediction and of the ground truth, removes periods, and compares case-insensitively. check_ans_with_model extends this with LLM-based answer verification for cases where simple string matching is insufficient. get_dimension_rating aggregates results by task type and formats each accuracy as a percentage.
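The first-word matching described above can be illustrated with a minimal sketch. This is not the VLMEvalKit source: the function name `match_option_sketch` is hypothetical, and the exact normalization (whitespace splitting, stripping every period from the first word, strict equality) is an assumption based only on the description.

```python
def match_option_sketch(pred: str, gt: str) -> bool:
    """Hypothetical stand-in for check_ans; illustrates first-word matching only."""
    # Lowercase and split on whitespace so "A. cat" becomes ["a.", "cat"].
    pred_words = pred.lower().strip().split()
    gt_words = gt.lower().strip().split()
    if not pred_words or not gt_words:
        return False  # nothing to compare

    # Keep only the option token and drop periods: "a." -> "a".
    pred_option = pred_words[0].replace(".", "")
    gt_option = gt_words[0].replace(".", "")

    # An empty token (e.g. the input was just ".") never counts as a match.
    return bool(pred_option) and pred_option == gt_option
```

Under these assumptions, a prediction such as "A. cat" matches a ground truth of "A cat" because both reduce to the option token "a", while the text after the option letter is ignored.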
Usage
Called internally by the MVBench dataset class during video understanding evaluation.
Code Reference
- Source: vlmeval/dataset/utils/mvbench.py, Lines: L1-509
- Import: from vlmeval.dataset.utils.mvbench import get_dimension_rating, check_ans
Key Functions:
def get_dimension_rating(data_path): ...
def check_ans(pred, gt): ...
def check_ans_with_model(pred, gt, model, item, dataset_name='MVBench'): ...
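The relationship between plain string matching and LLM-based verification might look like the sketch below. Everything here is an assumption for illustration: `check_with_fallback`, `first_option`, and the `judge` callable are hypothetical names, and the rule for deciding that string matching is "insufficient" (no recognizable option letter in either string) is a guess, not the behavior of check_ans_with_model.

```python
import re
from typing import Callable, Optional

# Assumed option formats: "a", "a.", "a:", "(a)".
OPTION_RE = re.compile(r"^\(?([a-z])\)?[.:]?$")


def first_option(text: str) -> Optional[str]:
    """Return the option letter if the first word looks like one, else None."""
    words = text.lower().strip().split()
    if not words:
        return None
    match = OPTION_RE.match(words[0])
    return match.group(1) if match else None


def check_with_fallback(pred: str, gt: str,
                        judge: Callable[[str, str], bool]) -> bool:
    """Hypothetical two-stage check: string match first, judge otherwise."""
    pred_option = first_option(pred)
    gt_option = first_option(gt)
    if pred_option is not None and gt_option is not None:
        # Both sides expose an option letter, so no model call is needed.
        return pred_option == gt_option
    # Otherwise defer to the judge (for example, a wrapper around an LLM).
    return judge(pred, gt)
```

One limitation of this sketch: a free-form prediction that happens to start with a single letter, such as "A cat is sitting", is treated as choosing option "a" and never reaches the judge.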
I/O Contract
| Direction | Description |
|---|---|
| Inputs | Scored data file path for dimension rating; predicted and ground-truth answer strings for answer checking |
| Outputs | Dictionary mapping task types to [correct, total, percentage] lists; boolean correctness for individual answers |
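The output shape in the table can be illustrated with a small aggregation sketch. It is not the library's implementation: the function name `rate_by_task_type` is hypothetical, the input is a plain list of dicts rather than a scored file, and the field names `task_type` and `score` as well as the two-decimal percentage format are assumptions.

```python
from collections import defaultdict


def rate_by_task_type(rows):
    """Hypothetical aggregation: task type -> [correct, total, percentage]."""
    board = defaultdict(lambda: [0, 0])  # task type -> [correct, total]
    for row in rows:
        counts = board[row["task_type"]]
        counts[1] += 1          # every row counts toward the total
        if row["score"]:
            counts[0] += 1      # truthy score means the answer was correct

    # Append the accuracy as a formatted percentage string.
    return {
        task: [correct, total, f"{correct / total * 100:.2f}%"]
        for task, (correct, total) in board.items()
    }


rows = [
    {"task_type": "Action Sequence", "score": 1},
    {"task_type": "Action Sequence", "score": 0},
    {"task_type": "Object Existence", "score": 1},
]
ratings = rate_by_task_type(rows)
```

With the three sample rows, `ratings["Action Sequence"]` is `[1, 2, "50.00%"]` and `ratings["Object Existence"]` is `[1, 1, "100.00%"]`.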
Usage Examples
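# Internal usage example
from vlmeval.dataset.utils.mvbench import get_dimension_rating, check_ans

# Per-task-type accuracy breakdown from a scored data file
results = get_dimension_rating("scores.xlsx")
# Compare a predicted option against the ground-truth option
is_correct = check_ans("A. cat", "A cat")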