Implementation:Open compass VLMEvalKit CGBench Utils
| Field | Value |
|---|---|
| source | VLMEvalKit |
| domain | Vision, Evaluation, Video Understanding, Clue-Grounded |
Overview
Provides evaluation utilities for the CGBench (Clue-Grounded Video Understanding) benchmark, including open-ended answer evaluation with LLM-as-judge.
Description
This module implements evaluation functions for CGBench's video understanding tasks across multiple domains (Life Record, Music/TV, Driving, etc.) and duration categories. Open-ended answers are scored with a two-step LLM-based evaluation: the judge first compares the model prediction against the ground-truth answer using text only, and ambiguous cases can then optionally be re-judged with visual information from the clue intervals. Key components include extract_answer_from_item for multiple-choice answer extraction and the system prompts for the LLM judge (sys_prompt_open_eval_step_1, sys_prompt_open_eval_step_2).
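The two-step flow described above can be sketched roughly as follows. This is a minimal illustration and not the module's code: the function name `two_step_open_eval`, the verdict labels, and the judge callables are all assumptions introduced here, and the stub judges stand in for real LLM calls.

```python
from typing import Callable, Optional

# Hypothetical verdict labels; the real judge prompts may use a different output format.
CORRECT, WRONG, AMBIGUOUS = "correct", "wrong", "ambiguous"

Judge = Callable[[str, str, str], str]


def two_step_open_eval(
    question: str,
    prediction: str,
    ground_truth: str,
    text_judge: Judge,
    visual_judge: Optional[Judge] = None,
) -> int:
    """Return 1 if the prediction is judged correct, otherwise 0."""
    # Step 1: text-only comparison of the prediction with the ground truth.
    verdict = text_judge(question, prediction, ground_truth)
    # Step 2: only for ambiguous cases, and only if a visual judge is supplied.
    if verdict == AMBIGUOUS and visual_judge is not None:
        verdict = visual_judge(question, prediction, ground_truth)
    # Assumption: anything other than an explicit "correct" is scored 0.
    return 1 if verdict == CORRECT else 0


# Stub judges standing in for real LLM calls.
def exact_or_unsure(question: str, prediction: str, ground_truth: str) -> str:
    same = prediction.strip().lower() == ground_truth.strip().lower()
    return CORRECT if same else AMBIGUOUS


def always_correct(question: str, prediction: str, ground_truth: str) -> str:
    return CORRECT


score_exact = two_step_open_eval("Q?", "A red car", "a red car", exact_or_unsure)
score_text_only = two_step_open_eval("Q?", "a crimson car", "a red car", exact_or_unsure)
score_with_visual = two_step_open_eval(
    "Q?", "a crimson car", "a red car", exact_or_unsure, always_correct
)
print(score_exact, score_text_only, score_with_visual)
```

Passing the judges in as callables keeps the control flow testable without a model; in practice each callable would wrap a prompt to the judge model.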
Usage
Called internally by the CGBench dataset class during evaluation.
Code Reference
- Source: vlmeval/dataset/utils/cgbench.py, Lines: L1-620
- Import: from vlmeval.dataset.utils.cgbench import get_dimension_rating
Key Functions:
def get_dimension_rating(data_path): ...
def check_ans(pred, gt): ...
def evaluate_open_ended(question, response, ground_truth, model): ...
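The signature check_ans(pred, gt) suggests a comparison between a predicted answer and a ground-truth answer. As a rough illustration only, and not the module's actual logic, a multiple-choice comparison might pull the first standalone option letter out of the prediction and compare it with the ground-truth letter. The names `extract_choice_letter` and `is_choice_correct`, and the A-H option range, are hypothetical.

```python
import re
from typing import Optional


def extract_choice_letter(text: str, options: str = "ABCDEFGH") -> Optional[str]:
    """Return the first standalone uppercase option letter in `text`, or None.

    Limitation of this sketch: a sentence that starts with the article "A"
    would be read as option A.
    """
    match = re.search(rf"\b([{options}])\b", text)
    return match.group(1) if match else None


def is_choice_correct(pred: str, gt: str) -> bool:
    """Compare the extracted option letter with the ground-truth letter."""
    letter = extract_choice_letter(pred)
    return letter is not None and letter == gt.strip().upper()


print(extract_choice_letter("(B) the car"))
print(is_choice_correct("Answer: C.", "c"))
```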
I/O Contract
| Direction | Description |
|---|---|
| Inputs | Model predictions, ground-truth answers, question text, and optionally video frame paths for visual grounding |
| Outputs | Scores (0 or 1) per question; aggregated accuracy by domain, duration, and task type as dictionaries |
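To make the output side of the contract concrete, the sketch below groups per-question 0/1 scores by a metadata field and averages them into a dictionary. It is an assumed illustration: the helper `accuracy_by`, the record field names, and the sample values are invented here and do not reflect the actual schema used by get_dimension_rating.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def accuracy_by(records: Iterable[dict], key: str) -> Dict[str, float]:
    """Mean of the 0/1 scores for each distinct value of `key`."""
    buckets: Dict[str, List[int]] = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(rec["score"])
    return {name: sum(scores) / len(scores) for name, scores in buckets.items()}


# Made-up records; field names and values are illustrative only.
records = [
    {"domain": "Life Record", "duration": "short", "score": 1},
    {"domain": "Life Record", "duration": "long", "score": 0},
    {"domain": "Driving", "duration": "short", "score": 1},
    {"domain": "Driving", "duration": "short", "score": 1},
]

by_domain = accuracy_by(records, "domain")
by_duration = accuracy_by(records, "duration")
print(by_domain)
print(by_duration)
```

The same helper could be called once per grouping field (domain, duration, task type) to build the per-dimension dictionaries.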
Usage Examples
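# Internal usage example.
from vlmeval.dataset.utils.cgbench import get_dimension_rating

# Placeholder path (an assumption): point this at the scored results file produced during evaluation.
data_path = "path/to/cgbench_scored_results"
results = get_dimension_rating(data_path)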