Implementation:EvolvingLMMs Lab Lmms eval VMCBench Utils

Source File: `lmms_eval/tasks/vmcbench/utils.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The VMCBench Utils module provides evaluation functions for the VMCBench (Visual Multi-Choice Benchmark), which aggregates multiple existing benchmarks into a unified evaluation framework. It handles multiple-choice question parsing, answer extraction with multiple strategies, and category-based performance analysis across general, reasoning, OCR, and document understanding tasks.

Key Functions

Document Processing

vmcbench_doc_to_visual(doc)

Prepares image for model input

Converts document image to RGB format
Returns list containing single image
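
A minimal sketch of this step, assuming the image is stored under an `image` key and exposes a PIL-style `convert` method. The key name and the `FakeImage` stand-in are hypothetical; the stand-in only exists so the example runs without extra dependencies.

```python
class FakeImage:
    """Hypothetical stand-in for a PIL image; it only mimics convert()."""

    def __init__(self, mode):
        self.mode = mode

    def convert(self, mode):
        # Like PIL's Image.convert, return a new image in the requested mode.
        return FakeImage(mode)


def doc_to_visual_sketch(doc):
    # Assumption: the image lives under the "image" key.
    return [doc["image"].convert("RGB")]


visuals = doc_to_visual_sketch({"image": FakeImage("RGBA")})
print(len(visuals), visuals[0].mode)
```

Returning a one-element list keeps the interface uniform with tasks that supply several images per sample.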

vmcbench_doc_to_text(doc, lmms_eval_specific_kwargs=None)

Formats multiple-choice question with options

Extracts question from document
Constructs options dictionary from fields A, B, C, D
Formats options as a lettered list (A through D)
Applies optional pre_prompt and post_prompt from kwargs
Returns formatted question string with options
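
The steps above can be sketched as follows. This is an illustration, not the module's code: the `question` key, the `A. text` line layout, and the empty-string defaults for the prompts are assumptions.

```python
def doc_to_text_sketch(doc, lmms_eval_specific_kwargs=None):
    kwargs = lmms_eval_specific_kwargs or {}
    # Option texts are read from the fields A, B, C, D.
    options = {letter: doc[letter] for letter in "ABCD"}
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    # Assumption: missing prompts default to empty strings.
    pre_prompt = kwargs.get("pre_prompt", "")
    post_prompt = kwargs.get("post_prompt", "")
    return f"{pre_prompt}{doc['question']}\n{option_lines}{post_prompt}"


doc = {"question": "What color is the sky?", "A": "Red", "B": "Blue", "C": "Green", "D": "Black"}
prompt = doc_to_text_sketch(doc, {"post_prompt": "\nAnswer with the option letter."})
print(prompt)
```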

Answer Parsing

parse_multi_choice_response(response, all_choices, index2ans)

Parses prediction to extract answer choice

Normalizes the response by stripping surrounding punctuation and padding it with spaces, so that markers at the start or end of the text can still match
Uses multiple extraction strategies in order:

Bracketed format: Searches for "(A)", "(B)", etc.
Dotted format: Searches for "A. ", "B. ", etc.
Spaced format: Searches for " A ", " B ", etc.
Content matching: Searches for full answer text in response
Random fallback: If no match found, randomly selects a choice

Handles multiple candidates by finding last occurrence
Tracks whether index or content was matched
Returns predicted choice letter
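
A compact sketch of this waterfall, written from the description above rather than copied from the module. The function name, the set of stripped punctuation characters, and the `rng` parameter are assumptions added for illustration.

```python
import random


def parse_choice_sketch(response, all_choices, index2ans, rng=random):
    """Waterfall parser: letter formats first, then answer text, then a random pick."""
    # Strip surrounding punctuation, then pad so " A " can match at either end.
    response = " " + response.strip(" ,.!?;:'") + " "

    needles = {}
    for template in ("({})", "{}. ", " {} "):  # bracketed, dotted, spaced
        needles = {c: template.format(c) for c in all_choices if template.format(c) in response}
        if needles:
            break
    else:
        # No letter marker matched: look for the full answer text, ignoring case.
        response = response.lower()
        needles = {c: a.lower() for c, a in index2ans.items() if a and a.lower() in response}

    if not needles:
        return rng.choice(all_choices)  # random fallback
    # With several candidates, keep the one whose match starts latest.
    return max(needles, key=lambda c: response.rfind(needles[c]))


choices = ["A", "B", "C", "D"]
index2ans = {"A": "Red", "B": "Blue", "C": "Green", "D": "Black"}
print(parse_choice_sketch("I first thought (A), but it is (C).", choices, index2ans))
print(parse_choice_sketch("It looks blue to me", choices, index2ans))
```

In this sketch the first call resolves to C through the bracketed format and the last-occurrence rule, and the second resolves to B through content matching. The `needles` dictionary records which string matched for each candidate, which is one way to track whether a letter marker or the answer text was found.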

Results Processing

vmcbench_process_results(doc, results)

Processes model prediction and maps to categories

Extracts response from results list
Creates all_choices list: ['A', 'B', 'C', 'D']
Builds index2ans mapping from document
Parses response to extract predicted choice
Compares with ground truth answer
Computes binary score (1 or 0)
Maps dataset category to main category using datasets_category_map
Returns dictionary with:
- Main category metric (general, reason, ocr, doc)
- Average metric (across all categories)
- Each containing question ID, category, and score
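
The shape of the returned dictionary can be sketched like this, starting from an already-parsed prediction. The document keys (`index`, `category`, `answer`), the `average` metric name, and the four-entry map are assumptions chosen for illustration; only the category labels come from the text above.

```python
# Hypothetical four-entry excerpt of the dataset-to-category map.
CATEGORY_MAP_EXCERPT = {"MMStar": "general", "MathVista": "reason", "TextVQA": "ocr", "DocVQA": "doc"}


def build_result_sketch(doc, pred):
    """Score a parsed prediction and file it under two metrics."""
    score = 1 if pred == doc["answer"] else 0
    main_category = CATEGORY_MAP_EXCERPT[doc["category"]]
    entry = {"index": doc["index"], "category": main_category, "score": score}
    # The same entry feeds the per-category metric and the overall average.
    return {main_category: entry, "average": entry}


sample = {"index": 7, "category": "TextVQA", "answer": "B"}
result = build_result_sketch(sample, "B")
print(result)
```

Reporting each sample under both its main category and the average lets one pass over the data produce per-category and overall accuracy.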

Aggregation

vmcbench_aggregate_results(results)

Aggregates scores across samples

Extracts score from each result dictionary
Computes mean score
Returns average accuracy
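
As a sketch, aggregation reduces to a mean over the per-sample scores. The `score` key follows the description above; returning 0.0 for an empty list is an assumption of this example, not documented behavior.

```python
def aggregate_sketch(results):
    scores = [r["score"] for r in results]
    # Assumption: an empty result list yields 0.0 instead of raising.
    return sum(scores) / len(scores) if scores else 0.0


accuracy = aggregate_sketch([{"score": 1}, {"score": 0}, {"score": 1}, {"score": 1}])
print(accuracy)
```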

Category Mapping

The module maps individual benchmark datasets to four main categories:

General Category

SEEDBench, MMStar, A-OKVQA
VizWiz, MMVet, VQAv2, OKVQA

Reasoning Category

MMMU, MathVista, ScienceQA
RealWorldQA, GQA, MathVision

OCR Category

TextVQA, OCRVQA

Document Understanding Category

AI2D, ChartQA, DocVQA
InfoVQA, TableVQABench

Parsing Strategy

The parse_multi_choice_response function implements a waterfall strategy:

Format-based matching: Looks for explicit answer markers
- Bracketed: "(A)", "(B)", "(C)", "(D)"
- Dotted: "A. ", "B. ", "C. ", "D. "
- Spaced: " A ", " B ", " C ", " D "
Content-based matching: Searches for full answer text (case-insensitive)
Conflict resolution: If multiple candidates found, uses last occurrence
Random fallback: Ensures all samples have a prediction

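This multi-strategy approach handles diverse model output formats while guaranteeing exactly one choice letter per sample. Because the final fallback is random, scores for unparseable responses can vary between runs unless the random seed is fixed.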
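
The conflict-resolution rule can be illustrated in isolation. The helper name is hypothetical, and this sketch uses `str.rfind` with the built-in `max` to stay dependency-free, whereas the module is described as using numpy for this step.

```python
def last_mentioned(response, markers):
    """Return the choice whose marker starts latest in the response, or None."""
    positions = {choice: response.rfind(marker) for choice, marker in markers.items()}
    found = {choice: pos for choice, pos in positions.items() if pos >= 0}
    return max(found, key=found.get) if found else None


markers = {c: f"({c})" for c in "ABCD"}
winner = last_mentioned("At first (A) seems right, but the chart shows (C).", markers)
print(winner)
```

Preferring the last marker reflects the common pattern where a model reasons through alternatives before stating its final answer.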

Design Characteristics

Multi-Benchmark Aggregation: Unifies evaluation across 20 different datasets
Robust Parsing: Multiple strategies for extracting answers from free-form responses
Category Analysis: Groups datasets by capability type for targeted analysis
Fallback Handling: Ensures all samples produce predictions via random selection
Position-Aware: Uses last occurrence when multiple answer markers present
Flexible Prompting: Supports pre/post prompts for experimental variations

Dependencies

os - File system operations
random - Random fallback selection
numpy - Array operations for finding last occurrence

Usage Context

This module supports VMCBench, which evaluates vision-language models on multiple-choice questions drawn from 20 existing datasets. Grouping those datasets into four categories lets per-category accuracy expose a model's strengths and weaknesses in general understanding, reasoning, OCR, and document comprehension.

Datasets Category Map

datasets_category_map = {
    "SEEDBench": "general", "MMStar": "general",
    "A-OKVQA": "general", "VizWiz": "general",
    "MMVet": "general", "VQAv2": "general", "OKVQA": "general",
    "MMMU": "reason", "MathVista": "reason",
    "ScienceQA": "reason", "RealWorldQA": "reason",
    "GQA": "reason", "MathVision": "reason",
    "TextVQA": "ocr", "OCRVQA": "ocr",
    "AI2D": "doc", "ChartQA": "doc", "DocVQA": "doc",
    "InfoVQA": "doc", "TableVQABench": "doc",
}
