Implementation:EvolvingLMMs Lab Lmms eval VMCBench Utils
Source File: `lmms_eval/tasks/vmcbench/utils.py`
Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]
Overview
The VMCBench Utils module provides evaluation functions for the VMCBench (Visual Multi-Choice Benchmark), which aggregates multiple existing benchmarks into a unified evaluation framework. It handles multiple-choice question parsing, answer extraction with multiple strategies, and category-based performance analysis across general, reasoning, OCR, and document understanding tasks.
Key Functions
Document Processing
vmcbench_doc_to_visual(doc) - Prepares the image for model input
- Converts document image to RGB format
- Returns list containing single image
vmcbench_doc_to_text(doc, lmms_eval_specific_kwargs=None) - Formats a multiple-choice question with its options
- Extracts question from document
- Constructs options dictionary from fields A, B, C, D
- Formats options as a lettered list (A through D)
- Applies optional pre_prompt and post_prompt from kwargs
- Returns formatted question string with options
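The formatting steps above can be sketched as follows. This is a minimal illustration rather than the module's actual code: the function name `format_question`, the `question` field name, the "A. text" option layout, and the empty-string defaults for the prompts are all assumptions.

```python
def format_question(doc, lmms_eval_specific_kwargs=None):
    """Hypothetical stand-in for vmcbench_doc_to_text."""
    kwargs = lmms_eval_specific_kwargs or {}
    pre_prompt = kwargs.get("pre_prompt", "")    # assumed default
    post_prompt = kwargs.get("post_prompt", "")  # assumed default

    # Build the options dictionary from the fields A, B, C, D.
    options = {letter: doc[letter] for letter in "ABCD"}
    options_str = "\n".join(f"{letter}. {text}" for letter, text in options.items())

    return f"{pre_prompt}{doc['question']}\n{options_str}{post_prompt}"


doc = {
    "question": "What color is the sky?",
    "A": "Red", "B": "Blue", "C": "Green", "D": "Black",
}
prompt = format_question(doc, {"post_prompt": "\nAnswer with the option's letter."})
print(prompt)
```

Passing no kwargs returns the bare question followed by the four options.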
Answer Parsing
parse_multi_choice_response(response, all_choices, index2ans) - Parses the prediction to extract an answer choice
- Normalizes the response by stripping leading and trailing punctuation and padding it with spaces
- Uses multiple extraction strategies in order:
- Bracketed format: Searches for "(A)", "(B)", etc.
- Dotted format: Searches for "A. ", "B. ", etc.
- Spaced format: Searches for " A ", " B ", etc.
- Content matching: Searches for full answer text in response
- Random fallback: If no match found, randomly selects a choice
- Handles multiple candidates by finding last occurrence
- Tracks whether index or content was matched
- Returns predicted choice letter
Results Processing
vmcbench_process_results(doc, results) - Processes the model prediction and maps it to categories
- Extracts response from results list
- Creates all_choices list: ['A', 'B', 'C', 'D']
- Builds index2ans mapping from document
- Parses response to extract predicted choice
- Compares with ground truth answer
- Computes binary score (1 or 0)
- Maps dataset category to main category using datasets_category_map
- Returns dictionary with:
- Main category metric (general, reason, ocr, doc)
- Average metric (across all categories)
- Each containing question ID, category, and score
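The result-processing flow above might look roughly like the sketch below. The document field names (`index`, `category`, `answer`), the `question_id` key, and the helper `extract_choice` are assumptions made for illustration; `extract_choice` is a deliberately simplified stand-in for parse_multi_choice_response, and the category map is only an excerpt.

```python
# Excerpt of the dataset-to-category map, enough for this example.
DATASETS_CATEGORY_MAP = {
    "MMStar": "general", "MathVista": "reason",
    "TextVQA": "ocr", "DocVQA": "doc",
}


def extract_choice(response, all_choices, index2ans):
    """Simplified stand-in parser: bracketed letters first, then answer text."""
    for choice in all_choices:
        if f"({choice})" in response:
            return choice
    for choice, answer in index2ans.items():
        if answer.lower() in response.lower():
            return choice
    return None  # the real function falls back to a random choice instead


def process_results(doc, results):
    response = results[0]
    all_choices = ["A", "B", "C", "D"]
    index2ans = {choice: doc[choice] for choice in all_choices}

    pred = extract_choice(response, all_choices, index2ans)
    score = 1.0 if pred == doc["answer"] else 0.0

    entry = {"question_id": doc["index"], "category": doc["category"], "score": score}
    main_category = DATASETS_CATEGORY_MAP[doc["category"]]
    # The same entry is reported under the main category and under "average".
    return {main_category: entry, "average": entry}


doc = {
    "index": 7, "category": "TextVQA", "answer": "B",
    "A": "Stop", "B": "Yield", "C": "Exit", "D": "Slow",
}
out = process_results(doc, ["The sign says (B)."])
print(out)
```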
Aggregation
vmcbench_aggregate_results(results) - Aggregates scores across samples
- Extracts score from each result dictionary
- Computes mean score
- Returns average accuracy
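A minimal sketch of this aggregation, assuming each result is a dictionary with a numeric `score` key as described above. The function name `aggregate_results` and the guard for an empty list are additions for illustration, not confirmed behavior of the module.

```python
def aggregate_results(results):
    """Hypothetical stand-in for vmcbench_aggregate_results."""
    scores = [result["score"] for result in results]
    # Guard against an empty list (an assumption; the module may not do this).
    return sum(scores) / len(scores) if scores else 0.0


results = [
    {"question_id": 1, "category": "TextVQA", "score": 1.0},
    {"question_id": 2, "category": "OCRVQA", "score": 0.0},
    {"question_id": 3, "category": "TextVQA", "score": 1.0},
    {"question_id": 4, "category": "OCRVQA", "score": 1.0},
]
accuracy = aggregate_results(results)
print(accuracy)
```

Because each metric (general, reason, ocr, doc, average) collects its own list of result dictionaries, the same function can serve every metric.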
Category Mapping
The module maps individual benchmark datasets to four main categories:
General Category
- SEEDBench, MMStar, A-OKVQA
- VizWiz, MMVet, VQAv2, OKVQA
Reasoning Category
- MMMU, MathVista, ScienceQA
- RealWorldQA, GQA, MathVision
OCR Category
- TextVQA, OCRVQA
Document Understanding Category
- AI2D, ChartQA, DocVQA
- InfoVQA, TableVQABench
Parsing Strategy
The parse_multi_choice_response function implements a waterfall strategy:
- Format-based matching: Looks for explicit answer markers
- Bracketed: "(A)", "(B)", "(C)", "(D)"
- Dotted: "A. ", "B. ", "C. ", "D. "
- Spaced: " A ", " B ", " C ", " D "
- Content-based matching: Searches for full answer text (case-insensitive)
- Conflict resolution: If multiple candidates found, uses last occurrence
- Random fallback: Ensures all samples have a prediction
This multi-strategy approach handles diverse model output formats while guaranteeing that every response maps to exactly one choice letter.
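The waterfall described above can be sketched as follows. This is an illustrative reimplementation using only the standard library, not the module's code: the name `parse_choice`, the exact punctuation set, and the `rng` parameter are assumptions, and the module itself is described as using numpy to locate the last occurrence.

```python
import random


def parse_choice(response, all_choices, index2ans, rng=random):
    """Hypothetical sketch of the waterfall parsing strategy."""
    # Strip punctuation from both ends, then pad so " A " can match at the edges.
    padded = " " + response.strip(",.!?;:'") + " "
    candidates, last_pos = [], {}

    # Format-based matching: bracketed, then dotted, then spaced.
    # Stop at the first format that yields at least one candidate.
    for template in ("({})", "{}. ", " {} "):
        for choice in all_choices:
            marker = template.format(choice)
            if marker in padded:
                candidates.append(choice)
                last_pos[choice] = padded.rfind(marker)
        if candidates:
            break

    # Content-based matching (case-insensitive) on the full answer text.
    if not candidates:
        lowered = padded.lower()
        for choice, answer in index2ans.items():
            if answer.lower() in lowered:
                candidates.append(choice)
                last_pos[choice] = lowered.rfind(answer.lower())

    # Random fallback so that every sample receives a prediction.
    if not candidates:
        return rng.choice(all_choices)

    # Conflict resolution: the candidate whose marker occurs last wins.
    return max(candidates, key=last_pos.__getitem__)


choices = ["A", "B", "C", "D"]
answers = {"A": "Red", "B": "Blue", "C": "Green", "D": "Black"}

print(parse_choice("I think (A) is wrong, so (C).", choices, answers))
print(parse_choice("The answer is B", choices, answers))
print(parse_choice("it looks blue to me", choices, answers))
```

Passing a seeded `random.Random` instance as `rng` makes the fallback reproducible in this sketch.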
Design Characteristics
- Multi-Benchmark Aggregation: Unifies evaluation across 20 different datasets
- Robust Parsing: Multiple strategies for extracting answers from free-form responses
- Category Analysis: Groups datasets by capability type for targeted analysis
- Fallback Handling: Ensures all samples produce predictions via random selection
- Position-Aware: Uses last occurrence when multiple answer markers present
- Flexible Prompting: Supports pre/post prompts for experimental variations
Dependencies
- os: File system operations
- random: Random fallback selection
- numpy: Array operations for finding last occurrence
Usage Context
This module supports VMCBench, a comprehensive benchmark that evaluates vision-language models across diverse capabilities. By aggregating multiple existing benchmarks and providing category-based analysis, it enables broad assessment of model strengths and weaknesses across general understanding, reasoning, OCR, and document comprehension tasks.
Datasets Category Map
datasets_category_map = {
"SEEDBench": "general", "MMStar": "general",
"A-OKVQA": "general", "VizWiz": "general",
"MMVet": "general", "VQAv2": "general", "OKVQA": "general",
"MMMU": "reason", "MathVista": "reason",
"ScienceQA": "reason", "RealWorldQA": "reason",
"GQA": "reason", "MathVision": "reason",
"TextVQA": "ocr", "OCRVQA": "ocr",
"AI2D": "doc", "ChartQA": "doc", "DocVQA": "doc",
"InfoVQA": "doc", "TableVQABench": "doc",
}
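As a quick sanity check, the map can be inverted to list the datasets in each main category and to confirm the total of 20. The sketch below repeats the map so that it runs on its own; the variable name `by_category` is hypothetical.

```python
from collections import defaultdict

datasets_category_map = {
    "SEEDBench": "general", "MMStar": "general",
    "A-OKVQA": "general", "VizWiz": "general",
    "MMVet": "general", "VQAv2": "general", "OKVQA": "general",
    "MMMU": "reason", "MathVista": "reason",
    "ScienceQA": "reason", "RealWorldQA": "reason",
    "GQA": "reason", "MathVision": "reason",
    "TextVQA": "ocr", "OCRVQA": "ocr",
    "AI2D": "doc", "ChartQA": "doc", "DocVQA": "doc",
    "InfoVQA": "doc", "TableVQABench": "doc",
}

# Group dataset names under their main category.
by_category = defaultdict(list)
for dataset, category in datasets_category_map.items():
    by_category[category].append(dataset)

for category, datasets in by_category.items():
    print(f"{category}: {len(datasets)} datasets")
```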