

Implementation:EvolvingLMMs Lab Lmms eval VMCBench Utils

From Leeroopedia

Source File: `lmms_eval/tasks/vmcbench/utils.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The VMCBench Utils module provides evaluation functions for the VMCBench (Visual Multi-Choice Benchmark), which aggregates multiple existing benchmarks into a unified evaluation framework. It handles multiple-choice question parsing, answer extraction with multiple strategies, and category-based performance analysis across general, reasoning, OCR, and document understanding tasks.

Key Functions

Document Processing

vmcbench_doc_to_visual(doc)
Prepares image for model input
  • Converts document image to RGB format
  • Returns list containing single image
vmcbench_doc_to_text(doc, lmms_eval_specific_kwargs=None)
Formats multiple-choice question with options
  • Extracts question from document
  • Constructs options dictionary from fields A, B, C, D
  • Formats options as a lettered list (A through D)
  • Applies optional pre_prompt and post_prompt from kwargs
  • Returns formatted question string with options
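The prompt construction described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name `doc_to_text_sketch`, the `"Options:"` header, and the exact line layout are assumptions; only the `question` field, the A to D option fields, and the `pre_prompt` / `post_prompt` kwargs come from the description.

```python
def doc_to_text_sketch(doc, lmms_eval_specific_kwargs=None):
    """Build a multiple-choice prompt from a document dict (illustrative only)."""
    kwargs = lmms_eval_specific_kwargs or {}
    # Collect the options stored under fields A-D, skipping any that are absent.
    options = {letter: doc[letter] for letter in "ABCD" if letter in doc}
    # Assumed layout: question, an "Options:" header, then one "X. text" line each.
    lines = [doc["question"], "Options:"]
    lines += [f"{letter}. {text}" for letter, text in options.items()]
    body = "\n".join(lines)
    # Optional prompts wrap the formatted question; both default to empty.
    return f"{kwargs.get('pre_prompt', '')}{body}{kwargs.get('post_prompt', '')}"


doc = {"question": "What color is the sky?",
       "A": "Red", "B": "Blue", "C": "Green", "D": "Black"}
prompt = doc_to_text_sketch(
    doc, {"pre_prompt": "", "post_prompt": "\nAnswer with the option letter."}
)
print(prompt)
```

Keeping the prompts in kwargs rather than in the function body lets a task configuration vary the instruction text without touching the formatting logic.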

Answer Parsing

parse_multi_choice_response(response, all_choices, index2ans)
Parses prediction to extract answer choice
  • Normalizes response by stripping leading and trailing punctuation and padding it with spaces, so that " A " can match at either end
  • Uses multiple extraction strategies in order:
  1. Bracketed format: Searches for "(A)", "(B)", etc.
  2. Dotted format: Searches for "A. ", "B. ", etc.
  3. Spaced format: Searches for " A ", " B ", etc.
  4. Content matching: Searches for full answer text in response
  5. Random fallback: If no match found, randomly selects a choice
  • Handles multiple candidates by finding last occurrence
  • Tracks whether index or content was matched
  • Returns predicted choice letter

Results Processing

vmcbench_process_results(doc, results)
Processes model prediction and maps to categories
  • Extracts response from results list
  • Creates all_choices list: ['A', 'B', 'C', 'D']
  • Builds index2ans mapping from document
  • Parses response to extract predicted choice
  • Compares with ground truth answer
  • Computes binary score (1 or 0)
  • Maps dataset category to main category using datasets_category_map
  • Returns dictionary with:
    • Main category metric (general, reason, ocr, doc)
    • Average metric (across all categories)
    • Each containing question ID, category, and score
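A rough sketch of this flow is shown below. Several names are assumptions made for illustration: the document fields `index`, `category`, and `answer`, the `"average"` key, the abbreviated `CATEGORY_MAP_SUBSET`, and the stand-in `first_letter_parser` (used here in place of parse_multi_choice_response so the example stays self-contained). Whether the stored category is the main category or the source dataset name is also assumed.

```python
# Abbreviated stand-in for datasets_category_map (four of the twenty entries).
CATEGORY_MAP_SUBSET = {
    "MMStar": "general",
    "MathVista": "reason",
    "TextVQA": "ocr",
    "DocVQA": "doc",
}


def first_letter_parser(response, all_choices, index2ans):
    """Hypothetical stand-in for parse_multi_choice_response."""
    letter = response.strip()[:1].upper()
    return letter if letter in all_choices else all_choices[0]


def process_results_sketch(doc, results, parse_fn=first_letter_parser):
    response = results[0]                       # single generation per sample
    all_choices = ["A", "B", "C", "D"]
    index2ans = {choice: doc[choice] for choice in all_choices}
    pred = parse_fn(response, all_choices, index2ans)
    score = 1 if pred == doc["answer"] else 0   # binary correctness
    category = CATEGORY_MAP_SUBSET[doc["category"]]
    record = {"index": doc["index"], "category": category, "score": score}
    # The same record is reported under its main category and under the average.
    return {category: record, "average": record}


doc = {"index": 7, "category": "MathVista", "answer": "B",
       "A": "3", "B": "4", "C": "5", "D": "6"}
out = process_results_sketch(doc, ["B"])
print(out)
```

Emitting the record under two keys lets the aggregation step compute a per-category score and an overall score with the same function.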

Aggregation

vmcbench_aggregate_results(results)
Aggregates scores across samples
  • Extracts score from each result dictionary
  • Computes mean score
  • Returns average accuracy
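As a small worked example of the aggregation step, assuming each result dictionary carries a `score` field (the field name and the function name `aggregate_results_sketch` are illustrative):

```python
def aggregate_results_sketch(results):
    """Mean of the binary scores; assumes a non-empty list of result dicts."""
    scores = [result["score"] for result in results]
    return sum(scores) / len(scores)


sample_results = [{"score": 1}, {"score": 0}, {"score": 1}, {"score": 1}]
accuracy = aggregate_results_sketch(sample_results)
print(accuracy)  # 3 correct out of 4 -> 0.75
```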

Category Mapping

The module maps individual benchmark datasets to four main categories:

General Category

  • SEEDBench, MMStar, A-OKVQA
  • VizWiz, MMVet, VQAv2, OKVQA

Reasoning Category

  • MMMU, MathVista, ScienceQA
  • RealWorldQA, GQA, MathVision

OCR Category

  • TextVQA, OCRVQA

Document Understanding Category

  • AI2D, ChartQA, DocVQA
  • InfoVQA, TableVQABench
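The grouping above is equivalent to a flat dataset-to-category lookup. The sketch below inverts the four lists into such a mapping and checks the dataset count; the `groups` variable is introduced only for illustration.

```python
from collections import Counter

# The four category lists, keyed by the short category names the module uses.
groups = {
    "general": ["SEEDBench", "MMStar", "A-OKVQA", "VizWiz", "MMVet", "VQAv2", "OKVQA"],
    "reason": ["MMMU", "MathVista", "ScienceQA", "RealWorldQA", "GQA", "MathVision"],
    "ocr": ["TextVQA", "OCRVQA"],
    "doc": ["AI2D", "ChartQA", "DocVQA", "InfoVQA", "TableVQABench"],
}

# Invert to a dataset -> category lookup, the shape used at scoring time.
datasets_category_map = {name: cat for cat, names in groups.items() for name in names}

per_category = Counter(datasets_category_map.values())
print(len(datasets_category_map), dict(per_category))
print(datasets_category_map["ChartQA"])
```

The categories are unevenly sized (7, 6, 2, and 5 datasets), which is worth keeping in mind when comparing per-category scores.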

Parsing Strategy

The parse_multi_choice_response function implements a waterfall strategy:

  1. Format-based matching: Looks for explicit answer markers
    • Bracketed: "(A)", "(B)", "(C)", "(D)"
    • Dotted: "A. ", "B. ", "C. ", "D. "
    • Spaced: " A ", " B ", " C ", " D "
  2. Content-based matching: Searches for full answer text (case-insensitive)
  3. Conflict resolution: If multiple candidates found, uses last occurrence
  4. Random fallback: Ensures all samples have a prediction

This multi-strategy approach handles diverse model output formats, from a bare letter to a full-sentence explanation, while guaranteeing that every response maps to exactly one choice letter.
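The waterfall can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's code: the name `parse_choice_sketch` and the `rng` parameter are hypothetical, and the last occurrence is found with `str.rfind` and `max` rather than the numpy call the module depends on.

```python
import random


def parse_choice_sketch(response, all_choices, index2ans, rng=random):
    """Waterfall parser: letter markers first, then answer text, then random."""
    # Trim punctuation from both ends, then pad so " A " can match at the edges.
    text = " " + response.strip(",.!?;:'") + " "

    # Step 1: format-based matching, trying one marker style at a time.
    patterns = {}
    for template in ("({})", "{}. ", " {} "):
        patterns = {c: template.format(c) for c in all_choices
                    if template.format(c) in text}
        if patterns:
            break

    # Step 2: content-based matching on the full answer text, case-insensitive.
    if not patterns:
        text = text.lower()
        patterns = {c: ans.lower() for c, ans in index2ans.items()
                    if ans.lower() in text}

    # Step 4: random fallback so every sample still receives a prediction.
    if not patterns:
        return rng.choice(all_choices)

    # Step 3: conflict resolution, keeping the candidate that occurs last.
    return max(patterns, key=lambda c: text.rfind(patterns[c]))


choices = ["A", "B", "C", "D"]
answers = {"A": "dog", "B": "cat", "C": "bird", "D": "fish"}
print(parse_choice_sketch("(A) is wrong, so the answer is (C)", choices, answers))
print(parse_choice_sketch("The answer is B.", choices, answers))
print(parse_choice_sketch("it looks like a cat", choices, answers))
```

Preferring the last occurrence reflects the common pattern of a model discussing several options before committing to one. The random fallback keeps the sample count fixed, at the cost of awarding chance-level credit to unparseable responses.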

Design Characteristics

  • Multi-Benchmark Aggregation: Unifies evaluation across 20 different datasets
  • Robust Parsing: Multiple strategies for extracting answers from free-form responses
  • Category Analysis: Groups datasets by capability type for targeted analysis
  • Fallback Handling: Ensures all samples produce predictions via random selection
  • Position-Aware: Uses last occurrence when multiple answer markers present
  • Flexible Prompting: Supports pre/post prompts for experimental variations

Dependencies

  • os - File system operations
  • random - Random fallback selection
  • numpy - Array operations for finding last occurrence

Usage Context

This module supports VMCBench, which evaluates vision-language models on multiple-choice questions drawn from 20 existing datasets. Grouping those datasets into four categories lets a single run report both an overall accuracy and separate accuracies for general understanding, reasoning, OCR, and document comprehension.

Datasets Category Map

datasets_category_map = {
    "SEEDBench": "general", "MMStar": "general",
    "A-OKVQA": "general", "VizWiz": "general",
    "MMVet": "general", "VQAv2": "general", "OKVQA": "general",
    "MMMU": "reason", "MathVista": "reason",
    "ScienceQA": "reason", "RealWorldQA": "reason",
    "GQA": "reason", "MathVision": "reason",
    "TextVQA": "ocr", "OCRVQA": "ocr",
    "AI2D": "doc", "ChartQA": "doc", "DocVQA": "doc",
    "InfoVQA": "doc", "TableVQABench": "doc",
}
