Implementation:EvolvingLMMs Lab Lmms eval MathVision Utils

From Leeroopedia
Revision as of 12:31, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/EvolvingLMMs_Lab_Lmms_eval_MathVision_Utils.md)

Location: /tmp/kapso_repo_sslb_59s/lmms_eval/tasks/mathvision/utils.py

Principle: Task_Utility_Functions

Purpose

Task-specific utilities for the MathVision benchmark, a mathematical visual question answering task. The module supports two scoring paths: standard rule-based evaluation and LLM-as-judge scoring.

Configuration

  • API_TYPE: read from the environment (default: "openai")
  • GPT_MODEL: read from the MODEL_VERSION environment variable (default: "gpt-4o-2024-11-20")
  • A ServerConfig is initialized for the LLM judge
  • NUM_SECONDS_TO_SLEEP = 5
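The settings above can be sketched as follows. This is a minimal illustration, not the module's code: `load_judge_settings` is a hypothetical helper, and the name of the environment variable behind API_TYPE is assumed to be `API_TYPE`. The ServerConfig setup is omitted because its constructor is not described here.

```python
import os

NUM_SECONDS_TO_SLEEP = 5  # constant listed in the configuration above


def load_judge_settings(env=None):
    """Hypothetical helper: read judge settings from an environment mapping."""
    env = os.environ if env is None else env
    return {
        # Assumption: the variable is named API_TYPE.
        "api_type": env.get("API_TYPE", "openai"),
        # GPT_MODEL is populated from MODEL_VERSION, per the list above.
        "gpt_model": env.get("MODEL_VERSION", "gpt-4o-2024-11-20"),
    }
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check without touching the real environment.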

Key Functions

mathvision_doc_to_visual

def mathvision_doc_to_visual(doc)

Extracts the decoded image from the document and converts it to RGB.
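A rough sketch of this step, under two assumptions: the image is stored under a `decoded_image` key, and the result is wrapped in a one-element list. The image object only needs a PIL-style `convert()` method, so nothing is imported here.

```python
def doc_to_visual_sketch(doc):
    """Return the document's image as a one-element list in RGB mode."""
    # Assumption: the image lives under "decoded_image" and exposes
    # a PIL-style convert(mode) method.
    image = doc["decoded_image"]
    return [image.convert("RGB")]
```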

mathvision_doc_to_text

def mathvision_doc_to_text(doc, lmms_eval_specific_kwargs=None)

Formats question with options:

  • Constructs multiple-choice options (A, B, C, ...)
  • Adds optional mc_prompt from kwargs
  • Base prompt: 'Please solve the problem step by step and put your answer in one "\\boxed{}".'
  • Appends choices if available
  • Returns formatted query prompt
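The prompt construction described above might look like the sketch below. The field names `question` and `options`, the `Choices:` label, and the exact spacing are assumptions made for illustration; only the base prompt text and the `mc_prompt` kwarg come from the description.

```python
BASE_PROMPT = 'Please solve the problem step by step and put your answer in one "\\boxed{}".'


def build_prompt(doc, lmms_eval_specific_kwargs=None):
    """Sketch: format a question, optional lettered choices, and optional mc_prompt."""
    question = doc["question"]
    choices = doc.get("options") or []

    # Label the choices A, B, C, ... in order.
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    choices_str = "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, choices))

    mc_prompt = ""
    if lmms_eval_specific_kwargs and lmms_eval_specific_kwargs.get("mc_prompt"):
        mc_prompt = "\n" + lmms_eval_specific_kwargs["mc_prompt"]

    prompt = BASE_PROMPT + " " + question
    if choices_str:
        # Choices (and the multiple-choice hint) are appended only when present.
        prompt += "\nChoices:\n" + choices_str + mc_prompt
    return prompt
```

Free-form questions fall through with just the base prompt and the question text.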

mathvision_gpt_eval_process_results

def mathvision_gpt_eval_process_results(doc, results)

LLM-as-judge evaluation:

  • Uses server.evaluate_binary() for each prediction
  • Output format: "0/1"
  • Compares model answer to ground truth
  • Returns average score as "llm_as_judge_eval"
  • Logs errors if judge evaluation fails
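The averaging logic can be illustrated without the real judge server. In this sketch `judge` is a hypothetical callable standing in for `server.evaluate_binary()`, whose actual signature is not shown here; treating a failed call as a score of 0 is also an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def judge_average(question, answer, predictions, judge):
    """Average binary verdicts over all predictions.

    `judge` is a hypothetical callable (question, answer, prediction) -> "0" or "1".
    """
    scores = []
    for pred in predictions:
        try:
            verdict = str(judge(question, answer, pred)).strip()
            scores.append(1 if verdict == "1" else 0)
        except Exception as exc:
            # Assumption: a failed judge call is logged and scored as 0.
            logger.error("Judge evaluation failed: %s", exc)
            scores.append(0)
    average = sum(scores) / len(scores) if scores else 0.0
    return {"llm_as_judge_eval": average}
```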

mathvision_process_results

def mathvision_process_results(doc, results)

Standard evaluation with extensive answer normalization:

Answer Extraction:

  • Checks for answer choice format (ABCDE with various delimiters)
  • Extracts numeric values after "is "
  • Handles \\boxed{} LaTeX format (takes last occurrence if multiple)
  • Searches for answer prefixes ("the final answer is", "the answer is", etc.)
  • Removes markdown formatting, parentheses, braces
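Two of the extraction rules above, taking the last boxed expression and falling back to an answer prefix, can be sketched as below. The helper names, the order of the checks, and the prefix list are illustrative assumptions; the real function applies more rules than shown.

```python
ANSWER_PREFIXES = ("the final answer is", "the answer is")


def last_boxed(text):
    """Return the contents of the last \\boxed{...}, allowing nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # braces never balanced


def extract_answer(response):
    """Sketch: prefer the last boxed answer, then an answer prefix, then the raw text."""
    boxed = last_boxed(response)
    if boxed is not None:
        return boxed.strip()
    lowered = response.lower()
    for prefix in ANSWER_PREFIXES:
        pos = lowered.rfind(prefix)
        if pos != -1:
            tail = response[pos + len(prefix):]
            # Drop a trailing period and surrounding markdown emphasis.
            return tail.strip().rstrip(".").strip("*").strip()
    return response.strip()
```

Brace counting is used instead of a regular expression so that nested LaTeX such as `\frac{1}{2}` inside the box is returned intact.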

Answer Normalization:

  • Uses find_math_answer() from eval_utils
  • Removes option formatting: (a), {b}, etc.
  • Strips periods and colons
  • Compares to ground truth and option values
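As an illustration of the option-format stripping and the comparison against option values, consider the sketch below. Both helper names are hypothetical, and the call to `find_math_answer()` is left out because its behavior is not described here.

```python
import re


def strip_option_format(answer):
    """Reduce '(a)', '{B}.', or 'c:' to a bare lowercase letter; leave other answers alone."""
    text = answer.strip().rstrip(".:").strip()
    match = re.fullmatch(r"[\(\{\[]?\s*([A-Ea-e])\s*[\)\}\]]?", text)
    return match.group(1).lower() if match else text


def is_correct(prediction, gt_answer, options=None):
    """Sketch: match either the ground-truth letter or the text of that option."""
    pred = strip_option_format(prediction)
    gt = strip_option_format(gt_answer)
    if pred == gt:
        return True
    # For multiple-choice items, also accept the value of the correct option.
    if options and len(gt) == 1 and gt in "abcde":
        idx = ord(gt) - ord("a")
        if idx < len(options):
            return pred == strip_option_format(str(options[idx]))
    return False
```

Lowercasing the extracted letter is what makes the option match case-insensitive in this sketch.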

Returns:

  • Dict with "mathvision_standard_eval" containing:
    • response: list of predictions
    • scores: list of correctness bools

mathvision_aggregate_results_eval

def mathvision_aggregate_results_eval(results)

Aggregates standard evaluation:

  • Counts correct predictions (scores[0] == True)
  • Computes percentage accuracy
  • Returns rounded accuracy (2 decimal places)
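The aggregation steps above amount to a few lines. This sketch assumes each result is the dict described under Returns (with a `scores` list); the empty-input guard is an addition for safety, not something the description states.

```python
def aggregate_accuracy(results):
    """Percentage of items whose first score is truthy, rounded to 2 decimals."""
    if not results:
        return 0.0  # assumption: avoid division by zero on empty input
    correct = sum(1 for result in results if result["scores"][0])
    return round(correct / len(results) * 100, 2)
```

For example, two correct items out of three yields 66.67.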

Implementation Details

  • Dual evaluation modes: standard and LLM judge
  • Extensive answer format handling (LaTeX, multiple choice, natural language)
  • Imports find_math_answer, is_equal, and is_number from eval_utils, with error handling around the import
  • Supports questions with or without multiple-choice options
  • Case-insensitive option matching (converts (a) to a, etc.)
