Implementation:EvolvingLMMs Lab Lmms eval MathVision Utils
Location: lmms_eval/tasks/mathvision/utils.py
Principle: Task_Utility_Functions
Purpose
Task-specific utilities for the MathVision benchmark. They support both standard evaluation and LLM-as-judge scoring for mathematical visual question answering.
Configuration
- API_TYPE: read from the environment (default: "openai")
- GPT_MODEL: read from the MODEL_VERSION environment variable (default: "gpt-4o-2024-11-20")
- A ServerConfig is initialized for the LLM judge
- NUM_SECONDS_TO_SLEEP = 5
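A minimal sketch of how these settings might be read. It assumes the environment variable for API_TYPE has the same name as the constant; load_judge_config is a hypothetical helper, not a function in the module. The ServerConfig setup is left out because its constructor arguments are not described here.

```python
import os

NUM_SECONDS_TO_SLEEP = 5  # value listed in the configuration above


def load_judge_config(env=None):
    """Hypothetical helper: read judge settings with the documented defaults."""
    env = os.environ if env is None else env
    return {
        # Assumption: the variable is named API_TYPE, like the constant.
        "api_type": env.get("API_TYPE", "openai"),
        # GPT_MODEL comes from MODEL_VERSION, as stated above.
        "gpt_model": env.get("MODEL_VERSION", "gpt-4o-2024-11-20"),
    }
```

Passing a plain dict as env makes the defaults easy to check without touching the real process environment.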
Key Functions
mathvision_doc_to_visual
def mathvision_doc_to_visual(doc)
Extracts the decoded image from the document and converts it to RGB format.
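The conversion could look roughly like the sketch below. The key name "decoded_image", the PIL-style convert() method, and the list return value are assumptions; FakeImage is a stand-in so the example runs without Pillow installed.

```python
def doc_to_visual_sketch(doc):
    # Assumption: the decoded image is stored under "decoded_image" and
    # exposes a PIL-style convert() method; the result is wrapped in a list.
    return [doc["decoded_image"].convert("RGB")]


class FakeImage:
    """Stand-in for a PIL image (hypothetical, for illustration only)."""

    def __init__(self, mode):
        self.mode = mode

    def convert(self, mode):
        # Mimic PIL: return a new image object in the requested mode.
        return FakeImage(mode)


visuals = doc_to_visual_sketch({"decoded_image": FakeImage("RGBA")})
```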
mathvision_doc_to_text
def mathvision_doc_to_text(doc, lmms_eval_specific_kwargs=None)
Builds the query prompt from the question and its options:
- Constructs multiple-choice options (A, B, C, ...)
- Adds optional mc_prompt from kwargs
- Base prompt: 'Please solve the problem step by step and put your answer in one "\\boxed{}".'
- Appends choices if available
- Returns formatted query prompt
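The steps above can be sketched as follows. The field names "question" and "options", the "Choices:" label, and the exact ordering and separators are assumptions made for illustration; only the base prompt text and the mc_prompt key come from the description above.

```python
BASE_PROMPT = 'Please solve the problem step by step and put your answer in one "\\boxed{}".'


def doc_to_text_sketch(doc, lmms_eval_specific_kwargs=None):
    question = doc["question"]          # assumed field name
    choices = doc.get("options") or []  # assumed field name

    # Label the choices A, B, C, ... in order.
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    choices_str = "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(letters, choices)
    )

    # Optional multiple-choice instruction supplied through kwargs.
    mc_prompt = ""
    if lmms_eval_specific_kwargs and lmms_eval_specific_kwargs.get("mc_prompt"):
        mc_prompt = "\n" + lmms_eval_specific_kwargs["mc_prompt"]

    prompt = BASE_PROMPT + "\n" + question
    if choices_str:  # append choices only when the question has options
        prompt += "\nChoices:\n" + choices_str + mc_prompt
    return prompt
```

A question without options falls through to the base prompt plus the question text, which matches the "with or without multiple-choice options" behavior described for the module.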
mathvision_gpt_eval_process_results
def mathvision_gpt_eval_process_results(doc, results)
LLM-as-judge evaluation:
- Calls server.evaluate_binary() once per prediction
- Requests the judge verdict in "0/1" output format
- The judge compares the model answer to the ground truth
- Returns average score as "llm_as_judge_eval"
- Logs errors if judge evaluation fails
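A rough sketch of the averaging and error-logging flow. The judge argument is a hypothetical callable standing in for server.evaluate_binary(), whose real signature is not shown here; the field names "question" and "answer" and the choice to score a failed judgment as 0 are also assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def judge_results_sketch(doc, results, judge):
    """Average binary judge verdicts over all predictions.

    `judge` is a hypothetical callable (question, answer, prediction) -> 0 or 1.
    """
    scores = []
    for prediction in results:
        try:
            scores.append(int(judge(doc["question"], doc["answer"], prediction)))
        except Exception as exc:
            # Assumption: a failed judgment is logged and counted as incorrect.
            logger.error("Judge evaluation failed: %s", exc)
            scores.append(0)
    average = sum(scores) / len(scores) if scores else 0
    return {"llm_as_judge_eval": average}
```

For example, a trivial exact-match judge such as `lambda q, a, p: 1 if a == p else 0` can be passed in to exercise the function without any API access.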
mathvision_process_results
def mathvision_process_results(doc, results)
Standard evaluation with extensive answer normalization:
Answer Extraction:
- Checks for answer choice format (ABCDE with various delimiters)
- Extracts numeric values after "is "
- Handles the LaTeX \boxed{} format (takes the last occurrence if there are several)
- Searches for answer prefixes ("the final answer is", "the answer is", etc.)
- Removes markdown formatting, parentheses, braces
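As an illustration of the "last \boxed{} occurrence" rule, the sketch below uses simple brace matching so that nested braces such as \frac{1}{2} survive. This is one possible approach, not necessarily how the module does it, and last_boxed_sketch is a hypothetical name.

```python
def last_boxed_sketch(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)  # last occurrence wins
    if start == -1:
        return None
    depth = 1
    collected = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close of \boxed{
                return "".join(collected)
        collected.append(ch)
    return None  # braces never balanced
```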
Answer Normalization:
- Uses find_math_answer() from eval_utils
- Removes option formatting: (a), {b}, etc.
- Strips periods and colons
- Compares to ground truth and option values
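The option cleanup and comparison might be sketched like this. Plain string equality stands in for the is_equal() helper from eval_utils, and find_math_answer() is not reproduced; both function names below are hypothetical, and the A to E letter range follows the answer-choice format mentioned earlier.

```python
import re


def normalize_option_sketch(answer):
    """Strip periods/colons and unwrap option formatting like (a) or {b}."""
    answer = answer.strip().strip(".:").strip()
    match = re.fullmatch(r"[({]\s*([A-Za-z])\s*[)}]", answer)
    return match.group(1) if match else answer


def is_correct_sketch(prediction, gt_answer, options):
    """Compare a prediction to the ground truth and, if any, its option value."""
    pred = normalize_option_sketch(prediction).lower()
    gt = gt_answer.lower()
    candidates = {gt}
    # If the ground truth is an option letter, also accept that option's value.
    if options and len(gt) == 1 and "a" <= gt <= "e":
        index = ord(gt) - ord("a")
        if index < len(options):
            candidates.add(options[index].lower())
    return pred in candidates
```

Accepting either the letter or the option's value means a model that answers "4" when option B is "4" is still counted as correct.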
Returns:
- Dict with key "mathvision_standard_eval" containing:
  - response: list of predictions
  - scores: list of correctness bools
mathvision_aggregate_results_eval
def mathvision_aggregate_results_eval(results)
Aggregates standard evaluation:
- Counts correct predictions (scores[0] == True)
- Computes percentage accuracy
- Returns rounded accuracy (2 decimal places)
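The aggregation amounts to a percentage over per-document results. In this sketch each result is assumed to be the dict stored under "mathvision_standard_eval"; the empty-input guard is an addition for safety, and aggregate_sketch is a hypothetical name.

```python
def aggregate_sketch(results):
    """Percentage of results whose first score is truthy, rounded to 2 places."""
    if not results:  # assumption: report 0.0 rather than divide by zero
        return 0.0
    correct = sum(1 for result in results if result["scores"][0])
    return round(correct / len(results) * 100, 2)
```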
Implementation Details
- Dual evaluation modes: standard and LLM judge
- Extensive answer format handling (LaTeX, multiple choice, natural language)
- Imports eval_utils (find_math_answer, is_equal, is_number) with error handling
- Supports questions with or without multiple-choice options
- Case-insensitive option matching (converts (a) to a, etc.)