Implementation:EvolvingLMMs Lab Lmms eval JMMMU Utils

From Leeroopedia
Revision as of 12:31, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/EvolvingLMMs_Lab_Lmms_eval_JMMMU_Utils.md)
Knowledge Sources
Domains: Multimodal_Learning, Vision_Language_Models, Japanese_NLP, Model_Evaluation
Last Updated: 2026-02-14 00:00 GMT

Overview

Task utilities for evaluating multimodal models on the Japanese MMMU (Massive Multi-discipline Multimodal Understanding) benchmark.

Description

This module provides utilities for processing, evaluating, and aggregating results on the Japanese MMMU dataset. It handles both multiple-choice and open-ended questions written in Japanese, supports interleaved image and text input, and aggregates accuracy by domain: Art and Psychology, Business, Science, Health and Medicine, and Tech and Engineering. The evaluation logic is adapted from the official MMMU repository, with Japanese-specific changes to answer parsing and validation.

Usage

Use this module when evaluating multimodal models on the Japanese MMMU benchmark. It provides document-to-text conversion, visual extraction, answer parsing for both question types, and hierarchical accuracy computation across domains and subdomains.

Code Reference

Source Location

Signature

def jmmmu_doc_to_text(doc: dict) -> str:
    """Convert document to text prompt with Japanese instructions."""
    ...

def jmmmu_doc_to_visual(doc: dict) -> list:
    """Extract visual content from document."""
    ...

def jmmmu_process_results(doc: dict, results: list) -> dict:
    """Process model predictions and prepare for evaluation."""
    ...

def jmmmu_aggregate_results(results: list) -> float:
    """Aggregate results across domains and compute overall accuracy."""
    ...

# Helper functions
def parse_multi_choice_response(response: str, all_choices: list, index2ans: dict) -> str:
    """Parse multiple choice response with Japanese text handling."""
    ...

def parse_open_response(response: str) -> list:
    """Parse open-ended response and extract answers."""
    ...

def normalize_str(string: str) -> list:
    """Normalize string to handle numbers and text."""
    ...

Import

from lmms_eval.tasks.jmmmu.utils import (
    jmmmu_doc_to_text,
    jmmmu_doc_to_visual,
    jmmmu_process_results,
    jmmmu_aggregate_results,
)

I/O Contract

Inputs

Name | Type | Required | Description
doc | dict | Yes | Document containing question, options, images, answer, question_type
results | list | Yes | Model predictions (for jmmmu_process_results)
results | list[dict] | Yes | List of evaluation samples (for jmmmu_aggregate_results)

Outputs

Name | Type | Description
prompt | str | Japanese prompt with question and instructions
visual | list | List of PIL Image objects
processed | dict | Dictionary with jmmmu_acc and submission keys
accuracy | float | Overall accuracy score (0.0 to 1.0)

Core Functions

Document Processing

jmmmu_doc_to_text(doc)

  • Constructs Japanese prompt from question
  • Adds multiple-choice options with A, B, C, D labels
  • Appends the Japanese instruction that matches the question type:
      • Multiple-choice: "与えられた選択肢の中から最も適切な回答のアルファベットを直接記入してください。"
      • Open-ended: "質問に対する回答を単語や短いフレーズで記入してください。"
  • Handles image token replacement for interleaved format
  • Returns formatted prompt string
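
The prompt-building steps above can be sketched as follows. This is a minimal illustration rather than the module's own code: `build_prompt_sketch` is a hypothetical name, and the assumption that `options` arrives as a stringified Python list is taken from the usage examples on this page. Image-token replacement is left out.

```python
import ast

# Instruction strings quoted from this page.
MULTI_CHOICE_PROMPT = "与えられた選択肢の中から最も適切な回答のアルファベットを直接記入してください。"
OPEN_ENDED_PROMPT = "質問に対する回答を単語や短いフレーズで記入してください。"


def build_prompt_sketch(doc: dict) -> str:
    """Hypothetical helper: assemble question, lettered options, and instruction."""
    question = doc["question"]
    if doc["question_type"] == "multiple-choice":
        # Assumption: options are stored as a stringified list, e.g. "['犬', '猫']".
        options = ast.literal_eval(doc["options"])
        letters = [chr(ord("A") + i) for i in range(len(options))]
        lines = [f"{letter}. {option}" for letter, option in zip(letters, options)]
        return f"{question}\n" + "\n".join(lines) + f"\n\n{MULTI_CHOICE_PROMPT}"
    # Any other question type is treated as open-ended in this sketch.
    return f"{question}\n\n{OPEN_ENDED_PROMPT}"
```

Generating the letters with `chr(ord("A") + i)` keeps the sketch independent of the number of options, so a five-option question would simply gain an `E.` line.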

jmmmu_doc_to_visual(doc)

  • Extracts image tokens from prompt using regex
  • Identifies unique image tokens (e.g., <image 1>, <image 2>)
  • Loads corresponding images from document
  • Converts images to RGB format
  • Returns list of PIL Image objects
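
A rough sketch of the token-to-image lookup described above. The function name `collect_visuals_sketch` is hypothetical, the `image_N` key layout is assumed from the usage examples on this page, and the sketch scans only the question text and keeps tokens in first-appearance order, which may differ from the real function.

```python
import re


def collect_visuals_sketch(doc: dict) -> list:
    """Hypothetical helper: return the images referenced in the question."""
    # Find tokens such as "<image 1>" and keep the first occurrence of each.
    numbers = re.findall(r"<image (\d+)>", doc["question"])
    unique_numbers = list(dict.fromkeys(numbers))
    # Assumption: images live under keys like "image_1" and expose
    # a PIL-style .convert("RGB") method.
    return [doc[f"image_{n}"].convert("RGB") for n in unique_numbers]
```

Deduplicating with `dict.fromkeys` means a question that mentions `<image 1>` twice still yields that image only once.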

Result Processing

jmmmu_process_results(doc, results)

  • Extracts prediction from results
  • Routes to appropriate parser based on question_type
  • For multiple-choice: Uses parse_multi_choice_response
  • For open-ended: Uses parse_open_response
  • Extracts subdomain from document ID
  • Returns dictionary with accuracy info and submission
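
The subdomain extraction step can be illustrated with a short sketch. It assumes that IDs follow the `<split>_<Subdomain>_<index>` shape seen in the usage examples (for instance `validation_Math_42`); `subdomain_from_id_sketch` is a hypothetical name.

```python
def subdomain_from_id_sketch(sample_id: str) -> str:
    """Hypothetical helper: drop the split prefix and the trailing index."""
    # Subdomain names may themselves contain underscores, so rejoin the middle parts.
    return "_".join(sample_id.split("_")[1:-1])
```

Rejoining the middle parts matters for subdomains such as `Basic_Medical_Science`, where splitting on underscores alone would cut the name apart.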

Answer Parsing

parse_multi_choice_response(response, all_choices, index2ans)

  • Recognizes several answer formats: (A), A surrounded by spaces, A., and A placed directly between Japanese characters
  • Strips Japanese punctuation: 、。!?;:
  • Uses Japanese character pattern: r"[\u3040-\u30FF\u4E00-\u9FFF]"
  • Handles edge cases with multiple candidates
  • Selects last occurrence when multiple matches found
  • Falls back to random choice if no match found
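
The matching strategy above can be sketched in simplified form. This is not the module's implementation: `parse_choice_sketch` is a hypothetical name, it omits the `index2ans` content matching, and the exact patterns are assumptions chosen to cover the formats listed above.

```python
import random
import re

JAPANESE_CHAR = r"[\u3040-\u30FF\u4E00-\u9FFF]"  # pattern quoted from this page
PUNCTUATION = "、。!?;:"  # punctuation set as listed on this page


def parse_choice_sketch(response: str, all_choices: list, rng=random) -> str:
    """Hypothetical, simplified parser: return the last choice letter found."""
    # Pad with spaces so that a letter at either end still has neighbours.
    text = " " + response.strip(PUNCTUATION + " ") + " "
    hits = []  # (position, letter) pairs
    for choice in all_choices:
        letter = re.escape(choice)
        patterns = [
            rf"\({letter}\)",                                      # (A)
            rf"(?<![A-Za-z]){letter}\.",                           # A.
            rf"(?<=\s){letter}(?=\s)",                             # A between spaces
            rf"(?<={JAPANESE_CHAR}){letter}(?={JAPANESE_CHAR})",   # 答えはAです
        ]
        for pattern in patterns:
            hits.extend((m.start(), choice) for m in re.finditer(pattern, text))
    if not hits:
        # No recognisable letter: fall back to a random choice.
        return rng.choice(all_choices)
    # When several candidates appear, keep the one that occurs last.
    return max(hits)[1]
```

Preferring the last occurrence reflects the common case where a model discusses alternatives first and states its final answer at the end, as in "答えはAではなくBです".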

parse_open_response(response)

  • Identifies key response sentences using Japanese indicators:
      • よって (therefore), 答えは (the answer is), 最終的に (finally)
      • 解答は (the solution is), 回答は (the response is)
  • Extracts numbers from response (handles scientific notation, decimals, Japanese counters)
  • Normalizes extracted values
  • Returns list of possible answers
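
As a hedged illustration of the key-sentence step, the sketch below keeps the text that follows an answer indicator. `key_subresponses_sketch` is a hypothetical name, and splitting on 。 and preferring the shortest tail are assumptions of this sketch, not confirmed details of the module.

```python
# Indicator phrases quoted from this page.
INDICATORS = ["よって", "答えは", "最終的に", "解答は", "回答は"]


def key_subresponses_sketch(response: str) -> list:
    """Hypothetical helper: keep the text following an answer indicator."""
    # Split on the Japanese full stop and drop empty pieces.
    sentences = [s.strip() for s in response.split("。") if s.strip()]
    candidates = []
    for sentence in sentences:
        # Text after the last occurrence of each indicator present in the sentence.
        tails = [sentence.split(ind)[-1].strip(" 、") for ind in INDICATORS if ind in sentence]
        tails = [t for t in tails if t]
        if tails:
            # The shortest tail is the one closest to the stated answer.
            candidates.append(min(tails, key=len))
    # Without any indicator, fall back to the whole response.
    return candidates or [response]
```

The returned fragments would then go through number extraction and normalization before comparison with the gold answer.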

Normalization

normalize_str(string)

  • Checks if string is a number
  • If number: converts to float, rounds to 2 decimals
  • If string: converts to lowercase
  • For single characters: returns the character padded with spaces, so that a one-letter answer does not trivially match inside longer text
  • Returns list of normalized forms
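
The normalization rules above can be sketched as below. `normalize_sketch` is a hypothetical name; stripping thousands separators and returning both a leading-space and a trailing-space variant for single characters are assumptions of this sketch.

```python
def normalize_sketch(value: str) -> list:
    """Hypothetical helper mirroring the normalization bullets."""
    text = value.strip()
    try:
        # Numeric strings: drop thousands separators, convert, round to 2 decimals.
        return [round(float(text.replace(",", "")), 2)]
    except ValueError:
        pass
    text = text.lower()
    if len(text) == 1:
        # Pad single characters so "a" does not match inside longer words.
        return [" " + text, text + " "]
    return [text]
```

Returning a list in every branch lets callers extend a candidate list uniformly, whether the input was a number, a word, or a single character.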

Evaluation

eval_multi_choice(gold_i, pred_i)

  • Exact match comparison
  • Handles gold answers as list or string
  • Returns True only if exact match found

eval_open(gold_i, pred_i)

  • Normalizes gold and predicted answers
  • For strings: checks if any normalized gold answer appears in prediction
  • For numbers: checks exact match after normalization
  • Returns True if any match found
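
A minimal sketch of the two comparison rules, assuming that gold answers and predictions for open questions have already been normalized into lists of lowercase strings and floats. Both function names are hypothetical.

```python
def eval_multi_choice_sketch(gold, pred: str) -> bool:
    """Exact match; the gold answer may be one letter or a list of letters."""
    golds = gold if isinstance(gold, list) else [gold]
    return pred in golds


def eval_open_sketch(norm_golds: list, preds: list) -> bool:
    """Hypothetical helper: both arguments are assumed to be already normalized."""
    for pred in preds:
        if isinstance(pred, str):
            # Strings: a gold string contained in the prediction counts as a match.
            if any(isinstance(g, str) and g in pred for g in norm_golds):
                return True
        elif pred in norm_golds:
            # Numbers: require equality after normalization.
            return True
    return False
```

The asymmetry is deliberate in this sketch: free-text answers are matched by containment because models tend to wrap the answer in a sentence, while numbers must be equal after rounding.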

evaluate_jmmmu(samples)

  • Batch evaluation for list of samples
  • Routes to appropriate eval function based on question_type
  • Computes accuracy as correct / total
  • Returns judge_dict (per-sample) and metric_dict (accuracy)

Aggregation

jmmmu_aggregate_results(results)

  • Groups results by subdomain
  • Evaluates each subdomain separately
  • Computes domain-level accuracy across subdomains
  • Calculates weighted average based on sample counts
  • Returns overall accuracy
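
The first aggregation step, grouping samples by subdomain, might look like the sketch below. The `subdomain` key matches the sample dictionaries shown in the usage examples; `group_by_subdomain_sketch` is a hypothetical name.

```python
from collections import defaultdict


def group_by_subdomain_sketch(results: list) -> dict:
    """Hypothetical helper: bucket evaluation samples by their subdomain."""
    groups = defaultdict(list)
    for sample in results:
        groups[sample["subdomain"]].append(sample)
    # Return a plain dict so that missing keys raise instead of creating buckets.
    return dict(groups)
```

Each bucket can then be evaluated on its own before the per-subdomain accuracies are combined.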

calculate_ins_level_acc(results)

  • Computes instruction-level accuracy
  • Weighted by number of examples in each category
  • Returns accuracy across all categories
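
The weighting described above amounts to a sample-count-weighted mean. The sketch below assumes each category is summarized by a dictionary with `acc` and `num_example` keys; those key names and the function name `weighted_accuracy_sketch` are assumptions made for illustration.

```python
def weighted_accuracy_sketch(per_category: dict) -> float:
    """Average accuracy weighted by each category's number of examples."""
    total = sum(c["num_example"] for c in per_category.values())
    if total == 0:
        # Avoid division by zero when no samples were evaluated.
        return 0.0
    correct = sum(c["acc"] * c["num_example"] for c in per_category.values())
    return correct / total
```

For example, 30 Math samples at accuracy 1.0 and 10 Physics samples at accuracy 0.5 give (30 + 5) / 40 = 0.875, rather than the unweighted mean of 0.75.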

Domain Structure

The evaluation uses a hierarchical domain structure:

Art and Psychology

  • Design, Music, Psychology

Business

  • Accounting, Economics, Finance, Manage, Marketing

Science

  • Biology, Chemistry, Math, Physics

Health and Medicine

  • Basic_Medical_Science, Clinical_Medicine
  • Diagnostics_and_Laboratory_Medicine
  • Pharmacy, Public_Health

Tech and Engineering

  • Agriculture, Architecture_and_Engineering
  • Computer_Science, Electronics
  • Energy_and_Power, Materials
  • Mechanical_Engineering
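
For illustration, the hierarchy listed above can be written as a mapping from domain to subdomains. The names `DOMAIN_TO_SUBDOMAINS` and `domain_of_sketch` are hypothetical, and the mapping contains only the subdomains listed on this page.

```python
DOMAIN_TO_SUBDOMAINS = {
    "Art and Psychology": ["Design", "Music", "Psychology"],
    "Business": ["Accounting", "Economics", "Finance", "Manage", "Marketing"],
    "Science": ["Biology", "Chemistry", "Math", "Physics"],
    "Health and Medicine": [
        "Basic_Medical_Science", "Clinical_Medicine",
        "Diagnostics_and_Laboratory_Medicine", "Pharmacy", "Public_Health",
    ],
    "Tech and Engineering": [
        "Agriculture", "Architecture_and_Engineering", "Computer_Science",
        "Electronics", "Energy_and_Power", "Materials", "Mechanical_Engineering",
    ],
}


def domain_of_sketch(subdomain: str) -> str:
    """Hypothetical helper: look up the parent domain of a subdomain."""
    for domain, subdomains in DOMAIN_TO_SUBDOMAINS.items():
        if subdomain in subdomains:
            return domain
    raise KeyError(subdomain)
```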

Usage Examples

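from lmms_eval.tasks.jmmmu.utils import (
    jmmmu_aggregate_results,
    jmmmu_doc_to_text,
    jmmmu_doc_to_visual,
    jmmmu_process_results,
    parse_multi_choice_response,
    parse_open_response,
)

# Example 1: Convert document to text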
doc = {
    "question": "<image 1>この画像に何が写っていますか?",
    "question_type": "multiple-choice",
    "options": "['犬', '猫', '鳥', '魚']"
}
prompt = jmmmu_doc_to_text(doc)
# Returns: "<image 1>この画像に何が写っていますか?\nA. 犬\nB. 猫\nC. 鳥\nD. 魚\n\n与えられた選択肢の中から..."

# Example 2: Extract visuals
doc = {
    "question": "<image 1>と<image 2>の違いは?",
    "image_1": PIL_Image_object_1,  # placeholder for a PIL.Image.Image instance
    "image_2": PIL_Image_object_2,  # placeholder for a PIL.Image.Image instance
}
visuals = jmmmu_doc_to_visual(doc)
# Returns: [PIL_Image_object_1_RGB, PIL_Image_object_2_RGB]

# Example 3: Parse multiple choice response
response = "答えはAです。"
all_choices = ["A", "B", "C", "D"]
index2ans = {"A": "犬", "B": "猫", "C": "鳥", "D": "魚"}
parsed = parse_multi_choice_response(response, all_choices, index2ans)
# Returns: "A"

# Example 4: Parse open response
response = "計算すると、3 + 5 = 8です。よって答えは8になります。"
parsed = parse_open_response(response)
# Returns: ['8', 8.0, '計算すると、3 + 5 = 8です。', ...]

# Example 5: Process results
doc = {
    "id": "validation_Math_42",
    "question_type": "multiple-choice",
    "options": "['A', 'B', 'C', 'D']",
    "answer": "B"
}
results = ["答えはBです"]
processed = jmmmu_process_results(doc, results)
# Returns: {
#     "jmmmu_acc": {
#         "id": "validation_Math_42",
#         "subdomain": "Math",
#         "question_type": "multiple-choice",
#         "answer": "B",
#         "parsed_pred": "B"
#     },
#     "submission": {"validation_Math_42": "答えはBです"}
# }

# Example 6: Aggregate results
results = [
    {"subdomain": "Math", "question_type": "multiple-choice",
     "answer": "A", "parsed_pred": "A", "id": "1"},
    {"subdomain": "Math", "question_type": "open",
     "answer": "42", "parsed_pred": ["42", 42.0], "id": "2"}
]
overall_acc = jmmmu_aggregate_results(results)
# Returns: 1.0 (100% accuracy)

Japanese-Specific Features

Text Processing

  • Handles Japanese punctuation: 、。!?;:
  • Uses Unicode ranges for Japanese characters:
      • Hiragana: \u3040-\u309F
      • Katakana: \u30A0-\u30FF
      • Kanji: \u4E00-\u9FFF
  • Extracts numbers with Japanese counters: つ、個、度、円、人、年、匹、台、%

Answer Indicators

  • よって (therefore)
  • 答えは (the answer is)
  • 最終的に (finally)
  • 解答は (the solution is)
  • 回答は (the response is)

Number Extraction

Supports multiple number formats:

  • Commas: 1,234
  • Scientific notation: 1.5e10
  • Decimals: 3.14
  • Japanese counters: 5個、3つ、100円
