Implementation:EvolvingLMMs Lab Lmms eval JMMMU Utils

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Multimodal_Learning, Vision_Language_Models, Japanese_NLP, Model_Evaluation
Last Updated	2026-02-14 00:00 GMT

Overview

Task utilities for evaluating multimodal models on the Japanese MMMU (Massive Multi-discipline Multimodal Understanding) benchmark.

Description

This module provides utilities for processing, evaluating, and aggregating results from the Japanese MMMU dataset. It handles both multiple-choice and open-ended questions with Japanese text, supports image-text interleaving, and aggregates accuracy per domain across Art and Psychology, Business, Science, Health and Medicine, and Tech and Engineering. The evaluation logic is adapted from the official MMMU repository, with Japanese-specific modifications to answer parsing and validation.

Usage

Use this module when evaluating multimodal models on the Japanese MMMU benchmark. It provides document-to-text conversion, visual extraction, answer parsing for both question types, and hierarchical accuracy computation across domains and subdomains.

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/tasks/jmmmu/utils.py

Signature

def jmmmu_doc_to_text(doc: dict) -> str:
    """Convert document to text prompt with Japanese instructions."""
    ...

def jmmmu_doc_to_visual(doc: dict) -> list:
    """Extract visual content from document."""
    ...

def jmmmu_process_results(doc: dict, results: list) -> dict:
    """Process model predictions and prepare for evaluation."""
    ...

def jmmmu_aggregate_results(results: list) -> float:
    """Aggregate results across domains and compute overall accuracy."""
    ...

# Helper functions
def parse_multi_choice_response(response: str, all_choices: list, index2ans: dict) -> str:
    """Parse multiple choice response with Japanese text handling."""
    ...

def parse_open_response(response: str) -> list:
    """Parse open-ended response and extract answers."""
    ...

def normalize_str(string: str) -> list:
    """Normalize string to handle numbers and text."""
    ...

Import

from lmms_eval.tasks.jmmmu.utils import (
    jmmmu_doc_to_text,
    jmmmu_doc_to_visual,
    jmmmu_process_results,
    jmmmu_aggregate_results,
)

I/O Contract

Inputs

Name	Type	Required	Description
doc	dict	Yes	Document containing question, options, images, answer, question_type
results	list	Yes	Model predictions (for jmmmu_process_results)
results	list[dict]	Yes	List of evaluation samples (for jmmmu_aggregate_results)

Outputs

Name	Type	Description
prompt	str	Japanese prompt with question and instructions
visual	list	List of PIL Image objects
processed	dict	Dictionary with jmmmu_acc and submission keys
accuracy	float	Overall accuracy score (0.0 to 1.0)

Core Functions

Document Processing

jmmmu_doc_to_text(doc)

Constructs Japanese prompt from question
Adds multiple-choice options with A, B, C, D labels
Appends appropriate Japanese instructions:

 * Multiple-choice: "与えられた選択肢の中から最も適切な回答のアルファベットを直接記入してください。"
 * Open-ended: "質問に対する回答を単語や短いフレーズで記入してください。"

Handles image token replacement for interleaved format
Returns formatted prompt string
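
The prompt construction described above can be sketched as follows. This is a minimal illustration, not the module's actual code: build_prompt is a hypothetical helper, and the exact line layout (options as "A. ...", a blank line before the instruction) is assumed from the usage example on this page.

```python
import ast

MULTI_CHOICE_INSTRUCTION = "与えられた選択肢の中から最も適切な回答のアルファベットを直接記入してください。"
OPEN_ENDED_INSTRUCTION = "質問に対する回答を単語や短いフレーズで記入してください。"


def build_prompt(doc: dict) -> str:
    """Hypothetical helper: assemble a JMMMU-style prompt from a document."""
    question = doc["question"]
    if doc["question_type"] == "multiple-choice":
        # Options are stored as the string form of a Python list.
        options = ast.literal_eval(doc["options"])
        # Label options A, B, C, ... in order.
        lines = [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
        return f"{question}\n" + "\n".join(lines) + f"\n\n{MULTI_CHOICE_INSTRUCTION}"
    # Open-ended questions get only the short-answer instruction.
    return f"{question}\n\n{OPEN_ENDED_INSTRUCTION}"
```

ast.literal_eval is used here instead of eval so that the options string is parsed as a literal only.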

jmmmu_doc_to_visual(doc)

Extracts image tokens from prompt using regex
Identifies unique image tokens (e.g., <image 1>, <image 2>)
Loads corresponding images from document
Converts images to RGB format
Returns list of PIL Image objects
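
A sketch of the token-to-key step, assuming that a token such as <image 1> maps to the document key image_1 (as in the usage example on this page). image_keys_in_order is a hypothetical name, and keeping first-occurrence order is an assumption; the actual module may order the keys differently.

```python
import re


def image_keys_in_order(prompt: str) -> list:
    """Hypothetical helper: map '<image N>' tokens to 'image_N' document keys.

    Keys are returned in order of first appearance, without duplicates.
    """
    keys = []
    for number in re.findall(r"<image (\d+)>", prompt):
        key = f"image_{number}"
        if key not in keys:
            keys.append(key)
    return keys
```

With Pillow images stored under those keys, the visuals would then be something like [doc[key].convert("RGB") for key in image_keys_in_order(prompt)].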

Result Processing

jmmmu_process_results(doc, results)

Extracts prediction from results
Routes to appropriate parser based on question_type
For multiple-choice: Uses parse_multi_choice_response
For open-ended: Uses parse_open_response
Extracts subdomain from document ID
Returns dictionary with accuracy info and submission
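
The subdomain lookup can be illustrated with a small regex sketch. It assumes IDs follow the split_Subdomain_number pattern seen in the usage example (validation_Math_42); extract_subdomain is a hypothetical name, not necessarily the one used in the module.

```python
import re


def extract_subdomain(doc_id: str) -> str:
    """Hypothetical helper: pull the subdomain out of '<split>_<Subdomain>_<number>'."""
    # The greedy middle group keeps multi-word subdomains such as
    # 'Basic_Medical_Science' intact; the trailing '_<digits>' is dropped.
    match = re.match(r"^[^_]+_(.+)_\d+$", doc_id)
    if match is None:
        raise ValueError(f"Unexpected document id format: {doc_id!r}")
    return match.group(1)
```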

Answer Parsing

parse_multi_choice_response(response, all_choices, index2ans)

Handles multiple answer formats: (A), A, A., A between Japanese chars
Strips Japanese punctuation: 、。！？；：
Uses Japanese character pattern: r"[\u3040-\u30FF\u4E00-\u9FFF]"
Handles edge cases with multiple candidates
Selects last occurrence when multiple matches found
Falls back to random choice if no match found
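
The matching strategy above can be sketched in simplified form. This is an illustration under assumptions rather than the module's implementation: parse_choice is a hypothetical name, the optional rng argument is added only to make the random fallback reproducible, and matching against option text (index2ans) is omitted.

```python
import random
import re

JP_CHAR = r"[\u3040-\u30FF\u4E00-\u9FFF]"  # hiragana, katakana, kanji
JP_PUNCT = "、。！？；："


def parse_choice(response: str, all_choices: list, rng=None) -> str:
    """Hypothetical, simplified parser: pick the choice letter mentioned last."""
    rng = rng or random.Random()
    # Strip Japanese punctuation and spaces from both ends, then pad with
    # spaces so a letter at the very start or end still has a neighbour.
    padded = f" {response.strip(JP_PUNCT + ' ')} "
    last_pos = {}
    for choice in all_choices:
        c = re.escape(choice)
        patterns = [
            rf"\({c}\)",                         # (A)
            rf"(?<=\s){c}(?=[\s.])",             # A or A.
            rf"(?<={JP_CHAR}){c}(?={JP_CHAR})",  # A between Japanese characters
        ]
        for pattern in patterns:
            for m in re.finditer(pattern, padded):
                last_pos[choice] = max(last_pos.get(choice, -1), m.start())
    if not last_pos:
        # No candidate found: fall back to a random choice.
        return rng.choice(all_choices)
    # Several candidates: keep the one whose last occurrence is latest.
    return max(last_pos, key=last_pos.get)
```

Preferring the last occurrence reflects the common pattern where a model discusses several options before stating its final answer.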

parse_open_response(response)

Identifies key response sentences using Japanese indicators:

 * よって (therefore), 答えは (the answer is), 最終的に (finally)
 * 解答は (the solution is), 回答は (the response is)

Extracts numbers from response (handles thousands separators, scientific notation, decimals, and Japanese counters)
Normalizes extracted values
Returns list of possible answers
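
A minimal sketch of the key-sentence step, assuming sentences are split on 。 and newlines and that the text following an indicator is the candidate answer. key_fragments is a hypothetical helper, and the "shortest tail" rule is an assumption borrowed from the general MMMU approach rather than confirmed for this module.

```python
import re

INDICATORS = ["よって", "答えは", "最終的に", "解答は", "回答は"]


def key_fragments(response: str) -> list:
    """Hypothetical helper: keep the text that follows an answer indicator."""
    sentences = [s for s in re.split(r"[。\n]", response.strip()) if s.strip()]
    fragments = []
    for sentence in sentences:
        # Text after each indicator found in this sentence.
        tails = [sentence.split(ind)[-1].strip() for ind in INDICATORS if ind in sentence]
        if tails:
            # The shortest tail is the one closest to the stated answer.
            fragments.append(min(tails, key=len))
    # With no indicator anywhere, fall back to the whole response.
    return fragments or [response.strip()]
```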

Normalization

normalize_str(string)

Checks if string is a number
If number: converts to float, rounds to 2 decimals
If string: converts to lowercase
For single characters: returns the character padded with a space, so that a lone character does not trivially match inside longer text
Returns list of normalized forms
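
The normalization rules can be sketched as below. normalize_answer is a hypothetical stand-in for normalize_str; stripping thousands separators before the float conversion and returning both a left-padded and a right-padded form for single characters are assumptions based on the upstream MMMU logic.

```python
def normalize_answer(text: str) -> list:
    """Hypothetical sketch of the normalization rules described above."""
    text = text.strip()
    try:
        # Numbers: drop thousands separators, convert, round to 2 decimals.
        return [round(float(text.replace(",", "")), 2)]
    except ValueError:
        pass
    lowered = text.lower()
    if len(lowered) == 1:
        # Pad single characters so they only match as standalone tokens.
        return [" " + lowered, lowered + " "]
    return [lowered]
```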

Evaluation

eval_multi_choice(gold_i, pred_i)

Exact match comparison
Handles gold answers as list or string
Returns True only if exact match found

eval_open(gold_i, pred_i)

Normalizes gold and predicted answers
For strings: checks if any normalized gold answer appears in prediction
For numbers: checks exact match after normalization
Returns True if any match found
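
The two comparison rules can be illustrated as follows. The function names carry a _sketch suffix to mark them as hypothetical, and eval_open_sketch takes already-normalized lists as input, which is a simplification of the behaviour described above.

```python
def eval_multi_choice_sketch(gold, pred: str) -> bool:
    """Exact match against a single gold letter or a list of gold letters."""
    golds = gold if isinstance(gold, list) else [gold]
    return pred in golds


def eval_open_sketch(norm_golds: list, norm_preds: list) -> bool:
    """Substring match for string predictions, exact match for numbers."""
    for pred in norm_preds:
        if isinstance(pred, str):
            # A string prediction is correct if any gold string occurs in it.
            if any(isinstance(g, str) and g in pred for g in norm_golds):
                return True
        elif pred in norm_golds:
            # A numeric prediction must equal a gold value exactly.
            return True
    return False
```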

evaluate_jmmmu(samples)

Batch evaluation for list of samples
Routes to appropriate eval function based on question_type
Computes accuracy as correct / total
Returns judge_dict (per-sample) and metric_dict (accuracy)

Aggregation

jmmmu_aggregate_results(results)

Groups results by subdomain
Evaluates each subdomain separately
Computes domain-level accuracy across subdomains
Calculates weighted average based on sample counts
Returns overall accuracy
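
The grouping step can be sketched as below. To keep the example self-contained, each sample carries a precomputed boolean "correct" field, which is a hypothetical simplification; in the module, correctness comes from evaluating each subdomain's samples. The output keys acc and num_example are likewise assumed names.

```python
from collections import defaultdict


def per_subdomain_accuracy(samples: list) -> dict:
    """Hypothetical helper: group judged samples by subdomain and score each group."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["subdomain"]].append(bool(sample["correct"]))
    return {
        subdomain: {"num_example": len(flags), "acc": sum(flags) / len(flags)}
        for subdomain, flags in grouped.items()
    }
```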

calculate_ins_level_acc(results)

Computes instruction-level accuracy
Weighted by number of examples in each category
Returns accuracy across all categories
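
The weighting amounts to a sample-count-weighted mean of per-category accuracies. A minimal sketch, assuming each category result is a dict with acc and num_example keys (assumed names) and using the hypothetical function name weighted_accuracy:

```python
def weighted_accuracy(category_results: dict) -> float:
    """Hypothetical helper: average per-category accuracy weighted by sample count."""
    total = sum(r["num_example"] for r in category_results.values())
    if total == 0:
        # No samples at all: report zero rather than dividing by zero.
        return 0.0
    correct = sum(r["acc"] * r["num_example"] for r in category_results.values())
    return correct / total
```

For example, a category with accuracy 0.5 over 10 samples and one with accuracy 1.0 over 30 samples combine to (5 + 30) / 40 = 0.875, not the unweighted mean of 0.75.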

Domain Structure

The evaluation uses a hierarchical domain structure:

Art and Psychology

Design, Music, Psychology

Business

Accounting, Economics, Finance, Manage, Marketing

Science

Biology, Chemistry, Math, Physics

Health and Medicine

Basic_Medical_Science, Clinical_Medicine, Diagnostics_and_Laboratory_Medicine, Pharmacy, Public_Health

Tech and Engineering

Agriculture, Architecture_and_Engineering, Computer_Science, Electronics, Energy_and_Power, Materials, Mechanical_Engineering
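
The hierarchy can be represented as a plain mapping with a reverse lookup from subdomain to domain. The variable names below are illustrative and may not match the constants used in the module.

```python
# Domain -> subdomains, as listed in this section.
DOMAIN_TO_SUBDOMAINS = {
    "Art and Psychology": ["Design", "Music", "Psychology"],
    "Business": ["Accounting", "Economics", "Finance", "Manage", "Marketing"],
    "Science": ["Biology", "Chemistry", "Math", "Physics"],
    "Health and Medicine": [
        "Basic_Medical_Science", "Clinical_Medicine",
        "Diagnostics_and_Laboratory_Medicine", "Pharmacy", "Public_Health",
    ],
    "Tech and Engineering": [
        "Agriculture", "Architecture_and_Engineering", "Computer_Science",
        "Electronics", "Energy_and_Power", "Materials", "Mechanical_Engineering",
    ],
}

# Reverse lookup: subdomain -> domain.
SUBDOMAIN_TO_DOMAIN = {
    subdomain: domain
    for domain, subdomains in DOMAIN_TO_SUBDOMAINS.items()
    for subdomain in subdomains
}
```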

Usage Examples

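from lmms_eval.tasks.jmmmu.utils import (
    jmmmu_aggregate_results,
    jmmmu_doc_to_text,
    jmmmu_doc_to_visual,
    jmmmu_process_results,
    parse_multi_choice_response,
    parse_open_response,
)

# Example 1: Convert document to text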
doc = {
    "question": "<image 1>この画像に何が写っていますか？",
    "question_type": "multiple-choice",
    "options": "['犬', '猫', '鳥', '魚']"
}
prompt = jmmmu_doc_to_text(doc)
# Returns: "<image 1>この画像に何が写っていますか？\nA. 犬\nB. 猫\nC. 鳥\nD. 魚\n\n与えられた選択肢の中から..."

# Example 2: Extract visuals
doc = {
    "question": "<image 1>と<image 2>の違いは？",
    "image_1": PIL_Image_object_1,
    "image_2": PIL_Image_object_2
}
visuals = jmmmu_doc_to_visual(doc)
# Returns: [PIL_Image_object_1_RGB, PIL_Image_object_2_RGB]

# Example 3: Parse multiple choice response
response = "答えはAです。"
all_choices = ["A", "B", "C", "D"]
index2ans = {"A": "犬", "B": "猫", "C": "鳥", "D": "魚"}
parsed = parse_multi_choice_response(response, all_choices, index2ans)
# Returns: "A"

# Example 4: Parse open response
response = "計算すると、3 + 5 = 8です。よって答えは8になります。"
parsed = parse_open_response(response)
# Returns a list of normalized candidate answers, e.g. one containing 8.0
# (exact contents and order depend on the parser)

# Example 5: Process results
doc = {
    "id": "validation_Math_42",
    "question_type": "multiple-choice",
    "options": "['A', 'B', 'C', 'D']",
    "answer": "B"
}
results = ["答えはBです"]
processed = jmmmu_process_results(doc, results)
# Returns: {
#     "jmmmu_acc": {
#         "id": "validation_Math_42",
#         "subdomain": "Math",
#         "question_type": "multiple-choice",
#         "answer": "B",
#         "parsed_pred": "B"
#     },
#     "submission": {"validation_Math_42": "答えはBです"}
# }

# Example 6: Aggregate results
results = [
    {"subdomain": "Math", "question_type": "multiple-choice",
     "answer": "A", "parsed_pred": "A", "id": "1"},
    {"subdomain": "Math", "question_type": "open",
     "answer": "42", "parsed_pred": ["42", 42.0], "id": "2"}
]
overall_acc = jmmmu_aggregate_results(results)
# Returns: 1.0 (100% accuracy)

Japanese-Specific Features

Text Processing

Handles Japanese punctuation: 、。！？；：
Uses Unicode ranges for Japanese characters:

 * Hiragana: \u3040-\u309F
 * Katakana: \u30A0-\u30FF
 * Kanji: \u4E00-\u9FFF

Extracts numbers with Japanese counters: つ、個、度、円、人、年、匹、台、%
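
The Unicode ranges listed above can be combined into one character class. A small sketch, with contains_japanese as a hypothetical helper name:

```python
import re

HIRAGANA = r"\u3040-\u309F"
KATAKANA = r"\u30A0-\u30FF"
KANJI = r"\u4E00-\u9FFF"

# One character class covering all three ranges.
JAPANESE_CHAR = re.compile(f"[{HIRAGANA}{KATAKANA}{KANJI}]")


def contains_japanese(text: str) -> bool:
    """Hypothetical helper: True if the text has any kana or kanji character."""
    return JAPANESE_CHAR.search(text) is not None
```

Note that Japanese punctuation such as 、 and 。 lies outside these ranges, which is why it is handled separately.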

Answer Indicators

よって (therefore)
答えは (the answer is)
最終的に (finally)
解答は (the solution is)
回答は (the response is)

Number Extraction

Supports multiple number formats:

Commas: 1,234
Scientific notation: 1.5e10
Decimals: 3.14
Japanese counters: 5個、3つ、100円
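
One way to cover these formats is a single alternation that tries comma-grouped numbers first, then scientific notation, then plain integers and decimals. The pattern below is an illustrative sketch, not the module's regex; NUMBER_PATTERN and extract_numbers are assumed names. Counters such as 個 or 円 are simply left out of the match.

```python
import re

NUMBER_PATTERN = re.compile(
    r"-?\d{1,3}(?:,\d{3})+(?:\.\d+)?"   # comma-grouped: 1,234 or 1,234.5
    r"|-?\d+(?:\.\d+)?[eE][+-]?\d+"     # scientific notation: 1.5e10
    r"|-?\d+(?:\.\d+)?"                 # plain integer or decimal: 42, 3.14
)


def extract_numbers(text: str) -> list:
    """Hypothetical helper: return every number in the text as a float."""
    return [float(m.replace(",", "")) for m in NUMBER_PATTERN.findall(text)]
```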
