Implementation:EvolvingLMMs Lab Lmms eval JMMMU Utils
| Knowledge Sources | |
|---|---|
| Domains | Multimodal_Learning, Vision_Language_Models, Japanese_NLP, Model_Evaluation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Task utilities for evaluating multimodal models on the Japanese MMMU (Massive Multi-discipline Multimodal Understanding) benchmark.
Description
This module provides utilities for processing, evaluating, and aggregating results from the Japanese MMMU dataset. It handles both multiple-choice and open-ended questions with Japanese text, supports image-text interleaving, and aggregates accuracy per domain across Art and Psychology, Business, Science, Health and Medicine, and Tech and Engineering. The evaluation logic is adapted from the official MMMU repository, with Japanese-specific modifications for answer parsing and validation.
Usage
Use this module when evaluating multimodal models on the Japanese MMMU benchmark. It provides document-to-text conversion, visual extraction, answer parsing for both question types, and hierarchical accuracy computation across domains and subdomains.
Code Reference
Source Location
- Repository: EvolvingLMMs_Lab_Lmms_eval
- File: lmms_eval/tasks/jmmmu/utils.py
Signature
def jmmmu_doc_to_text(doc: dict) -> str:
    """Convert document to text prompt with Japanese instructions."""
    ...

def jmmmu_doc_to_visual(doc: dict) -> list:
    """Extract visual content from document."""
    ...

def jmmmu_process_results(doc: dict, results: list) -> dict:
    """Process model predictions and prepare for evaluation."""
    ...

def jmmmu_aggregate_results(results: list) -> float:
    """Aggregate results across domains and compute overall accuracy."""
    ...

# Helper functions
def parse_multi_choice_response(response: str, all_choices: list, index2ans: dict) -> str:
    """Parse multiple choice response with Japanese text handling."""
    ...

def parse_open_response(response: str) -> list:
    """Parse open-ended response and extract answers."""
    ...

def normalize_str(string: str) -> list:
    """Normalize string to handle numbers and text."""
    ...
Import
from lmms_eval.tasks.jmmmu.utils import (
    jmmmu_doc_to_text,
    jmmmu_doc_to_visual,
    jmmmu_process_results,
    jmmmu_aggregate_results,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| doc | dict | Yes | Document containing question, options, images, answer, question_type |
| results | list | Yes | Model predictions (for jmmmu_process_results) |
| results | list[dict] | Yes | List of evaluation samples (for jmmmu_aggregate_results) |
Outputs
| Name | Type | Description |
|---|---|---|
| prompt | str | Japanese prompt with question and instructions |
| visual | list | List of PIL Image objects |
| processed | dict | Dictionary with jmmmu_acc and submission keys |
| accuracy | float | Overall accuracy score (0.0 to 1.0) |
Core Functions
Document Processing
jmmmu_doc_to_text(doc)
- Constructs Japanese prompt from question
- Adds multiple-choice options with A, B, C, D labels
- Appends appropriate Japanese instructions:
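  * Multiple-choice: "与えられた選択肢の中から最も適切な回答のアルファベットを直接記入してください。" (roughly: "Directly enter the letter of the most appropriate answer from the given choices.")
  * Open-ended: "質問に対する回答を単語や短いフレーズで記入してください。" (roughly: "Answer the question with a word or short phrase.")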
- Handles image token replacement for interleaved format
- Returns formatted prompt string
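The prompt construction described above can be sketched as follows. This is a minimal illustration, not the module's actual code: `sketch_build_prompt` is a hypothetical name, and the sketch assumes that `options` is stored as a stringified Python list (as in the usage examples on this page) and that options are joined as `A. ...` lines. Image token replacement for the interleaved format is left out.

```python
import ast

# Instruction strings quoted from the description above.
MULTI_CHOICE_PROMPT = "与えられた選択肢の中から最も適切な回答のアルファベットを直接記入してください。"
OPEN_ENDED_PROMPT = "質問に対する回答を単語や短いフレーズで記入してください。"


def sketch_build_prompt(doc: dict) -> str:
    """Hypothetical helper: build a Japanese prompt from a JMMMU-style document."""
    question = doc["question"]
    if doc["question_type"] == "multiple-choice":
        # Assumption: options arrive as a string such as "['犬', '猫']".
        options = ast.literal_eval(doc["options"])
        letters = [chr(ord("A") + i) for i in range(len(options))]
        choices = "\n".join(f"{letter}. {option}" for letter, option in zip(letters, options))
        return f"{question}\n{choices}\n\n{MULTI_CHOICE_PROMPT}"
    # Anything that is not multiple-choice is treated as open-ended here.
    return f"{question}\n\n{OPEN_ENDED_PROMPT}"
```

Generating the letters from the option count rather than hard-coding A to D keeps the sketch usable for questions with more or fewer than four options.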
jmmmu_doc_to_visual(doc)
- Extracts image tokens from prompt using regex
- Identifies unique image tokens (e.g., <image 1>, <image 2>)
- Loads corresponding images from document
- Converts images to RGB format
- Returns list of PIL Image objects
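A rough sketch of the token-to-image lookup described above. The function names are hypothetical, and the sketch assumes that a token such as `<image 1>` maps to the document key `image_1` (as the usage examples on this page suggest) and that each image object exposes a PIL-style `convert` method.

```python
import re


def sketch_image_keys(prompt: str) -> list:
    """Hypothetical helper: map unique <image N> tokens to document keys."""
    tokens = re.findall(r"<image \d+>", prompt)
    # "<image 1>" -> "image_1"; the set removes repeated tokens.
    return sorted({token.strip("<>").replace(" ", "_") for token in tokens})


def sketch_doc_to_visual(doc: dict) -> list:
    """Hypothetical helper: load each referenced image and convert it to RGB."""
    return [doc[key].convert("RGB") for key in sketch_image_keys(doc["question"])]
```

Note that `sorted` orders keys lexicographically, so `image_10` would sort before `image_2`; that is harmless for documents with fewer than ten images.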
Result Processing
jmmmu_process_results(doc, results)
- Extracts prediction from results
- Routes to appropriate parser based on question_type
- For multiple-choice: Uses parse_multi_choice_response
- For open-ended: Uses parse_open_response
- Extracts subdomain from document ID
- Returns dictionary with accuracy info and submission
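One way the subdomain could be pulled out of a document ID is shown below. This is a sketch under the assumption that IDs follow the `<split>_<Subdomain>_<number>` shape seen in the usage examples (for example `validation_Math_42`); `sketch_subdomain_from_id` is a hypothetical name.

```python
import re


def sketch_subdomain_from_id(doc_id: str) -> str:
    """Hypothetical helper: 'validation_Math_42' -> 'Math'."""
    # The split name has no underscore; the subdomain itself may contain several.
    match = re.match(r"^[^_]+_(.+)_\d+$", doc_id)
    if match is None:
        raise ValueError(f"Unexpected document id format: {doc_id!r}")
    return match.group(1)
```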
Answer Parsing
parse_multi_choice_response(response, all_choices, index2ans)
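- Recognizes several answer formats: a bracketed letter such as (A), a bare letter (A), a letter followed by a period (A.), and a letter placed between Japanese characters (for example, 答えはAです)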
- Strips Japanese punctuation: 、。!?;:
- Uses Japanese character pattern: r"[\u3040-\u30FF\u4E00-\u9FFF]"
- Handles edge cases with multiple candidates
- Selects last occurrence when multiple matches found
- Falls back to random choice if no match found
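The parsing steps above can be approximated as follows. This is a simplified sketch rather than the module's implementation: `sketch_parse_choice` and its `rng` parameter are hypothetical, the tiers (bracketed letter, bare letter, option text, random fallback) are an assumption about the ordering, and only the Unicode ranges are taken from the description.

```python
import random
import re

# Hiragana/katakana and kanji ranges quoted in the description above.
JP_RANGE = r"\u3040-\u30FF\u4E00-\u9FFF"


def sketch_parse_choice(response, all_choices, index2ans, rng=random):
    """Hypothetical helper: pick the choice letter a response most likely means."""
    text = response.strip("、。!?;: \n")
    letters = "".join(all_choices)
    found = []  # (position, letter) pairs
    # Tier 1: bracketed letters such as "(A)".
    for m in re.finditer(rf"\(([{letters}])\)", text):
        found.append((m.start(1), m.group(1)))
    if not found:
        # Tier 2: a bare letter bounded by the string edge, whitespace,
        # a period, or a Japanese character.
        bare = rf"(?:^|(?<=[\s{JP_RANGE}]))([{letters}])(?=$|[\s.{JP_RANGE}])"
        for m in re.finditer(bare, text):
            found.append((m.start(1), m.group(1)))
    if not found:
        # Tier 3: the option text itself appears in the response.
        for letter, answer in index2ans.items():
            pos = text.rfind(answer)
            if pos >= 0:
                found.append((pos, letter))
    if not found:
        # Nothing matched: fall back to a random choice.
        return rng.choice(all_choices)
    # Several candidates: keep the one that occurs last.
    return max(found)[1]
```

Passing a seeded `random.Random` instance as `rng` makes the fallback reproducible in tests.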
parse_open_response(response)
- Identifies key response sentences using Japanese indicators:
  * よって (therefore), 答えは (the answer is), 最終的に (finally)
  * 解答は (the solution is), 回答は (the response is)
- Extracts numbers from response (handles scientific notation, decimals, Japanese counters)
- Normalizes extracted values
- Returns list of possible answers
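The key-sentence step can be illustrated with the sketch below. It is an assumption-laden simplification: `sketch_key_responses` is a hypothetical name, sentences are split only on 。 and newlines, and when several indicators occur in one sentence the shortest trailing fragment is kept. Number extraction and normalization are not shown here.

```python
import re

# Answer indicators listed in the description above.
INDICATORS = ["よって", "答えは", "最終的に", "解答は", "回答は"]


def sketch_key_responses(response: str) -> list:
    """Hypothetical helper: keep the text that follows an answer indicator."""
    sentences = [s.strip() for s in re.split(r"[。\n]", response) if s.strip()]
    keys = []
    for sentence in sentences:
        best = None
        for indicator in INDICATORS:
            if indicator in sentence:
                # Text after the last occurrence of this indicator.
                tail = sentence.split(indicator)[-1].strip(" 、")
                if best is None or len(tail) < len(best):
                    best = tail
        if best:
            keys.append(best)
    # Without any indicator, fall back to the whole response.
    return keys or [response]
```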
Normalization
normalize_str(string)
- Checks if string is a number
- If number: converts to float, rounds to 2 decimals
- If string: converts to lowercase
- For single characters: returns the character padded with a space (leading and trailing variants) so that a lone letter does not trivially match inside longer text
- Returns list of normalized forms
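A minimal sketch of the normalization rules listed above. `sketch_normalize` is a hypothetical name, and two details are assumptions: thousands separators are removed before the numeric check, and a single character yields two padded variants.

```python
def sketch_normalize(value: str) -> list:
    """Hypothetical helper: return the normalized forms of an answer string."""
    text = value.strip()
    try:
        # Assumption: commas are thousands separators, e.g. "1,234".
        number = float(text.replace(",", ""))
    except ValueError:
        number = None
    if number is not None:
        return [round(number, 2)]
    text = text.lower()
    if len(text) == 1:
        # Pad single characters so they only match as standalone tokens.
        return [" " + text, text + " "]
    return [text]
```

Note that `float` also accepts strings such as "nan" and "inf"; a stricter numeric check may be preferable in practice.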
Evaluation
eval_multi_choice(gold_i, pred_i)
- Exact match comparison
- Handles gold answers as list or string
- Returns True only if exact match found
eval_open(gold_i, pred_i)
- Normalizes gold and predicted answers
- For strings: checks if any normalized gold answer appears in prediction
- For numbers: checks exact match after normalization
- Returns True if any match found
evaluate_jmmmu(samples)
- Batch evaluation for list of samples
- Routes to appropriate eval function based on question_type
- Computes accuracy as correct / total
- Returns judge_dict (per-sample) and metric_dict (accuracy)
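The evaluation flow above might look roughly like this. All names are hypothetical, the "Correct"/"Wrong" labels and the `acc` key are assumptions, and for brevity the sketch expects open-ended gold answers and predictions to be already normalized lists.

```python
def sketch_eval_multi_choice(gold, pred) -> bool:
    """Exact match; the gold answer may be a single letter or a list of letters."""
    golds = gold if isinstance(gold, list) else [gold]
    return pred in golds


def sketch_eval_open(norm_golds: list, norm_preds: list) -> bool:
    """Substring match for strings, exact match for numbers."""
    for pred in norm_preds:
        if isinstance(pred, str):
            if any(isinstance(gold, str) and gold in pred for gold in norm_golds):
                return True
        elif pred in norm_golds:
            return True
    return False


def sketch_evaluate(samples: list):
    """Return per-sample judgements and the accuracy (correct / total)."""
    judge = {}
    correct = 0
    for sample in samples:
        if sample["question_type"] == "multiple-choice":
            ok = sketch_eval_multi_choice(sample["answer"], sample["parsed_pred"])
        else:
            ok = sketch_eval_open(sample["answer"], sample["parsed_pred"])
        judge[sample["id"]] = "Correct" if ok else "Wrong"
        correct += ok
    accuracy = correct / len(samples) if samples else 0.0
    return judge, {"acc": accuracy}
```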
Aggregation
jmmmu_aggregate_results(results)
- Groups results by subdomain
- Evaluates each subdomain separately
- Computes domain-level accuracy across subdomains
- Calculates weighted average based on sample counts
- Returns overall accuracy
calculate_ins_level_acc(results)
- Computes instruction-level accuracy
- Weighted by number of examples in each category
- Returns accuracy across all categories
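The sample-count weighting amounts to a weighted mean, sketched below. The function name and the `acc` / `num_example` keys are assumptions made for illustration.

```python
def sketch_weighted_acc(per_category: dict) -> float:
    """Hypothetical helper: accuracy weighted by the number of examples per category."""
    total = sum(result["num_example"] for result in per_category.values())
    if total == 0:
        return 0.0
    weighted = sum(result["acc"] * result["num_example"] for result in per_category.values())
    return weighted / total
```

For example, a category with accuracy 0.5 over 10 samples and another with accuracy 1.0 over 30 samples combine to (5 + 30) / 40 = 0.875, not the unweighted mean of 0.75.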
Domain Structure
The evaluation uses a hierarchical domain structure:
Art and Psychology
- Design, Music, Psychology
Business
- Accounting, Economics, Finance, Manage, Marketing
Science
- Biology, Chemistry, Math, Physics
Health and Medicine
- Basic_Medical_Science, Clinical_Medicine
- Diagnostics_and_Laboratory_Medicine
- Pharmacy, Public_Health
Tech and Engineering
- Agriculture, Architecture_and_Engineering
- Computer_Science, Electronics
- Energy_and_Power, Materials
- Mechanical_Engineering
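The hierarchy above can be written down as a plain mapping, which also gives a reverse lookup from subdomain to domain. The variable and function names are hypothetical; the entries are copied from the list above.

```python
DOMAIN_TO_SUBDOMAINS = {
    "Art and Psychology": ["Design", "Music", "Psychology"],
    "Business": ["Accounting", "Economics", "Finance", "Manage", "Marketing"],
    "Science": ["Biology", "Chemistry", "Math", "Physics"],
    "Health and Medicine": [
        "Basic_Medical_Science", "Clinical_Medicine",
        "Diagnostics_and_Laboratory_Medicine", "Pharmacy", "Public_Health",
    ],
    "Tech and Engineering": [
        "Agriculture", "Architecture_and_Engineering", "Computer_Science",
        "Electronics", "Energy_and_Power", "Materials", "Mechanical_Engineering",
    ],
}

# Reverse index: subdomain name -> parent domain.
SUBDOMAIN_TO_DOMAIN = {
    subdomain: domain
    for domain, subdomains in DOMAIN_TO_SUBDOMAINS.items()
    for subdomain in subdomains
}


def sketch_domain_of(subdomain: str) -> str:
    """Hypothetical helper: look up the domain a subdomain belongs to."""
    return SUBDOMAIN_TO_DOMAIN[subdomain]
```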
Usage Examples
# Example 1: Convert document to text
doc = {
    "question": "<image 1>この画像に何が写っていますか?",
    "question_type": "multiple-choice",
    "options": "['犬', '猫', '鳥', '魚']",
}
prompt = jmmmu_doc_to_text(doc)
# Returns: "<image 1>この画像に何が写っていますか?\nA. 犬\nB. 猫\nC. 鳥\nD. 魚\n\n与えられた選択肢の中から..."
# Example 2: Extract visuals
doc = {
    "question": "<image 1>と<image 2>の違いは?",
    "image_1": PIL_Image_object_1,
    "image_2": PIL_Image_object_2,
}
visuals = jmmmu_doc_to_visual(doc)
# Returns: [PIL_Image_object_1_RGB, PIL_Image_object_2_RGB]
# Example 3: Parse multiple choice response
response = "答えはAです。"
all_choices = ["A", "B", "C", "D"]
index2ans = {"A": "犬", "B": "猫", "C": "鳥", "D": "魚"}
parsed = parse_multi_choice_response(response, all_choices, index2ans)
# Returns: "A"
# Example 4: Parse open response
response = "計算すると、3 + 5 = 8です。よって答えは8になります。"
parsed = parse_open_response(response)
# Returns a list of candidate answers (normalized key-response text and extracted numbers), e.g. a list containing 8.0
# Example 5: Process results
doc = {
    "id": "validation_Math_42",
    "question_type": "multiple-choice",
    "options": "['A', 'B', 'C', 'D']",
    "answer": "B",
}
results = ["答えはBです"]
processed = jmmmu_process_results(doc, results)
# Returns: {
#     "jmmmu_acc": {
#         "id": "validation_Math_42",
#         "subdomain": "Math",
#         "question_type": "multiple-choice",
#         "answer": "B",
#         "parsed_pred": "B"
#     },
#     "submission": {"validation_Math_42": "答えはBです"}
# }
# Example 6: Aggregate results
results = [
    {"subdomain": "Math", "question_type": "multiple-choice",
     "answer": "A", "parsed_pred": "A", "id": "1"},
    {"subdomain": "Math", "question_type": "open",
     "answer": "42", "parsed_pred": ["42", 42.0], "id": "2"},
]
overall_acc = jmmmu_aggregate_results(results)
# Returns: 1.0 (100% accuracy)
Japanese-Specific Features
Text Processing
- Handles Japanese punctuation: 、。!?;:
- Uses Unicode ranges for Japanese characters:
  * Hiragana: \u3040-\u309F
  * Katakana: \u30A0-\u30FF
  * Kanji: \u4E00-\u9FFF
- Extracts numbers with Japanese counters: つ、個、度、円、人、年、匹、台、%
Answer Indicators
- よって / よって、 (therefore)
- 答えは / 答えは、 (the answer is)
- 最終的に / 最終的に、 (finally)
- 解答は / 解答は、 (the solution is)
- 回答は / 回答は、 (the response is)
Number Extraction
Supports multiple number formats:
- Commas: 1,234
- Scientific notation: 1.5e10
- Decimals: 3.14
- Japanese counters: 5個、3つ、100円
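The formats above can be covered by a single regular expression, as in the sketch below. This is an illustrative pattern, not the one used by the module: the names are hypothetical, sign handling is omitted, and Japanese counters are not matched explicitly because the pattern simply stops before them.

```python
import re

NUMBER_RE = re.compile(
    r"\d{1,3}(?:,\d{3})+(?:\.\d+)?"   # comma-grouped: 1,234
    r"|\d+(?:\.\d+)?[eE][+-]?\d+"     # scientific notation: 1.5e10
    r"|\d+(?:\.\d+)?"                 # plain integers and decimals: 3.14
)


def sketch_extract_numbers(text: str) -> list:
    """Hypothetical helper: return every number found in the text as a float."""
    values = []
    for match in NUMBER_RE.finditer(text):
        # Drop thousands separators before converting.
        values.append(float(match.group().replace(",", "")))
    return values
```

The order of the alternatives matters: the comma-grouped and scientific forms are tried first so that `1,234` is not read as `1` and `234`, and `1.5e10` is not cut off at `1.5`.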