Implementation:EvolvingLMMs Lab Lmms eval CMMMU Utils
| Knowledge Sources | |
|---|---|
| Domains | Computer Vision, Multimodal Reasoning, Chinese Language Processing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Utility functions for evaluating models on the Chinese MMMU (CMMMU) benchmark, which comprises multiple-choice, true/false, and fill-in-blank questions.
Description
This module provides the evaluation infrastructure for CMMMU (Chinese Massive Multi-discipline Multimodal Understanding), a Chinese-language benchmark that assesses multimodal reasoning across six academic domains: 艺术与设计 (Art and Design), 商业 (Business), 科学 (Science), 健康与医学 (Health and Medicine), 人文社会科学 (Humanities and Social Sciences), and 技术与工程 (Technology and Engineering). It includes functions that convert documents into model inputs with Chinese prompts and image placeholders, parse model responses for the three question types (multiple choice, true/false, fill-in-blank), and aggregate results by domain. Answer extraction covers number parsing with Chinese formatting, keyword-based true/false detection, and multi-answer matching for fill-in-blank questions.
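The number-parsing step for fill-in-blank answers can be pictured with a small sketch. The helper names (extract_numbers_sketch, normalize_number_sketch) are hypothetical and the regular expression is an assumption, not the module's actual pattern; the sketch only illustrates pulling numeric candidates out of a free-form Chinese response, dropping thousands separators, and rounding before comparison.

```python
import re

# Hypothetical pattern: digits with optional thousands separators
# (ASCII or full-width commas) and an optional decimal part.
_NUMBER_RE = re.compile(r"-?\d{1,3}(?:[,,]\d{3})+(?:\.\d+)?|-?\d+(?:\.\d+)?")


def extract_numbers_sketch(text):
    """Return the numeric substrings found in text, in order of appearance."""
    return _NUMBER_RE.findall(text)


def normalize_number_sketch(token, ndigits=2):
    """Strip thousands separators and round to ndigits decimal places."""
    cleaned = token.replace(",", "").replace(",", "")
    return round(float(cleaned), ndigits)


response = "π的值大约是3.14159,通常我们取π=3.14"
candidates = [normalize_number_sketch(t) for t in extract_numbers_sketch(response)]
# De-duplicate while preserving the order of first appearance.
unique_candidates = list(dict.fromkeys(candidates))
```

Rounding both candidates to two decimals collapses 3.14159 and 3.14 into a single value, which is one plausible way a normalized prediction list could be matched against a ground-truth answer such as "3.14".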
Usage
Use this module when evaluating multimodal models on CMMMU benchmark tasks. The workflow has four steps: (1) call cmmmu_doc_to_text and cmmmu_doc_to_visual to prepare the Chinese prompt and its images, (2) collect the model response, (3) call cmmmu_process_results to parse the response and pair it with the ground-truth answer, and (4) call cmmmu_aggregate_results to compute per-domain and overall accuracy. For test-set submissions, use cmmmu_process_test_results_for_submission and cmmmu_test_aggregate_results_for_submission to generate the submission file.
Code Reference
Source Location
- Repository: EvolvingLMMs_Lab_Lmms_eval
- File: lmms_eval/tasks/cmmmu/utils.py
- Lines: 1-420
Signature
# Document conversion functions
def cmmmu_doc_to_text(doc):
    """Convert document to Chinese text prompt based on question type."""
def cmmmu_doc_to_visual(doc):
    """Extract images referenced in the prompt."""

# Result processing functions
def cmmmu_process_results(doc, results):
    """Process and evaluate single result against ground truth."""
def cmmmu_aggregate_results(results):
    """Aggregate results by domain and subdomain, return overall accuracy."""

# Test submission functions
def cmmmu_process_test_results_for_submission(doc, results):
    """Format test results for submission."""
def cmmmu_test_aggregate_results_for_submission(results, args):
    """Save test results to submission file."""

# Helper functions
def construct_prompt(sample):
    """Build Chinese prompt with question and options."""
def get_multi_choice_prediction(response, all_choices, index2ans):
    """Extract multiple choice answer from response."""
def get_fill_blank_prediction(response, answer):
    """Extract fill-in-blank answers from response."""
def get_TF_prediction(response):
    """Extract true/false answer from response."""
def eval_cmmmu(entries):
    """Evaluate list of entries and compute accuracy."""
Import
from lmms_eval.tasks.cmmmu.utils import (
    cmmmu_doc_to_text,
    cmmmu_doc_to_visual,
    cmmmu_process_results,
    cmmmu_aggregate_results,
    get_multi_choice_prediction,
    get_fill_blank_prediction,
    get_TF_prediction,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| doc | dict | Yes | Document with id, question, type (选择 multiple choice / 判断 true-false / 填空 fill-in-blank), option1 to option4, answer, subcategory, and image fields (image_1, image_1_filename, ...) |
| results | list | Yes | List with single prediction string from model |
| response | str | Yes | Raw model response for parsing |
| all_choices | list | Yes | List of valid choice labels (e.g., ['A', 'B', 'C', 'D']) |
| index2ans | dict | Yes | Mapping from choice labels to answer text |
| answer | str | Yes | Ground truth answer for comparison |
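To show how the doc fields listed above could feed a prompt, here is a minimal sketch of multiple-choice prompt formatting. build_choice_prompt_sketch and its instruction default are hypothetical; the layout (question, lettered options, trailing answer cue) is modeled on the sample output shown in the usage examples on this page, not copied from construct_prompt, and image placeholder handling is left out.

```python
def build_choice_prompt_sketch(sample, instruction="请回答以下多项选择题。"):
    """Format instruction, question, and lettered options into one prompt."""
    option_lines = []
    for i in range(1, 5):
        label = chr(ord("A") + i - 1)  # 1 -> A, 2 -> B, 3 -> C, 4 -> D
        key = f"option{i}"
        option_lines.append(f"({label}) {sample[key]}")
    options = "\n".join(option_lines)
    question = sample["question"]
    return f"{instruction}\n问题:{question}\n选项:\n{options}\n正确答案:\n"


sample = {
    "type": "选择",
    "question": "下图显示的是什么物体?",
    "option1": "苹果",
    "option2": "橙子",
    "option3": "香蕉",
    "option4": "葡萄",
}
prompt = build_choice_prompt_sketch(sample)
```

Mapping option1 through option4 onto the letters A through D is what lets a ground-truth answer such as "A" be compared with a letter parsed from the model response.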
Outputs
| Name | Type | Description |
|---|---|---|
| doc_to_text return | str | Chinese prompt with question and options |
| doc_to_visual return | list | List of PIL Images in RGB format |
| process_results return | dict | Dict with single key "cmmmu_acc" containing evaluation data |
| aggregate_results return | float | Overall accuracy (0.0-1.0) |
| get_multi_choice_prediction return | str | Predicted choice(s) (e.g., "A", "AB", "ACD") |
| get_fill_blank_prediction return | list | List of normalized predicted answers |
| get_TF_prediction return | list | List of key phrases from response |
| eval_cmmmu return | dict | Dict with correct_num, entries_num, acc |
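The return shapes of the two simpler parsers can be illustrated with a deliberately naive sketch. Both function names and the keyword list are hypothetical, and the real parsers are likely more careful (for instance, this version treats any capital letter that is a valid label as a chosen option). The sketch only shows a string of choice letters for multiple choice and a list of key phrases for true/false.

```python
import re


def parse_choice_sketch(response, all_choices):
    """Return the valid choice letters mentioned in response, e.g. "A" or "AB"."""
    found = []
    for letter in re.findall(r"[A-Z]", response):
        if letter in all_choices and letter not in found:
            found.append(letter)
    return "".join(sorted(found))


# Hypothetical keyword list. Longer negative phrases come first so that
# "不正确" is consumed whole instead of being matched as "正确".
_TF_KEYWORDS = ["不正确", "错误", "不对", "正确", "对", "错"]
_TF_RE = re.compile("|".join(_TF_KEYWORDS))


def tf_key_phrases_sketch(response):
    """Return the true/false key phrases found in response, in order."""
    return _TF_RE.findall(response)


choice_pred = parse_choice_sketch("答案是A,这是一个苹果", ["A", "B", "C", "D"])
multi_pred = parse_choice_sketch("正确选项为B和A", ["A", "B", "C", "D"])
tf_pred = tf_key_phrases_sketch("这个说法是错误的,地球是圆的")
```

A downstream comparison would then decide whether the collected phrases agree with a ground-truth label such as "错"; that matching rule is not shown here.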
Usage Examples
import PIL.Image  # Pillow; needed for PIL.Image.open below

# Example 1: Multiple choice question
doc = {
    "id": "cmmmu_001",
    "type": "选择",
    "question": "下图显示的是什么物体?",
    "option1": "苹果",
    "option2": "橙子",
    "option3": "香蕉",
    "option4": "葡萄",
    "answer": "A",
    "subcategory": "生物",
    "image_1": PIL.Image.open("image1.jpg"),
    "image_1_filename": "image1.jpg",
}
# Convert to prompt
prompt = cmmmu_doc_to_text(doc)
# Returns: "请回答以下多项选择题...\n问题:下图显示的是什么物体?\n选项:\n(A) 苹果\n(B) 橙子\n(C) 香蕉\n(D) 葡萄\n正确答案:\n"
# Get images
images = cmmmu_doc_to_visual(doc)
# Returns: [<PIL.Image.Image>]
# Process results
results = ["答案是A,这是一个苹果"]
processed = cmmmu_process_results(doc, results)
# Returns: {"cmmmu_acc": {"id": "cmmmu_001", "subdomain": "生物",
# "question_type": "选择", "answer": "A", "parsed_pred": "A"}}
# Example 2: Fill-in-blank question
doc_fill = {
    "id": "cmmmu_002",
    "type": "填空",
    "question": "圆周率π约等于多少?",
    "answer": "3.14",
    "subcategory": "数学",
    "image_1": None,
}
results = ["π的值大约是3.14159,通常我们取π=3.14"]
processed = cmmmu_process_results(doc_fill, results)
# parsed_pred is a list of normalized candidate answers; after extraction
# and normalization it is expected to contain 3.14
# Example 3: True/False question
doc_tf = {
    "id": "cmmmu_003",
    "type": "判断",
    "question": "地球是平的,这个说法是否正确?",
    "answer": "错",
    "subcategory": "地理",
}
results = ["这个说法是错误的,地球是圆的"]
processed = cmmmu_process_results(doc_tf, results)
# Will detect "错误" keyword and match against answer "错"
# Example 4: Aggregate results
all_results = [
    {"subdomain": "生物", "question_type": "选择", "answer": "A", "parsed_pred": "A"},
    {"subdomain": "数学", "question_type": "填空", "answer": "3.14", "parsed_pred": [3.14]},
    {"subdomain": "地理", "question_type": "判断", "answer": "错", "parsed_pred": ["错误"]},
]
accuracy = cmmmu_aggregate_results(all_results)
# Returns overall accuracy and prints breakdown by domain
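The grouping behind the aggregation step can be sketched as follows. accuracy_summary_sketch, aggregate_by_subdomain_sketch, and exact_match are hypothetical names, and the correctness rule is passed in by the caller because the real per-question-type matching is not reproduced here. The summary dict mirrors the correct_num, entries_num, and acc keys documented for eval_cmmmu.

```python
from collections import defaultdict


def accuracy_summary_sketch(entries, is_correct):
    """Return a dict with correct_num, entries_num, and acc for entries."""
    correct_num = sum(1 for entry in entries if is_correct(entry))
    entries_num = len(entries)
    acc = correct_num / entries_num if entries_num else 0.0
    return {"correct_num": correct_num, "entries_num": entries_num, "acc": acc}


def aggregate_by_subdomain_sketch(entries, is_correct):
    """Group entries by subdomain, then compute per-group and overall accuracy."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["subdomain"]].append(entry)
    per_subdomain = {
        name: accuracy_summary_sketch(group, is_correct)
        for name, group in grouped.items()
    }
    # Instance-level overall accuracy: every question counts equally.
    total_correct = sum(s["correct_num"] for s in per_subdomain.values())
    total = sum(s["entries_num"] for s in per_subdomain.values())
    overall = total_correct / total if total else 0.0
    return per_subdomain, overall


def exact_match(entry):
    # Hypothetical rule that only suits multiple-choice entries.
    return entry["parsed_pred"] == entry["answer"]


entries = [
    {"subdomain": "生物", "answer": "A", "parsed_pred": "A"},
    {"subdomain": "生物", "answer": "B", "parsed_pred": "C"},
    {"subdomain": "数学", "answer": "D", "parsed_pred": "D"},
]
per_subdomain, overall = aggregate_by_subdomain_sketch(entries, exact_match)
```

Summing counts across groups, rather than averaging the per-group accuracies, keeps large subdomains from being under-weighted; whether the module weights results this way is an assumption of the sketch.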