Implementation:EvolvingLMMs Lab Lmms eval CMMMU Utils

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Computer Vision, Multimodal Reasoning, Chinese Language Processing
Last Updated	2026-02-14 00:00 GMT

Overview

Utility functions for evaluating models on the Chinese MMMU (CMMMU) benchmark with multi-choice, true/false, and fill-in-blank questions.

Description

This module provides the evaluation infrastructure for CMMMU (Chinese Massive Multi-discipline Multimodal Understanding), a Chinese-language benchmark that assesses multimodal reasoning across academic disciplines. It includes functions that convert documents into model inputs with Chinese prompts and image placeholders, parse model responses for three question types (multiple choice, true/false, fill-in-blank), and aggregate results across six academic domains: 艺术与设计 (Art and Design), 商业 (Business), 科学 (Science), 健康与医学 (Health and Medicine), 人文社会科学 (Humanities and Social Sciences), and 技术与工程 (Technology and Engineering). Answer extraction covers number parsing with Chinese formatting, keyword-based true/false detection, and multi-answer matching for fill-in-blank questions.
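
The number parsing and multi-answer matching mentioned above can be illustrated with a minimal sketch. The helper names extract_numbers_sketch and matches_any_answer are hypothetical and are not part of the module. The sketch also assumes that "Chinese formatting" refers to full-width digits and punctuation, which the module itself may handle differently.

```python
import re
import unicodedata


def extract_numbers_sketch(text):
    """Return the numbers found in text as floats (illustrative only)."""
    # NFKC folds full-width digits and punctuation (e.g. "３，０００") to ASCII.
    text = unicodedata.normalize("NFKC", text)
    # Try numbers with thousands separators first, then plain decimals.
    pattern = r"-?\d{1,3}(?:,\d{3})+(?:\.\d+)?|-?\d+(?:\.\d+)?"
    return [float(m.replace(",", "")) for m in re.findall(pattern, text)]


def matches_any_answer(predictions, answers, tol=1e-9):
    """True if any predicted number equals any accepted answer."""
    return any(abs(p - a) <= tol for p in predictions for a in answers)
```

Because every number in the response becomes a candidate, a response that mentions both an intermediate value and the final value can still match the ground truth.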

Usage

Use this module when evaluating multimodal models on CMMMU benchmark tasks. The workflow has four steps: (1) call cmmmu_doc_to_text and cmmmu_doc_to_visual to prepare the Chinese prompt and images, (2) collect the model response, (3) call cmmmu_process_results to parse and evaluate the answer, and (4) call cmmmu_aggregate_results to compute accuracy per domain and overall. For test set submissions, use cmmmu_process_test_results_for_submission and cmmmu_test_aggregate_results_for_submission to generate the submission file.
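
The four steps can be sketched as a small driver loop. In practice the lmms-eval harness makes these calls itself; run_cmmmu_sketch and model_fn below are hypothetical names used only for illustration, and the four task functions are passed in as parameters so the sketch runs without the library installed. Unpacking the "cmmmu_acc" key follows the output contract listed on this page.

```python
def run_cmmmu_sketch(docs, model_fn, doc_to_text, doc_to_visual,
                     process_results, aggregate_results):
    """Drive the four workflow steps for a list of documents."""
    records = []
    for doc in docs:
        prompt = doc_to_text(doc)            # step 1: Chinese prompt
        images = doc_to_visual(doc)          # step 1: images for the prompt
        response = model_fn(prompt, images)  # step 2: model response
        # step 3: results is a list holding the single prediction string
        processed = process_results(doc, [response])
        records.append(processed["cmmmu_acc"])
    return aggregate_results(records)        # step 4: overall accuracy
```

With the real module, the last four arguments would be cmmmu_doc_to_text, cmmmu_doc_to_visual, cmmmu_process_results, and cmmmu_aggregate_results.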

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/tasks/cmmmu/utils.py
Lines: 1-420

Signature

# Document conversion functions
def cmmmu_doc_to_text(doc):
    """Convert document to Chinese text prompt based on question type."""

def cmmmu_doc_to_visual(doc):
    """Extract images referenced in the prompt."""

# Result processing functions
def cmmmu_process_results(doc, results):
    """Process and evaluate single result against ground truth."""

def cmmmu_aggregate_results(results):
    """Aggregate results by domain and subdomain, return overall accuracy."""

# Test submission functions
def cmmmu_process_test_results_for_submission(doc, results):
    """Format test results for submission."""

def cmmmu_test_aggregate_results_for_submission(results, args):
    """Save test results to submission file."""

# Helper functions
def construct_prompt(sample):
    """Build Chinese prompt with question and options."""

def get_multi_choice_prediction(response, all_choices, index2ans):
    """Extract multiple choice answer from response."""

def get_fill_blank_prediction(response, answer):
    """Extract fill-in-blank answers from response."""

def get_TF_prediction(response):
    """Extract true/false answer from response."""

def eval_cmmmu(entries):
    """Evaluate list of entries and compute accuracy."""
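
To make the contract of get_multi_choice_prediction more concrete, the following is a simplified sketch of one way such extraction could work: count explicit "(A)"-style mentions, fall back to bare letters, then to the option text, and return the most frequent labels in label order. The function name pick_choices_sketch is hypothetical, and the module's actual fallbacks, including what it does with an unparseable response, may differ.

```python
from collections import Counter


def pick_choices_sketch(response, all_choices, index2ans):
    """Return the most frequently mentioned choice label(s), in label order."""
    # Prefer explicit "(A)" style mentions.
    counts = Counter({c: response.count(f"({c})") for c in all_choices})
    if sum(counts.values()) == 0:
        # Fall back to bare letters such as "A".
        counts = Counter({c: response.count(c) for c in all_choices})
    if sum(counts.values()) == 0:
        # Fall back to matching the option text itself.
        counts = Counter({c: response.count(index2ans[c]) for c in all_choices})
    if sum(counts.values()) == 0:
        return ""  # nothing recognisable in the response (sketch-only choice)
    best = max(counts.values())
    return "".join(c for c in all_choices if counts[c] == best)
```

Returning the tied labels joined in label order is what allows multi-select outputs such as "AB".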

Import

from lmms_eval.tasks.cmmmu.utils import (
    cmmmu_doc_to_text,
    cmmmu_doc_to_visual,
    cmmmu_process_results,
    cmmmu_aggregate_results,
    get_multi_choice_prediction,
    get_fill_blank_prediction,
    get_TF_prediction
)

I/O Contract

Inputs

Name	Type	Required	Description
doc	dict	Yes	Document with id, question, type (选择 multiple choice / 判断 true-false / 填空 fill-in-blank), options, answer, subcategory, images
results	list	Yes	List with single prediction string from model
response	str	Yes	Raw model response for parsing
all_choices	list	Yes	List of valid choice labels (e.g., ['A', 'B', 'C', 'D'])
index2ans	dict	Yes	Mapping from choice labels to answer text
answer	str	Yes	Ground truth answer for comparison

Outputs

Name	Type	Description
doc_to_text return	str	Chinese prompt with question and options
doc_to_visual return	list	List of PIL Images in RGB format
process_results return	dict	Dict with single key "cmmmu_acc" containing evaluation data
aggregate_results return	float	Overall accuracy (0.0-1.0)
get_multi_choice_prediction return	str	Predicted choice(s) (e.g., "A", "AB", "ACD")
get_fill_blank_prediction return	list	List of normalized predicted answers
get_TF_prediction return	list	List of key phrases from response
eval_cmmmu return	dict	Dict with correct_num, entries_num, acc
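
The output shapes above imply a different comparison for each question type. The sketch below shows one way to score entries and produce a summary with the same keys as eval_cmmmu. All names ending in _sketch and the two keyword tuples are assumptions made for illustration; the module's real keyword lists and tie-breaking rules are not reproduced here.

```python
POSITIVE = ("正确", "对")                    # "correct", "right"
NEGATIVE = ("不对", "错误", "不正确", "错")  # "not right", "wrong", "incorrect"


def judge_true_false_sketch(phrases):
    """Vote over key phrases; return "对", "错", or None on a tie."""
    pos = neg = 0
    for phrase in phrases:
        # Check negatives first: "不正确" also contains the positive "正确".
        if any(k in phrase for k in NEGATIVE):
            neg += 1
        elif any(k in phrase for k in POSITIVE):
            pos += 1
    if pos == neg:
        return None
    return "对" if pos > neg else "错"


def is_correct_sketch(entry):
    """Score one entry according to its question type."""
    qtype, pred, answer = entry["question_type"], entry["parsed_pred"], entry["answer"]
    if qtype == "选择":   # multiple choice: exact label match
        return pred == answer
    if qtype == "判断":   # true/false: keyword vote over key phrases
        return judge_true_false_sketch(pred) == answer
    if qtype == "填空":   # fill-in-blank: any candidate matches
        return any(str(p) == answer for p in pred)
    return False


def eval_sketch(entries):
    """Summarise a list of entries with the keys listed for eval_cmmmu."""
    correct = sum(is_correct_sketch(e) for e in entries)
    total = len(entries)
    return {"correct_num": correct, "entries_num": total,
            "acc": correct / total if total else 0.0}
```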

Usage Examples

from PIL import Image

# Example 1: Multiple choice question
doc = {
    "id": "cmmmu_001",
    "type": "选择",
    "question": "下图显示的是什么物体?",
    "option1": "苹果",
    "option2": "橙子",
    "option3": "香蕉",
    "option4": "葡萄",
    "answer": "A",
    "subcategory": "生物",
    "image_1": Image.open("image1.jpg"),
    "image_1_filename": "image1.jpg"
}

# Convert to prompt
prompt = cmmmu_doc_to_text(doc)
# Returns: "请回答以下多项选择题...\n问题：下图显示的是什么物体?\n选项：\n(A) 苹果\n(B) 橙子\n(C) 香蕉\n(D) 葡萄\n正确答案：\n"

# Get images
images = cmmmu_doc_to_visual(doc)
# Returns: [<PIL.Image.Image>]

# Process results
results = ["答案是A,这是一个苹果"]
processed = cmmmu_process_results(doc, results)
# Returns: {"cmmmu_acc": {"id": "cmmmu_001", "subdomain": "生物",
#           "question_type": "选择", "answer": "A", "parsed_pred": "A"}}

# Example 2: Fill-in-blank question
doc_fill = {
    "id": "cmmmu_002",
    "type": "填空",
    "question": "圆周率π约等于多少?",
    "answer": "3.14",
    "subcategory": "数学",
    "image_1": None
}

results = ["π的值大约是3.14159,通常我们取π=3.14"]
processed = cmmmu_process_results(doc_fill, results)
# parsed_pred is a list of normalized candidates extracted from the response (expected to include 3.14)

# Example 3: True/False question
doc_tf = {
    "id": "cmmmu_003",
    "type": "判断",
    "question": "地球是平的,这个说法是否正确?",
    "answer": "错",
    "subcategory": "地理"
}

results = ["这个说法是错误的,地球是圆的"]
processed = cmmmu_process_results(doc_tf, results)
# Will detect "错误" keyword and match against answer "错"

# Example 4: Aggregate results
all_results = [
    {"subdomain": "生物", "question_type": "选择", "answer": "A", "parsed_pred": "A"},
    {"subdomain": "数学", "question_type": "填空", "answer": "3.14", "parsed_pred": [3.14]},
    {"subdomain": "地理", "question_type": "判断", "answer": "错", "parsed_pred": ["错误"]}
]

accuracy = cmmmu_aggregate_results(all_results)
# Returns overall accuracy and prints breakdown by domain
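
The per-domain breakdown mentioned in Example 4 can be sketched as a simple group-by. The function accuracy_by_subdomain_sketch and its is_correct parameter are hypothetical. The sketch computes the overall figure as a micro-average over all entries, which is an assumption about how the module weights groups.

```python
from collections import defaultdict


def accuracy_by_subdomain_sketch(entries, is_correct):
    """Group entries by "subdomain"; return (per-group accuracy, overall accuracy)."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["subdomain"]].append(bool(is_correct(entry)))
    per_group = {name: sum(flags) / len(flags) for name, flags in groups.items()}
    total = sum(len(flags) for flags in groups.values())
    # Micro-average: every entry counts once, whatever its group size.
    overall = sum(sum(flags) for flags in groups.values()) / total if total else 0.0
    return per_group, overall
```

Passing the correctness check in as a callable keeps the grouping logic independent of how each question type is scored.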

Related Pages

Principle:EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions
