Implementation:EvolvingLMMs Lab Lmms eval CMMMU Utils

From Leeroopedia
Knowledge Sources
Domains: Computer Vision, Multimodal Reasoning, Chinese Language Processing
Last Updated: 2026-02-14 00:00 GMT

Overview

Utility functions for evaluating models on the Chinese MMMU (CMMMU) benchmark, which contains multiple-choice, true/false, and fill-in-the-blank questions.

Description

This module provides the evaluation infrastructure for CMMMU (Chinese Massive Multi-discipline Multimodal Understanding), a Chinese-language benchmark that assesses multimodal reasoning across academic disciplines. It includes functions that convert documents into model inputs with Chinese prompts and image placeholders, parse model responses for three question types (multiple choice, true/false, fill-in-the-blank), and aggregate results across six academic domains: 艺术与设计 (Art and Design), 商业 (Business), 科学 (Science), 健康与医学 (Health and Medicine), 人文社会科学 (Humanities and Social Sciences), and 技术与工程 (Technology and Engineering). Answer extraction covers number parsing with Chinese formatting, keyword-based true/false detection, and multi-answer matching for fill-in-the-blank questions.
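
To make the number-parsing step more concrete, here is a minimal sketch of pulling numeric candidates out of a free-form Chinese response. It is not the module's implementation: the helper name `extract_numbers` and the regular expression are assumptions for illustration, and the sketch assumes that "Chinese formatting" includes full-width commas (，) used as thousands separators. It handles unsigned integers and decimals only.

```python
import re

# Unsigned integers or decimals, optionally grouped with ASCII or
# full-width commas (e.g. "1,234" or "12，345.5"). Hypothetical pattern.
_NUMBER = re.compile(r"\d{1,3}(?:[,，]\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return every number found in `text` as a float, in order of appearance."""
    numbers = []
    for match in _NUMBER.finditer(text):
        # Drop both kinds of thousands separator before converting.
        cleaned = match.group().replace(",", "").replace("，", "")
        numbers.append(float(cleaned))
    return numbers
```

A parser along these lines returns all candidates rather than a single value, which leaves the decision of which candidate counts as the answer to the matching step.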

Usage

Use this module when evaluating multimodal models on CMMMU benchmark tasks. The workflow has four steps: (1) call cmmmu_doc_to_text and cmmmu_doc_to_visual to prepare the Chinese prompt and images, (2) collect the model response, (3) call cmmmu_process_results to parse and evaluate the answer, and (4) call cmmmu_aggregate_results to compute accuracy per domain and overall. For test-set submissions, use cmmmu_process_test_results_for_submission and cmmmu_test_aggregate_results_for_submission to generate the submission file.
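
The order of the four steps can be sketched as a small driver loop. This is only an illustration: an evaluation harness would normally make these calls itself, and `run_cmmmu_style_eval`, the `generate` callable, and the `metric_key` parameter are hypothetical names introduced here. The task functions are passed in as arguments so the sketch stays self-contained.

```python
def run_cmmmu_style_eval(docs, generate, doc_to_text, doc_to_visual,
                         process_results, aggregate_results,
                         metric_key="cmmmu_acc"):
    """Run the four-step loop over `docs` and return the aggregated score."""
    per_doc = []
    for doc in docs:
        prompt = doc_to_text(doc)            # step 1: build the prompt
        images = doc_to_visual(doc)          # step 1: collect the images
        response = generate(prompt, images)  # step 2: query the model (stand-in)
        processed = process_results(doc, [response])  # step 3: parse the answer
        per_doc.append(processed[metric_key])
    return aggregate_results(per_doc)        # step 4: aggregate over all docs
```

With the real module, the four task callables would be cmmmu_doc_to_text, cmmmu_doc_to_visual, cmmmu_process_results, and cmmmu_aggregate_results, while `generate` stands in for whatever model wrapper is being evaluated.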

Code Reference

Source Location

Signature

# Document conversion functions
def cmmmu_doc_to_text(doc):
    """Convert document to Chinese text prompt based on question type."""

def cmmmu_doc_to_visual(doc):
    """Extract images referenced in the prompt."""

# Result processing functions
def cmmmu_process_results(doc, results):
    """Process and evaluate single result against ground truth."""

def cmmmu_aggregate_results(results):
    """Aggregate results by domain and subdomain, return overall accuracy."""

# Test submission functions
def cmmmu_process_test_results_for_submission(doc, results):
    """Format test results for submission."""

def cmmmu_test_aggregate_results_for_submission(results, args):
    """Save test results to submission file."""

# Helper functions
def construct_prompt(sample):
    """Build Chinese prompt with question and options."""

def get_multi_choice_prediction(response, all_choices, index2ans):
    """Extract multiple choice answer from response."""

def get_fill_blank_prediction(response, answer):
    """Extract fill-in-blank answers from response."""

def get_TF_prediction(response):
    """Extract true/false answer from response."""

def eval_cmmmu(entries):
    """Evaluate list of entries and compute accuracy."""

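As an illustration of what a helper with the signature of get_multi_choice_prediction has to do, the sketch below extracts choice labels from a response and falls back to matching the option text. It is an assumption-laden approximation, not the module's code: `pick_choices` is a hypothetical name, and the rule "runs of choice letters not embedded in a longer ASCII word" is a simplification chosen for this example.

```python
import re


def pick_choices(response, all_choices, index2ans):
    """Return the chosen labels as one sorted string, e.g. "A" or "AC"."""
    labels = "".join(re.escape(choice) for choice in all_choices)
    # Runs of choice letters that are not part of a longer ASCII word.
    pattern = rf"(?<![A-Za-z])[{labels}]+(?![A-Za-z])"
    found = set()
    for run in re.findall(pattern, response):
        found.update(run)
    if not found:
        # Fallback: look for the option text itself in the response.
        found = {label for label, text in index2ans.items()
                 if text and text in response}
    return "".join(sorted(found))
```

A heuristic this simple will misread uppercase English words made only of choice letters, so a production parser needs stricter rules than this sketch applies.
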
Import

from lmms_eval.tasks.cmmmu.utils import (
    cmmmu_doc_to_text,
    cmmmu_doc_to_visual,
    cmmmu_process_results,
    cmmmu_aggregate_results,
    get_multi_choice_prediction,
    get_fill_blank_prediction,
    get_TF_prediction
)

I/O Contract

Inputs

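| Name | Type | Required | Description |
|---|---|---|---|
| doc | dict | Yes | Document with id, question, type (选择 = multiple choice, 判断 = true/false, 填空 = fill-in-the-blank), options (option1 through option4), answer, subcategory, and images (image_1 and image_1_filename in the usage examples) |
| results | list | Yes | List with a single prediction string from the model |
| response | str | Yes | Raw model response for parsing |
| all_choices | list | Yes | List of valid choice labels (e.g., ['A', 'B', 'C', 'D']) |
| index2ans | dict | Yes | Mapping from choice labels to answer text |
| answer | str | Yes | Ground truth answer for comparison |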

Outputs

| Name | Type | Description |
|---|---|---|
| doc_to_text return | str | Chinese prompt with question and options |
| doc_to_visual return | list | List of PIL Images in RGB format |
| process_results return | dict | Dict with single key "cmmmu_acc" containing evaluation data |
| aggregate_results return | float | Overall accuracy (0.0-1.0) |
| get_multi_choice_prediction return | str | Predicted choice(s) (e.g., "A", "AB", "ACD") |
| get_fill_blank_prediction return | list | List of normalized predicted answers |
| get_TF_prediction return | list | List of key phrases from response |
| eval_cmmmu return | dict | Dict with correct_num, entries_num, acc |
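
Because get_TF_prediction returns a list of key phrases rather than a verdict, a later step has to turn those phrases into 对 (true) or 错 (false). The following is a minimal sketch of one way to do that by keyword counting. The function name `judge_true_false` and both keyword tuples are assumptions made for this example, not values taken from the module.

```python
# Hypothetical keyword lists. Negative keywords are checked first because
# some of them contain a positive keyword (e.g. "不正确" contains "正确").
NEGATIVE_KEYWORDS = ("不对", "错误", "不正确", "不准确", "错")
POSITIVE_KEYWORDS = ("正确", "对", "准确")


def judge_true_false(phrases):
    """Map key phrases to "对", "错", or None when the evidence is tied."""
    positive = negative = 0
    for phrase in phrases:
        if any(word in phrase for word in NEGATIVE_KEYWORDS):
            negative += 1
        elif any(word in phrase for word in POSITIVE_KEYWORDS):
            positive += 1
    if positive > negative:
        return "对"
    if negative > positive:
        return "错"
    return None  # no keyword found, or an even split
```

Returning None on a tie leaves the caller free to count the item as wrong or to apply another tie-breaking rule.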

Usage Examples

import PIL.Image  # required for PIL.Image.open below

# Example 1: Multiple choice question
doc = {
    "id": "cmmmu_001",
    "type": "选择",
    "question": "下图显示的是什么物体?",
    "option1": "苹果",
    "option2": "橙子",
    "option3": "香蕉",
    "option4": "葡萄",
    "answer": "A",
    "subcategory": "生物",
    "image_1": PIL.Image.open("image1.jpg"),
    "image_1_filename": "image1.jpg"
}

# Convert to prompt
prompt = cmmmu_doc_to_text(doc)
# Returns: "请回答以下多项选择题...\n问题:下图显示的是什么物体?\n选项:\n(A) 苹果\n(B) 橙子\n(C) 香蕉\n(D) 葡萄\n正确答案:\n"

# Get images
images = cmmmu_doc_to_visual(doc)
# Returns: [<PIL.Image.Image>]

# Process results
results = ["答案是A,这是一个苹果"]
processed = cmmmu_process_results(doc, results)
# Returns: {"cmmmu_acc": {"id": "cmmmu_001", "subdomain": "生物",
#           "question_type": "选择", "answer": "A", "parsed_pred": "A"}}

# Example 2: Fill-in-blank question
doc_fill = {
    "id": "cmmmu_002",
    "type": "填空",
    "question": "圆周率π约等于多少?",
    "answer": "3.14",
    "subcategory": "数学",
    "image_1": None
}

results = ["π的值大约是3.14159,通常我们取π=3.14"]
processed = cmmmu_process_results(doc_fill, results)
# Parsed prediction will be [3.14] after extraction and normalization

# Example 3: True/False question
doc_tf = {
    "id": "cmmmu_003",
    "type": "判断",
    "question": "地球是平的,这个说法是否正确?",
    "answer": "错",
    "subcategory": "地理"
}

results = ["这个说法是错误的,地球是圆的"]
processed = cmmmu_process_results(doc_tf, results)
# Will detect "错误" keyword and match against answer "错"

# Example 4: Aggregate results
all_results = [
    {"subdomain": "生物", "question_type": "选择", "answer": "A", "parsed_pred": "A"},
    {"subdomain": "数学", "question_type": "填空", "answer": "3.14", "parsed_pred": [3.14]},
    {"subdomain": "地理", "question_type": "判断", "answer": "错", "parsed_pred": ["错误"]}
]

accuracy = cmmmu_aggregate_results(all_results)
# Returns overall accuracy and prints breakdown by domain
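
The per-domain breakdown mentioned in Example 4 can be pictured with the sketch below, which groups already-scored entries by domain and reports correct_num, entries_num, and acc for each group. It rests on several assumptions: `accuracy_by_domain` is a hypothetical helper, the subdomain-to-domain mapping is a small illustrative subset, each entry carries a precomputed boolean "correct" flag, and the overall score is computed as a simple micro-average.

```python
from collections import defaultdict

# Illustrative, partial subdomain -> domain mapping (an assumption).
SUBDOMAIN_TO_DOMAIN = {"生物": "科学", "数学": "科学", "会计": "商业"}


def accuracy_by_domain(scored):
    """Return (per-domain report, overall accuracy) for scored entries."""
    buckets = defaultdict(lambda: {"correct_num": 0, "entries_num": 0})
    for item in scored:
        domain = SUBDOMAIN_TO_DOMAIN.get(item["subdomain"], "unknown")
        buckets[domain]["entries_num"] += 1
        buckets[domain]["correct_num"] += int(item["correct"])
    report = {
        domain: {**counts, "acc": counts["correct_num"] / counts["entries_num"]}
        for domain, counts in buckets.items()
    }
    total = sum(counts["entries_num"] for counts in buckets.values())
    correct = sum(counts["correct_num"] for counts in buckets.values())
    overall = correct / total if total else 0.0
    return report, overall
```

Keeping the counts next to each accuracy value makes it possible to recompute a weighted overall score later without revisiting the individual entries.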

Related Pages
