Implementation:Openai Evals MMMU Eval Config
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Configuration |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
The MMMU (Massive Multi-discipline Multimodal Understanding) eval configuration file registers 30 subject-specific multimodal evaluations that test a model's ability to reason over questions containing images, diagrams, and other visual content across academic disciplines.
Description
mmmu.yaml is a declarative YAML configuration file located in the OpenAI Evals registry. It defines evaluation entries for the MMMU benchmark, which assesses multimodal understanding across college-level subjects. The file contains 419 lines organized as 30 subject groups, each consisting of three YAML mappings:
- Alias entry (e.g., mmmu-accounting) -- provides a human-friendly evaluation name, points to the validation split by default, and declares [accuracy] as the metric.
- Dev entry (e.g., mmmu-accounting.dev.v1) -- specifies the eval class (evals.elsuite.mmmu.eval:MMMU), loads the dev split from the HuggingFace mmmu/mmmu dataset, and passes the subject argument as a properly cased string.
- Validation entry (e.g., mmmu-accounting.validation.v1) -- identical structure to the dev entry but loads the validation split.
Unlike the MMLU configuration, which relies on the generic MultipleChoice class, MMMU uses a dedicated implementation, evals.elsuite.mmmu.eval:MMMU, because its questions include embedded images and require multimodal processing. Each versioned entry also passes an explicit subject argument with proper capitalization (e.g., "Architecture and Engineering") in addition to the dataset URI.
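The three-entry pattern is regular enough to be generated. The following is a minimal sketch, not part of the Evals codebase: the helper name mmmu_entries is hypothetical, and the naming rule (lowercase with hyphens for the registry key, underscores for the dataset name) is inferred from the Accounting and Architecture and Engineering examples in the Configuration Schema section, so it may not hold for every subject.

```python
def mmmu_entries(subject: str) -> dict:
    """Build the alias, dev, and validation entries for one subject.

    Assumption: key and dataset-name casing follow the two examples
    documented on this page; other subjects are not verified here.
    """
    slug = "mmmu-" + subject.lower().replace(" ", "-")
    hf_name = subject.replace(" ", "_")

    # Alias entry: points at the validation split by default.
    entries = {slug: {"id": f"{slug}.validation.v1", "metrics": ["accuracy"]}}

    # Versioned entries: one per split, same class and subject argument.
    for split in ("dev", "validation"):
        entries[f"{slug}.{split}.v1"] = {
            "class": "evals.elsuite.mmmu.eval:MMMU",
            "args": {
                "dataset": f"hf://mmmu/mmmu?name={hf_name}&split={split}",
                "subject": subject,
            },
        }
    return entries


if __name__ == "__main__":
    for key in mmmu_entries("Architecture and Engineering"):
        print(key)
```

Dumping the returned dictionary with a YAML library should give entries of the same shape as those in mmmu.yaml, although the file itself appears to be maintained as static YAML.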
The 30 subjects span the following broad categories:
- Art and Design: Art, Art Theory, Design, Music
- Business: Accounting, Economics, Finance, Manage, Marketing
- Science: Biology, Chemistry, Geography, Math, Physics
- Health and Medicine: Basic Medical Science, Clinical Medicine, Diagnostics and Laboratory Medicine, Pharmacy, Public Health, Psychology
- Humanities and Social Sciences: History, Literature, Sociology
- Engineering and Technology: Architecture and Engineering, Computer Science, Electronics, Energy and Power, Materials, Mechanical Engineering
- Other: Agriculture
Usage
Use this configuration to benchmark a model's multimodal reasoning across 30 college-level disciplines within the OpenAI Evals framework. Each subject can be run independently: reference its alias (e.g., mmmu-art) to run the validation split, or name a versioned entry (e.g., mmmu-physics.dev.v1) to select a split explicitly.
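To illustrate how an alias leads to a runnable spec, here is a small sketch that follows id pointers until it reaches an entry with a class. It is not the registry's actual resolution code; resolve_alias and the in-memory registry dictionary are hypothetical stand-ins that mirror the Accounting entries documented on this page.

```python
from typing import Dict, Tuple


def resolve_alias(registry: Dict[str, dict], name: str, max_hops: int = 10) -> Tuple[str, dict]:
    """Follow `id` pointers until an entry that declares a `class` is found."""
    for _ in range(max_hops):
        spec = registry[name]
        if "class" in spec:
            return name, spec
        name = spec["id"]  # alias entry: hop to the versioned entry
    raise ValueError("alias chain is too long or cyclic")


# Hypothetical in-memory copy of two entries from mmmu.yaml.
registry = {
    "mmmu-accounting": {
        "id": "mmmu-accounting.validation.v1",
        "metrics": ["accuracy"],
    },
    "mmmu-accounting.validation.v1": {
        "class": "evals.elsuite.mmmu.eval:MMMU",
        "args": {
            "dataset": "hf://mmmu/mmmu?name=Accounting&split=validation",
            "subject": "Accounting",
        },
    },
}

if __name__ == "__main__":
    resolved_name, spec = resolve_alias(registry, "mmmu-accounting")
    print(resolved_name, spec["args"]["subject"])
```

Passing a versioned name such as mmmu-accounting.validation.v1 returns that entry unchanged, which matches the behaviour of naming a split directly on the command line.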
Code Reference
Source Location
- Repository: Openai_Evals
- File: evals/registry/evals/mmmu.yaml
- Lines: 1-419
Configuration Schema
The following shows the repeating three-entry pattern used for each of the 30 subjects:
# Alias entry -- defaults to the validation split
mmmu-accounting:
  id: mmmu-accounting.validation.v1
  metrics: [accuracy]
# Dev split entry
mmmu-accounting.dev.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Accounting&split=dev
    subject: Accounting
# Validation split entry
mmmu-accounting.validation.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Accounting&split=validation
    subject: Accounting
A second representative example for a multi-word subject:
mmmu-architecture-and-engineering:
  id: mmmu-architecture-and-engineering.validation.v1
  metrics: [accuracy]
mmmu-architecture-and-engineering.dev.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Architecture_and_Engineering&split=dev
    subject: Architecture and Engineering
mmmu-architecture-and-engineering.validation.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Architecture_and_Engineering&split=validation
    subject: Architecture and Engineering
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Versioned eval identifier that the alias resolves to (e.g., mmmu-accounting.validation.v1) |
| metrics | list[string] | Yes | List of metric names to compute; always [accuracy] for MMMU |
| class | string | Yes | Fully-qualified Python class path (evals.elsuite.mmmu.eval:MMMU) |
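| args.dataset | string (URI) | Yes | HuggingFace dataset URI of the form hf://mmmu/mmmu?name={Subject}&split={split}, where split is dev or validation |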
| args.subject | string | Yes | Human-readable subject name with proper capitalization (e.g., "Architecture and Engineering") |
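The args.dataset value packs the dataset repository, subject configuration, and split into a single URI. As a rough illustration of its structure, the sketch below splits such a URI with the Python standard library. How the Evals framework itself parses hf:// URIs is not described on this page, so parse_dataset_uri is a hypothetical helper rather than the framework's method.

```python
from urllib.parse import parse_qs, urlsplit


def parse_dataset_uri(uri: str) -> dict:
    """Split an hf:// dataset URI into repository, config name, and split."""
    parts = urlsplit(uri)
    if parts.scheme != "hf":
        raise ValueError(f"expected an hf:// URI, got {uri!r}")
    query = parse_qs(parts.query)
    return {
        # netloc is the organisation, path is "/<dataset>"
        "repo": parts.netloc + parts.path,
        "name": query["name"][0],
        "split": query["split"][0],
    }


if __name__ == "__main__":
    print(parse_dataset_uri("hf://mmmu/mmmu?name=Accounting&split=dev"))
```

Note that the name parameter uses underscores for multi-word subjects (Architecture_and_Engineering), whereas args.subject carries the same subject with spaces.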
Outputs
| Name | Type | Description |
|---|---|---|
| accuracy | float | Fraction of multimodal questions the model answered correctly (0.0 to 1.0) for the given subject |
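The accuracy metric is a simple ratio of correct answers to total questions. The sketch below shows that arithmetic only; it assumes answers have already been reduced to comparable strings (for example, option letters), which is an assumption, since the answer-extraction logic of the MMMU eval class is not covered on this page.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match the answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy over zero questions")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    # Three of four hypothetical answers match.
    print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
```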
Usage Examples
Running a Single MMMU Subject (Validation Split)
oaieval gpt-4 mmmu-art
Running a Specific Split Directly
oaieval gpt-4 mmmu-physics.dev.v1
Running a Multi-Word Subject
oaieval gpt-4 mmmu-architecture-and-engineering
Related Pages
- Openai_Evals_Eval_YAML_Registration -- describes the registry mechanism that loads this YAML file
- Openai_Evals_Registry_Get_Eval -- the function that resolves eval aliases to versioned specs
- Openai_Evals_Oaieval_Run -- the CLI entrypoint used to launch evaluations