Implementation:Openai Evals MMMU Eval Config

Knowledge Sources	Openai_Evals
Domains	Evaluation, Configuration
Last Updated	2026-02-14 10:00 GMT

Overview

The MMMU (Massive Multi-discipline Multimodal Understanding) eval configuration file registers 30 subject-specific multimodal evaluations that test a model's ability to reason over questions containing images, diagrams, and other visual content across academic disciplines.

Description

mmmu.yaml is a declarative YAML configuration file located in the OpenAI Evals registry. It defines evaluation entries for the MMMU benchmark, which assesses multimodal understanding across college-level subjects. The file contains 419 lines organized as 30 subject groups, each consisting of three YAML mappings:

Alias entry (e.g., mmmu-accounting) -- provides a human-friendly evaluation name, points to the validation split by default, and declares [accuracy] as the metric.
Dev entry (e.g., mmmu-accounting.dev.v1) -- specifies the eval class (evals.elsuite.mmmu.eval:MMMU), loads the dev split from the HuggingFace mmmu/mmmu dataset, and passes the subject argument as a properly cased string.
Validation entry (e.g., mmmu-accounting.validation.v1) -- identical structure to the dev entry but loads the validation split.

Unlike MMLU, which uses the generic MultipleChoice class, MMMU uses a dedicated implementation, evals.elsuite.mmmu.eval:MMMU, because its questions include embedded images and require multimodal processing. Each dev and validation entry also passes an explicit subject argument (with proper capitalization, e.g., "Architecture and Engineering") in addition to the dataset URI.

The 30 subjects span the following broad categories:

Art and Design: Art, Art Theory, Design, Music
Business: Accounting, Economics, Finance, Manage, Marketing
Science: Biology, Chemistry, Geography, Math, Physics
Health and Medicine: Basic Medical Science, Clinical Medicine, Diagnostics and Laboratory Medicine, Pharmacy, Public Health
Humanities and Social Sciences: History, Literature, Psychology, Sociology
Engineering and Technology: Agriculture, Architecture and Engineering, Computer Science, Electronics, Energy and Power, Materials, Mechanical Engineering

Usage

Use this configuration when you want to benchmark a model's multimodal reasoning across 30 college-level disciplines. This is the standard way to run MMMU evaluations within the OpenAI Evals framework. Each subject can be run independently by referencing its alias (e.g., mmmu-art), which resolves to the validation split. The dev and validation splits can also be run directly by their versioned names (e.g., mmmu-art.dev.v1).
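
The alias behaviour described above can be pictured as a two-step lookup. The following is a minimal sketch, not the Evals registry code: REGISTRY is a hypothetical in-memory copy of the three accounting entries, and resolve_alias is an illustrative helper name that does not appear in the source.

```python
# Hypothetical in-memory copy of the three accounting entries from mmmu.yaml.
REGISTRY = {
    "mmmu-accounting": {
        "id": "mmmu-accounting.validation.v1",
        "metrics": ["accuracy"],
    },
    "mmmu-accounting.dev.v1": {
        "class": "evals.elsuite.mmmu.eval:MMMU",
        "args": {
            "dataset": "hf://mmmu/mmmu?name=Accounting&split=dev",
            "subject": "Accounting",
        },
    },
    "mmmu-accounting.validation.v1": {
        "class": "evals.elsuite.mmmu.eval:MMMU",
        "args": {
            "dataset": "hf://mmmu/mmmu?name=Accounting&split=validation",
            "subject": "Accounting",
        },
    },
}


def resolve_alias(name, registry):
    """Return (versioned_name, spec); follow an alias entry via its `id` key."""
    entry = registry[name]
    if "id" in entry:  # alias entries carry `id` instead of `class`
        name = entry["id"]
        entry = registry[name]
    return name, entry


versioned_name, spec = resolve_alias("mmmu-accounting", REGISTRY)
print(versioned_name)
print(spec["args"]["dataset"])
```

Passing a versioned name such as mmmu-accounting.dev.v1 skips the alias step and returns that entry unchanged.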

Code Reference

Source Location

Repository: Openai_Evals
File: evals/registry/evals/mmmu.yaml
Lines: 1-419

Configuration Schema

The following shows the repeating three-entry pattern used for each of the 30 subjects:

# Alias entry -- defaults to the validation split
mmmu-accounting:
  id: mmmu-accounting.validation.v1
  metrics: [accuracy]

# Dev split entry
mmmu-accounting.dev.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Accounting&split=dev
    subject: Accounting

# Validation split entry
mmmu-accounting.validation.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Accounting&split=validation
    subject: Accounting

A second representative example for a multi-word subject:

mmmu-architecture-and-engineering:
  id: mmmu-architecture-and-engineering.validation.v1
  metrics: [accuracy]

mmmu-architecture-and-engineering.dev.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Architecture_and_Engineering&split=dev
    subject: Architecture and Engineering

mmmu-architecture-and-engineering.validation.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Architecture_and_Engineering&split=validation
    subject: Architecture and Engineering
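
Because the pattern repeats for every subject, the three entries can be generated from the subject name alone. The sketch below assumes a naming rule inferred only from the two examples above (lowercase with hyphens for the eval name, underscores for the dataset config name); build_entries is a hypothetical helper, and the rule should be checked against mmmu.yaml before relying on it for other subjects.

```python
def build_entries(subject: str) -> dict:
    """Build the alias, dev, and validation entries for one MMMU subject."""
    # Assumed naming rule, inferred from the Accounting and
    # Architecture and Engineering examples only.
    slug = "mmmu-" + subject.lower().replace(" ", "-")
    hf_name = subject.replace(" ", "_")

    entries = {
        # Alias entry: points at the validation split by default.
        slug: {"id": f"{slug}.validation.v1", "metrics": ["accuracy"]},
    }
    for split in ("dev", "validation"):
        entries[f"{slug}.{split}.v1"] = {
            "class": "evals.elsuite.mmmu.eval:MMMU",
            "args": {
                "dataset": f"hf://mmmu/mmmu?name={hf_name}&split={split}",
                "subject": subject,
            },
        }
    return entries


for name in build_entries("Architecture and Engineering"):
    print(name)
```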

I/O Contract

Inputs

Name	Type	Required	Description
id	string	Yes	Versioned eval identifier that the alias resolves to (e.g., mmmu-accounting.validation.v1)
metrics	list[string]	Yes	List of metric names to compute; always [accuracy] for MMMU
class	string	Yes	Fully-qualified Python class path (evals.elsuite.mmmu.eval:MMMU)
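args.dataset	string (URI)	Yes	HuggingFace dataset URI naming the subject config and the split, either dev or validation (e.g., hf://mmmu/mmmu?name=Accounting&split=validation)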
args.subject	string	Yes	Human-readable subject name with proper capitalization (e.g., "Architecture and Engineering")
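
To make the structure of args.dataset concrete, the sketch below splits such a URI into its parts with Python's standard urllib.parse module. It illustrates the URI layout only; parse_dataset_uri is a hypothetical helper, and this is not how the Evals framework itself loads the dataset.

```python
from urllib.parse import parse_qs, urlsplit


def parse_dataset_uri(uri: str) -> dict:
    """Split an hf:// dataset URI into repository, config name, and split."""
    parts = urlsplit(uri)
    if parts.scheme != "hf":
        raise ValueError(f"expected an hf:// URI, got {uri!r}")
    query = parse_qs(parts.query)
    return {
        # netloc is the first path segment, path holds the rest.
        "repo": parts.netloc + parts.path,
        "name": query["name"][0],
        "split": query["split"][0],
    }


info = parse_dataset_uri("hf://mmmu/mmmu?name=Accounting&split=validation")
print(info)
```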

Outputs

Name	Type	Description
accuracy	float	Fraction of multimodal questions the model answered correctly (0.0 to 1.0) for the given subject
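
The accuracy value is a plain fraction of correct answers. As a minimal illustration (not the scoring code of the MMMU eval class, whose answer-extraction logic is not covered on this page), assuming predictions and reference answers are already available as option letters:

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy over zero questions")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Three of four hypothetical answers match, so this prints 0.75.
print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
```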

Usage Examples

Running a Single MMMU Subject (Validation Split)

oaieval gpt-4 mmmu-art

Running a Specific Split Directly

oaieval gpt-4 mmmu-physics.dev.v1

Running a Multi-Word Subject

oaieval gpt-4 mmmu-architecture-and-engineering
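
When several subjects need to be run, the command lines above can be assembled programmatically. The sketch below only builds and prints the commands; it does not invoke oaieval. The helper oaieval_command is hypothetical, and the subject-to-name rule (lowercase, spaces replaced by hyphens) is an assumption inferred from the examples on this page.

```python
import shlex


def oaieval_command(model, subject, split=None):
    """Build an oaieval argument list for one MMMU subject.

    With split=None the alias is used, which resolves to the validation split.
    """
    slug = subject.lower().replace(" ", "-")  # assumed naming rule
    eval_name = f"mmmu-{slug}" if split is None else f"mmmu-{slug}.{split}.v1"
    return ["oaieval", model, eval_name]


for subject in ["Art", "Physics", "Architecture and Engineering"]:
    print(shlex.join(oaieval_command("gpt-4", subject)))
```

Each resulting list could be handed to subprocess.run once the Evals package and model credentials are set up.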

Related Pages

Openai_Evals_Eval_YAML_Registration -- describes the registry mechanism that loads this YAML file
Openai_Evals_Registry_Get_Eval -- the function that resolves eval aliases to versioned specs
Openai_Evals_Oaieval_Run -- the CLI entrypoint used to launch evaluations
