Implementation:Openai Evals MMMU Eval Config

From Leeroopedia
Knowledge Sources
Domains Evaluation, Configuration
Last Updated 2026-02-14 10:00 GMT

Overview

The MMMU (Massive Multi-discipline Multimodal Understanding) eval configuration file registers 30 subject-specific multimodal evaluations that test a model's ability to reason over questions containing images, diagrams, and other visual content across academic disciplines.

Description

mmmu.yaml is a declarative YAML configuration file located in the OpenAI Evals registry. It defines evaluation entries for the MMMU benchmark, which assesses multimodal understanding across college-level subjects. The file contains 419 lines organized as 30 subject groups, each consisting of three YAML mappings:

  • Alias entry (e.g., mmmu-accounting) -- provides a human-friendly evaluation name, points to the validation split by default, and declares [accuracy] as the metric.
  • Dev entry (e.g., mmmu-accounting.dev.v1) -- specifies the eval class (evals.elsuite.mmmu.eval:MMMU), loads the dev split from the Hugging Face mmmu/mmmu dataset, and passes the subject argument as a properly cased string.
  • Validation entry (e.g., mmmu-accounting.validation.v1) -- identical structure to the dev entry but loads the validation split.

Unlike the MMLU configuration, which relies on the generic MultipleChoice class, MMMU uses a dedicated eval class, evals.elsuite.mmmu.eval:MMMU, because its questions include embedded images and require multimodal processing. Each dev and validation entry also passes an explicit subject argument (with proper capitalization, e.g., "Architecture and Engineering") in addition to the dataset URI.
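
The three-entry pattern described above can be sketched in Python as a small generator. This is only an illustration of the naming scheme, not code from the Evals repository: the function name build_subject_entries is hypothetical, and the slug rules (lowercase with hyphens for eval names, underscores for the dataset name) are inferred from the two examples shown on this page.

```python
def build_subject_entries(subject: str) -> dict:
    """Return the alias, dev, and validation mappings for one MMMU subject.

    Assumption: naming rules are inferred from the examples on this page.
    """
    slug = subject.lower().replace(" ", "-")   # e.g. architecture-and-engineering
    hf_name = subject.replace(" ", "_")        # e.g. Architecture_and_Engineering
    base = f"mmmu-{slug}"

    # Alias entry: points at the validation split and declares the metric.
    entries = {
        base: {"id": f"{base}.validation.v1", "metrics": ["accuracy"]},
    }

    # One versioned entry per split, each carrying the class and its args.
    for split in ("dev", "validation"):
        entries[f"{base}.{split}.v1"] = {
            "class": "evals.elsuite.mmmu.eval:MMMU",
            "args": {
                "dataset": f"hf://mmmu/mmmu?name={hf_name}&split={split}",
                "subject": subject,
            },
        }
    return entries


if __name__ == "__main__":
    for name in build_subject_entries("Architecture and Engineering"):
        print(name)
```

Keeping the human-readable subject separate from the two derived names mirrors the file itself, where the subject argument keeps its spaces while the eval name and dataset name do not.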

The 30 subjects span the following broad categories:

  • Art and Design: Art, Art Theory, Design, Music
  • Business: Accounting, Economics, Finance, Manage, Marketing
  • Science: Biology, Chemistry, Geography, Math, Physics
  • Health and Medicine: Basic Medical Science, Clinical Medicine, Diagnostics and Laboratory Medicine, Pharmacy, Public Health
  • Humanities and Social Sciences: History, Literature, Psychology, Sociology
  • Engineering and Technology: Agriculture, Architecture and Engineering, Computer Science, Electronics, Energy and Power, Materials, Mechanical Engineering
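
As a quick sanity check on the counts, the subject names from the list above can be written out as a flat Python list and expanded into the eval names the file is expected to define. This is a sketch under the assumption that eval names are the lowercased, hyphenated subject names; entry_names is a hypothetical helper, not part of the Evals framework.

```python
# All subject names as they appear in the category list on this page.
SUBJECTS = [
    "Accounting", "Agriculture", "Architecture and Engineering", "Art",
    "Art Theory", "Basic Medical Science", "Biology", "Chemistry",
    "Clinical Medicine", "Computer Science", "Design",
    "Diagnostics and Laboratory Medicine", "Economics", "Electronics",
    "Energy and Power", "Finance", "Geography", "History", "Literature",
    "Manage", "Marketing", "Materials", "Math", "Mechanical Engineering",
    "Music", "Pharmacy", "Physics", "Psychology", "Public Health",
    "Sociology",
]


def entry_names(subject: str) -> list:
    """Return the alias plus the two versioned split names for a subject."""
    base = "mmmu-" + subject.lower().replace(" ", "-")
    return [base, f"{base}.dev.v1", f"{base}.validation.v1"]


ALL_NAMES = [name for subject in SUBJECTS for name in entry_names(subject)]

if __name__ == "__main__":
    # 30 subjects, three mappings each.
    print(len(SUBJECTS), len(ALL_NAMES))  # prints: 30 90
```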

Usage

Use this configuration when you want to benchmark a model's multimodal reasoning across 30 college-level disciplines. This is the standard way to run MMMU evaluations within the OpenAI Evals framework. Each subject can be run independently by referencing its alias (e.g., mmmu-art), and both dev and validation splits are available for each subject. The alias defaults to the validation split.

Code Reference

Source Location

Configuration Schema

The following shows the repeating three-entry pattern used for each of the 30 subjects:

# Alias entry -- defaults to the validation split
mmmu-accounting:
  id: mmmu-accounting.validation.v1
  metrics: [accuracy]

# Dev split entry
mmmu-accounting.dev.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Accounting&split=dev
    subject: Accounting

# Validation split entry
mmmu-accounting.validation.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Accounting&split=validation
    subject: Accounting
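
To make the alias indirection concrete, the following sketch mirrors the three mappings above as a Python dictionary and follows the id field until it reaches an entry that names a class. It illustrates the idea only; the real registry lookup in the Evals framework may work differently, and resolve is a hypothetical helper.

```python
# In-memory mirror of the three YAML mappings for the Accounting subject.
REGISTRY = {
    "mmmu-accounting": {
        "id": "mmmu-accounting.validation.v1",
        "metrics": ["accuracy"],
    },
    "mmmu-accounting.dev.v1": {
        "class": "evals.elsuite.mmmu.eval:MMMU",
        "args": {
            "dataset": "hf://mmmu/mmmu?name=Accounting&split=dev",
            "subject": "Accounting",
        },
    },
    "mmmu-accounting.validation.v1": {
        "class": "evals.elsuite.mmmu.eval:MMMU",
        "args": {
            "dataset": "hf://mmmu/mmmu?name=Accounting&split=validation",
            "subject": "Accounting",
        },
    },
}


def resolve(name: str, registry: dict) -> tuple:
    """Follow alias 'id' links until an entry with a 'class' is found."""
    seen = set()
    while "class" not in registry[name]:
        if name in seen:
            raise ValueError(f"alias cycle detected at {name!r}")
        seen.add(name)
        name = registry[name]["id"]
    return name, registry[name]


if __name__ == "__main__":
    resolved_name, spec = resolve("mmmu-accounting", REGISTRY)
    print(resolved_name)            # mmmu-accounting.validation.v1
    print(spec["args"]["dataset"])  # hf://mmmu/mmmu?name=Accounting&split=validation
```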

A second representative example for a multi-word subject:

mmmu-architecture-and-engineering:
  id: mmmu-architecture-and-engineering.validation.v1
  metrics: [accuracy]

mmmu-architecture-and-engineering.dev.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Architecture_and_Engineering&split=dev
    subject: Architecture and Engineering

mmmu-architecture-and-engineering.validation.v1:
  class: evals.elsuite.mmmu.eval:MMMU
  args:
    dataset: hf://mmmu/mmmu?name=Architecture_and_Engineering&split=validation
    subject: Architecture and Engineering
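
The dataset URIs in these entries pack three pieces of information into one string: the dataset repository, the subject configuration name, and the split. A minimal sketch of pulling them apart with the standard library follows. The helper parse_hf_uri is hypothetical and is not claimed to match how the Evals framework itself parses hf:// URIs.

```python
from urllib.parse import parse_qs, urlsplit


def parse_hf_uri(uri: str) -> dict:
    """Split an hf:// dataset URI into repository, config name, and split."""
    parts = urlsplit(uri)
    if parts.scheme != "hf":
        raise ValueError(f"expected an hf:// URI, got {uri!r}")
    query = parse_qs(parts.query)
    return {
        # netloc is "mmmu" and path is "/mmmu", giving "mmmu/mmmu".
        "repo": f"{parts.netloc}{parts.path}",
        "name": query["name"][0],
        "split": query["split"][0],
    }


if __name__ == "__main__":
    info = parse_hf_uri(
        "hf://mmmu/mmmu?name=Architecture_and_Engineering&split=dev"
    )
    print(info["repo"], info["name"], info["split"])
```

Note that the name parameter uses underscores while the subject argument uses spaces, so the two values are related but not interchangeable.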

I/O Contract

Inputs

Name Type Required Description
id string Yes Versioned eval identifier that the alias resolves to (e.g., mmmu-accounting.validation.v1)
metrics list[string] Yes List of metric names to compute; always [accuracy] for MMMU
class string Yes Fully-qualified Python class path (evals.elsuite.mmmu.eval:MMMU)
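args.dataset string (URI) Yes Hugging Face dataset URI of the form hf://mmmu/mmmu?name={Subject}&split={dev|validation}, with underscores replacing spaces in multi-word subject names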
args.subject string Yes Human-readable subject name with proper capitalization (e.g., "Architecture and Engineering")
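
The input contract above can be turned into a small structural check. The sketch below assumes, based on the schema examples on this page, that id and metrics belong to alias entries while class and args belong to the versioned split entries. validate_entry is a hypothetical helper and not part of the Evals framework.

```python
EXPECTED_CLASS = "evals.elsuite.mmmu.eval:MMMU"


def validate_entry(name: str, entry: dict) -> list:
    """Return a list of problems found in one registry entry (empty if none)."""
    problems = []
    if "class" in entry:
        # Versioned split entry: needs the MMMU class and both args.
        if entry["class"] != EXPECTED_CLASS:
            problems.append(f"{name}: unexpected class {entry['class']!r}")
        args = entry.get("args", {})
        for key in ("dataset", "subject"):
            if not isinstance(args.get(key), str):
                problems.append(f"{name}: args.{key} must be a string")
    else:
        # Alias entry: needs a target id and the accuracy metric.
        if not isinstance(entry.get("id"), str):
            problems.append(f"{name}: id must be a string")
        if entry.get("metrics") != ["accuracy"]:
            problems.append(f"{name}: metrics must be ['accuracy']")
    return problems


if __name__ == "__main__":
    good = {"id": "mmmu-art.validation.v1", "metrics": ["accuracy"]}
    bad = {"class": EXPECTED_CLASS, "args": {"subject": "Art"}}
    print(validate_entry("mmmu-art", good))         # []
    print(validate_entry("mmmu-art.dev.v1", bad))   # one problem: missing dataset
```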

Outputs

Name Type Description
accuracy float Fraction of multimodal questions the model answered correctly (0.0 to 1.0) for the given subject
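
For reference, the accuracy value described above is a plain fraction of correct answers. A minimal sketch, assuming each question yields a boolean outcome (the function name and input shape are illustrative and are not taken from the MMMU eval class):

```python
def accuracy(outcomes: list) -> float:
    """Fraction of True values in a non-empty list of per-question outcomes."""
    if not outcomes:
        raise ValueError("cannot compute accuracy over zero questions")
    return sum(bool(o) for o in outcomes) / len(outcomes)


if __name__ == "__main__":
    # Three of four questions answered correctly.
    print(accuracy([True, False, True, True]))  # prints: 0.75
```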

Usage Examples

Running a Single MMMU Subject (Validation Split)

oaieval gpt-4 mmmu-art

Running a Specific Split Directly

oaieval gpt-4 mmmu-physics.dev.v1

Running a Multi-Word Subject

oaieval gpt-4 mmmu-architecture-and-engineering
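
When several subjects need to be run in one go, the commands above can be assembled programmatically. The sketch below only builds and prints the command lines and does not execute anything. oaieval_command is a hypothetical helper, and the eval-name derivation assumes the lowercase, hyphenated convention seen in the examples on this page.

```python
import shlex


def oaieval_command(model: str, subject: str, split: str = None) -> list:
    """Build an oaieval argument list for one MMMU subject.

    With split=None the alias is used, which defaults to validation.
    """
    slug = subject.lower().replace(" ", "-")
    if split is None:
        eval_name = f"mmmu-{slug}"
    else:
        eval_name = f"mmmu-{slug}.{split}.v1"
    return ["oaieval", model, eval_name]


if __name__ == "__main__":
    for subject in ("Art", "Physics", "Architecture and Engineering"):
        print(shlex.join(oaieval_command("gpt-4", subject)))
    # An explicit split selects the versioned entry directly.
    print(shlex.join(oaieval_command("gpt-4", "Physics", split="dev")))
```

Each returned list can be passed to subprocess.run as-is if the commands should actually be executed.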
