Principle:Openai Evals Multiple Choice Evaluation

Knowledge Sources	Openai_Evals; Measuring Massive Multitask Language Understanding; MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark; HellaSwag: Can a Machine Really Finish Your Sentence?
Domains	Evaluation, Multiple Choice, Benchmarking
Last Updated	2026-02-14 10:00 GMT

Overview

A principle that defines a standardized approach for evaluating large language models on multiple-choice question-answering tasks by reducing open-ended generation to discrete classification over labeled answer options.

Description

Multiple-choice evaluation is one of the most widely adopted paradigms for measuring LLM capability across knowledge domains. The evaluation workflow follows a consistent pattern:

The model receives a question accompanied by a set of labeled options (A, B, C, D, ...).
The model produces a single-letter answer corresponding to its chosen option.
Accuracy is computed by comparing the predicted letter to the ground-truth label.
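
The first two steps above can be sketched as follows. This is a minimal illustration, not the framework's actual code: the function names `format_prompt` and `extract_letter`, the prompt wording, and the "Answer: <letter>" reply convention are all assumptions made for this example.

```python
import re
from typing import Optional

LABELS = "ABCDEFGHIJ"  # assumed label alphabet; supports up to 10 options


def format_prompt(question: str, options: list[str]) -> str:
    """Render a question and its options as a lettered multiple-choice prompt."""
    lines = [question, ""]
    for label, option in zip(LABELS, options):
        lines.append(f"{label}) {option}")
    lines.append("")
    # Hypothetical instruction; the reply format is a convention chosen here.
    lines.append("Finish your reply with 'Answer: <letter>'.")
    return "\n".join(lines)


def extract_letter(response: str, num_options: int) -> Optional[str]:
    """Return the letter following 'Answer:' if it is a valid label, else None."""
    valid = LABELS[:num_options]
    match = re.search(rf"(?i)answer\s*:\s*([{valid}])\b", response)
    return match.group(1).upper() if match else None
```

Returning `None` for an unparseable reply lets the scoring step count it as incorrect instead of guessing at the model's intent.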

This principle is implemented across several prominent benchmarks within the Openai Evals framework:

MMLU (Massive Multitask Language Understanding) -- a text-only benchmark spanning 57 academic subjects ranging from elementary mathematics to professional law. Each subject contains questions at varying difficulty levels, and the model is evaluated per-subject and in aggregate.
MMMU (Massive Multi-discipline Multimodal Understanding) -- extends the MCQ paradigm to multimodal inputs across 30 subjects. Questions may include images, diagrams, or charts alongside text, testing the model's ability to integrate visual and textual information.
HellaSwag -- a commonsense reasoning benchmark where the model must select the most plausible continuation of a given scenario from four options.

A critical design consideration is random label shuffling. By randomizing the mapping between answer content and position labels (A, B, C, D), the evaluation guards against position bias -- the tendency of some models to favor certain answer positions regardless of content.
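
A minimal sketch of label shuffling is shown below. The helper name `shuffle_options` and its signature are hypothetical; the point is only that the options are permuted while the correct label is tracked through the permutation.

```python
import random

LABELS = "ABCDEFGHIJ"  # assumed label alphabet


def shuffle_options(options, correct_index, rng):
    """Randomly permute options; return (labeled_options, correct_label).

    options:       list of answer texts in their original order
    correct_index: index of the correct answer in the original order
    rng:           a random.Random instance, so runs can be seeded
    """
    order = list(range(len(options)))
    rng.shuffle(order)  # order[pos] is the original index shown at position pos
    labeled = [(LABELS[pos], options[src]) for pos, src in enumerate(order)]
    correct_label = LABELS[order.index(correct_index)]
    return labeled, correct_label


rng = random.Random(0)  # fixed seed keeps the shuffled benchmark reproducible
labeled, correct = shuffle_options(["Paris", "Lyon", "Nice", "Lille"], 0, rng)
```

Passing an explicit seeded generator, rather than using the module-level `random` functions, keeps the label assignment identical across runs and across the models being compared.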

Usage

Apply multiple-choice evaluation when:

You need a precise, reproducible accuracy metric rather than subjective quality judgments.
The task can be naturally formulated as selecting from a fixed set of options.
You want to benchmark across many domains simultaneously (e.g., testing broad knowledge with MMLU).
You need to compare models on a common leaderboard with well-defined scoring.
You want to incorporate multimodal inputs (images, diagrams) while retaining a simple scoring mechanism (MMMU).

Theoretical Basis

The theoretical foundation of MCQ evaluation rests on reducing generative tasks to classification, which yields several formal advantages:

Accuracy as a metric:

accuracy = (number of correct predictions) / (total number of questions)

This is well-defined, deterministic given the same model outputs, and comparable across models evaluated on the same question set.
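
The formula translates directly into code. In this sketch the function name `accuracy` and the choice to score a missing prediction (`None`) as incorrect are assumptions, not documented framework behavior.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the ground-truth labels.

    A prediction of None (e.g. an unparseable reply) never equals a label,
    so it is counted as incorrect.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy over zero questions")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


score = accuracy(["A", "B", None, "D"], ["A", "C", "C", "D"])  # 2 of 4 correct
```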

Position bias mitigation through label shuffling:

Given a question with options {O_1, O_2, ..., O_k} and a fixed correct answer O_c, the evaluation randomly assigns labels:

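shuffle: {O_1, O_2, ..., O_k} -> {A, B, C, ...}   (a uniformly random one-to-one assignment of the first k letters)

Under a uniformly random assignment, the correct option is equally likely to receive any label, so a model that always picks the same position scores 1/k in expectation. Above-chance accuracy therefore cannot come from a fixed positional preference alone.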
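
The effect of shuffling on a purely positional strategy can be checked exactly by enumerating every possible label assignment. The function below is an illustrative sketch (its name and arguments are hypothetical); it treats option 0 as the correct answer and counts how often a model that always picks one position would hit it.

```python
from fractions import Fraction
from itertools import permutations


def fixed_position_accuracy(k, position=0):
    """Exact accuracy of an always-pick-`position` model over all k! shuffles.

    Option 0 is taken to be the correct answer; order[pos] is the option
    displayed at position pos under a given shuffle.
    """
    orders = list(permutations(range(k)))
    hits = sum(1 for order in orders if order[position] == 0)
    return Fraction(hits, len(orders))


chance = fixed_position_accuracy(4)  # Fraction(1, 4)
```

For k options the correct answer lands on any given position in (k-1)! of the k! assignments, which is where the 1/k chance level comes from.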

Per-subject aggregation:

For benchmarks like MMLU, accuracy is computed both per-subject and macro-averaged:

macro_accuracy = (1 / N_subjects) * sum(accuracy_subject_i for i in 1..N_subjects)

This prevents subjects with more questions from dominating the overall score.
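
The difference between macro-averaging and pooling all questions (micro-averaging) is easiest to see on a small example. The subject names and counts below are made up for illustration, and the function names are hypothetical.

```python
def macro_accuracy(per_subject):
    """Unweighted mean of per-subject accuracies.

    per_subject maps subject name -> (num_correct, num_questions).
    """
    scores = [correct / total for correct, total in per_subject.values()]
    return sum(scores) / len(scores)


def micro_accuracy(per_subject):
    """Accuracy over all questions pooled together, for comparison."""
    total_correct = sum(correct for correct, _ in per_subject.values())
    total_questions = sum(total for _, total in per_subject.values())
    return total_correct / total_questions


# Illustrative counts only: one large subject and one small subject.
results = {"law": (900, 1000), "astronomy": (50, 100)}
macro = macro_accuracy(results)  # (0.9 + 0.5) / 2 = 0.7
micro = micro_accuracy(results)  # 950 / 1100, about 0.864
```

Here the pooled score is pulled toward the larger subject, while the macro average weights both subjects equally.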
