Principle:Openai Evals Multiple Choice Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Multiple Choice, Benchmarking |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
A principle that defines a standardized approach for evaluating large language models on multiple-choice question-answering tasks by reducing open-ended generation to discrete classification over labeled answer options.
Description
Multiple-choice evaluation is one of the most widely adopted paradigms for measuring LLM capability across knowledge domains. The evaluation workflow follows a consistent pattern:
- The model receives a question accompanied by a set of labeled options (A, B, C, D, ...).
- The model produces a single-letter answer corresponding to its chosen option.
- Accuracy is computed by comparing the predicted letter to the ground-truth label.
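The first two steps of this workflow can be sketched as follows. This is a minimal illustration, not the framework's own code: the function names `format_prompt` and `extract_letter`, the prompt wording, and the strict parsing rule are all assumptions made for the example.

```python
import re
from string import ascii_uppercase
from typing import List, Optional


def format_prompt(question: str, options: List[str]) -> str:
    """Render a question followed by lettered options (A, B, C, ...)."""
    lines = [question]
    for letter, option in zip(ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    # Hypothetical instruction line; real benchmarks use their own templates.
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def extract_letter(response: str, num_options: int) -> Optional[str]:
    """Parse a leading option letter such as 'B', '(B)' or 'Answer: B'.

    Returns None when no valid letter is found, so the caller can
    count the response as incorrect.
    """
    valid = ascii_uppercase[:num_options]
    # The letter must not be followed by another letter, so that a word
    # like "Banana" is not mistaken for the answer "B".
    pattern = rf"^(?:Answer:\s*)?\(?([{valid}])\)?(?![A-Za-z])"
    match = re.match(pattern, response.strip())
    return match.group(1) if match else None
```

Parsing is deliberately strict here: a response that does not begin with a valid letter is treated as unanswered rather than guessed at. A sentence that happens to start with the word "A" would still be read as the answer A, so a production parser would need a tighter output format.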
This principle is implemented across several prominent benchmarks within the OpenAI Evals framework:
- MMLU (Massive Multitask Language Understanding) -- a text-only benchmark spanning 57 academic subjects ranging from elementary mathematics to professional law. Each subject contains questions at varying difficulty levels, and the model is evaluated per-subject and in aggregate.
- MMMU (Massive Multi-discipline Multimodal Understanding) -- extends the MCQ paradigm to multimodal inputs across 30 subjects. Questions may include images, diagrams, or charts alongside text, testing the model's ability to integrate visual and textual information.
- HellaSwag -- a commonsense reasoning benchmark where the model must select the most plausible continuation of a given scenario from four options.
A critical design consideration is random label shuffling. By randomizing the mapping between answer content and position labels (A, B, C, D), the evaluation guards against position bias -- the tendency of some models to favor certain answer positions regardless of content.
Usage
Apply multiple-choice evaluation when:
- You need a precise, reproducible accuracy metric rather than subjective quality judgments.
- The task can be naturally formulated as selecting from a fixed set of options.
- You want to benchmark across many domains simultaneously (e.g., testing broad knowledge with MMLU).
- You need to compare models on a common leaderboard with well-defined scoring.
- You want to incorporate multimodal inputs (images, diagrams) while retaining a simple scoring mechanism (MMMU).
Theoretical Basis
The theoretical foundation of MCQ evaluation rests on reducing generative tasks to classification, which yields several formal advantages:
Accuracy as a metric:
accuracy = (number of correct predictions) / (total number of questions)
This is well-defined, deterministic given the same model outputs, and comparable across models evaluated on the same question set.
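As a small worked sketch of the formula, the function below scores a list of predicted letters against ground-truth labels. The name `accuracy` and the convention that an unparseable response is represented by `None` and counted as wrong are assumptions for illustration.

```python
from typing import List, Optional


def accuracy(predictions: List[Optional[str]], labels: List[str]) -> float:
    """Fraction of questions whose predicted letter equals the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy over zero questions")
    # A prediction of None (no parseable letter) never equals a label,
    # so it is counted as incorrect.
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)


# Two of four predictions match, so the score is 2 / 4.
score = accuracy(["A", "C", None, "D"], ["A", "B", "C", "D"])
```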
Position bias mitigation through label shuffling:
Given a question with options {O_1, O_2, ..., O_k} and a fixed correct answer O_c, the evaluation randomly assigns labels:
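shuffle: {O_1, O_2, ..., O_k} -> {L_1, L_2, ..., L_k}, a random one-to-one assignment, where L_1, ..., L_k are the first k letters (A, B, C, ...)
Under a uniform random assignment, the correct option O_c is equally likely to receive any label, so a strategy that depends only on position (for example, always answering A) scores at chance, 1/k in expectation. Above-chance accuracy therefore has to come from the content of the options rather than from positional heuristics.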
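One way to implement the label assignment is sketched below. The helper `shuffle_options` is hypothetical and is not taken from the framework; it assumes the correct answer is given as an index into the original option list and that a seeded `random.Random` instance is passed in for reproducibility.

```python
import random
from string import ascii_uppercase
from typing import List, Tuple


def shuffle_options(
    options: List[str], correct_index: int, rng: random.Random
) -> Tuple[List[str], str]:
    """Randomly reorder options and return them with the new correct letter."""
    order = list(range(len(options)))
    rng.shuffle(order)  # order[j] is the original index shown at position j
    shuffled = [options[i] for i in order]
    # The correct option now sits wherever its original index landed.
    new_label = ascii_uppercase[order.index(correct_index)]
    return shuffled, new_label


rng = random.Random(0)  # fixed seed so the shuffle is repeatable
options = ["Paris", "Rome", "Oslo", "Bern"]
shuffled, label = shuffle_options(options, correct_index=0, rng=rng)
# The returned letter always points at the original correct option.
assert shuffled[ascii_uppercase.index(label)] == "Paris"
```

Passing the random generator explicitly keeps the shuffle reproducible across runs, which matters when scores from different models are compared on the same shuffled question set.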
Per-subject aggregation:
For benchmarks like MMLU, accuracy is computed per subject and then macro-averaged across subjects:
macro_accuracy = (1 / N_subjects) * sum(accuracy_subject_i for i in 1..N_subjects)
This prevents subjects with more questions from dominating the overall score.
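The difference between macro-averaging and pooling all questions can be seen in a small sketch. The function names and the `(subject, is_correct)` record format are assumptions for illustration, and the data is made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Result = Tuple[str, bool]  # (subject, is_correct)


def per_subject_accuracy(results: List[Result]) -> Dict[str, float]:
    """Accuracy computed separately for each subject."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for subject, is_correct in results:
        totals[subject] += 1
        correct[subject] += int(is_correct)
    return {subject: correct[subject] / totals[subject] for subject in totals}


def macro_accuracy(results: List[Result]) -> float:
    """Unweighted mean of per-subject accuracies."""
    per_subject = per_subject_accuracy(results)
    return sum(per_subject.values()) / len(per_subject)


def pooled_accuracy(results: List[Result]) -> float:
    """Accuracy over all questions, ignoring subject boundaries."""
    return sum(int(is_correct) for _, is_correct in results) / len(results)


# Illustrative data: a large subject answered poorly, a small one answered well.
results = (
    [("law", True)] * 2 + [("law", False)] * 6   # law: 2 of 8 correct
    + [("math", True)] * 2                       # math: 2 of 2 correct
)
macro = macro_accuracy(results)    # (0.25 + 1.0) / 2
pooled = pooled_accuracy(results)  # 4 / 10
```

In this example the pooled score is pulled toward the larger subject, while the macro-average weights both subjects equally.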