Implementation:Openai Evals MultipleChoice

Knowledge Sources	Openai_Evals
Domains	Evaluation, Question Answering
Last Updated	2026-02-14 10:00 GMT

Overview

Concrete eval for measuring accuracy on multiple-choice question-answering benchmarks, provided by the evals library.

Description

This module provides a complete pipeline for multiple-choice evaluation. The Sample Pydantic model holds a question, a list of answers, and the integer label index of the correct answer. The get_dataset helper function loads datasets from HuggingFace via a custom hf:// URL scheme, currently supporting hellaswag (mapping ctx and endings fields) and hendrycks_test (mapping question and choices fields) into Sample objects.
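The hf:// URL scheme described above can be decomposed with the standard library. The following is a minimal sketch, not the library's code: it assumes the dataset name sits in the host position of the URL and that each query parameter becomes one keyword argument for the dataset loader. The helper name parse_hf_url is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def parse_hf_url(url: str) -> tuple[str, dict[str, str]]:
    """Split an hf:// URL into (dataset_name, loader_kwargs). Hypothetical helper."""
    parsed = urlparse(url)
    if parsed.scheme != "hf":
        raise ValueError(f"Unknown question dataset {url}")
    # parse_qs returns a list of values per key; keep the first value of each.
    query = {key: values[0] for key, values in parse_qs(parsed.query).items()}
    # For hf://hellaswag?split=validation the dataset name lands in netloc.
    return parsed.netloc, query


name, kwargs = parse_hf_url("hf://hendrycks_test?split=test&name=abstract_algebra")
print(name, kwargs)
```

Under these assumptions, split selects the dataset split and name selects the subset configuration, matching the query parameters mentioned in this page.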

The MultipleChoice class extends evals.Eval. For each sample, eval_sample calls make_abc to shuffle the answer options and label them alphabetically (A, B, C, ...), then builds a prompt that combines the optional custom instructions with a request to respond with the correct letter. It calls the completion function with temperature=0.0 and max_tokens=1, which limits the completion to a single token, and records whether the output matches the correct letter via evals.record_and_check_match. The run method loads the dataset, evaluates all samples, and returns an accuracy metric.
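The shuffle-and-label step can be illustrated without the library. This is a sketch under stated assumptions: format_options is a hypothetical stand-in for make_abc (the real helper's output format may differ), build_prompt is likewise hypothetical, and the exact prompt wording is assumed for illustration.

```python
import random
import string


def format_options(
    answers: list[str], correct_idx: int, rng: random.Random
) -> tuple[str, str]:
    """Shuffle answers, label them A, B, C, ..., return (options_text, correct_letter).

    Hypothetical stand-in for make_abc; not the library's implementation.
    """
    order = list(range(len(answers)))
    rng.shuffle(order)  # a seeded rng makes the shuffle reproducible
    lines = []
    correct_letter = ""
    for position, original_idx in enumerate(order):
        letter = string.ascii_uppercase[position]
        lines.append(f"{letter}. {answers[original_idx]}")
        if original_idx == correct_idx:
            correct_letter = letter  # track where the correct answer ended up
    return "\n".join(lines), correct_letter


def build_prompt(instructions: str, question: str, options: str) -> str:
    """Join instructions, a letter-only request, the question, and the options."""
    # The request wording below is an assumption made for this sketch.
    return (
        instructions
        + "\nPlease answer with the letter of the correct answer."
        + "\n\n"
        + question
        + "\n"
        + options
    )


rng = random.Random(0)
options, correct = format_options(["3", "4", "5"], correct_idx=1, rng=rng)
prompt = build_prompt("You are a careful grader.", "What is 2 + 2?", options)
print(prompt)
print("Expected letter:", correct)
```

Shuffling per sample means the correct answer does not sit at a fixed letter position, so a model cannot score well by always choosing the same letter.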

Usage

Import MultipleChoice when you need to evaluate a model on a standard multiple-choice benchmark. The dataset parameter is specified as an hf:// URL with query parameters for split and subset configuration, and is typically set in the eval YAML spec.

Code Reference

Source Location

Repository: Openai_Evals
File: evals/elsuite/multiple_choice.py
Lines: 1-100

Signature

class Sample(BaseModel):
    question: str
    answers: list[str]
    label: int

def get_dataset(url: str) -> list[Sample]:
    ...

class MultipleChoice(evals.Eval):
    def __init__(
        self,
        completion_fns: list[CompletionFn],
        dataset: str,
        *args,
        instructions: Optional[str] = "",
        **kwargs,
    ):
        ...

    def eval_sample(self, sample, rng):
        ...

    def run(self, recorder: RecorderBase) -> dict:
        ...

Import

from evals.elsuite.multiple_choice import MultipleChoice

I/O Contract

Inputs

Sample (Pydantic model)

Name	Type	Required	Description
question	str	Yes	The question or context stem for the multiple-choice item
answers	list[str]	Yes	List of candidate answer strings
label	int	Yes	Zero-based index of the correct answer in the answers list

get_dataset

Name	Type	Required	Description
url	str	Yes	HuggingFace dataset URL in the format hf://dataset_name?split=...&name=...

MultipleChoice.__init__

Name	Type	Required	Description
completion_fns	list[CompletionFn]	Yes	List containing exactly one completion function to evaluate
dataset	str	Yes	HuggingFace dataset URL used by get_dataset to load samples
instructions	Optional[str]	No	Custom instructions prepended to each prompt (defaults to empty string)
*args	Any	No	Positional arguments forwarded to the parent evals.Eval constructor
**kwargs	Any	No	Keyword arguments forwarded to the parent evals.Eval constructor

eval_sample

Name	Type	Required	Description
sample	Sample	Yes	A Sample instance containing question, answers, and correct label
rng	Random	Yes	Random number generator used by make_abc to shuffle answer order

run

Name	Type	Required	Description
recorder	RecorderBase	Yes	Recorder instance that collects match events during evaluation

Outputs

get_dataset

Name	Type	Description
samples	list[Sample]	List of Sample objects parsed from the HuggingFace dataset

run

Name	Type	Description
accuracy	float	Value stored under the "accuracy" key of the returned dict: the fraction of samples where the model selected the correct answer letter
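The accuracy metric is the share of recorded match events that were correct. A minimal sketch, assuming each match event can be reduced to a dict with a boolean "correct" field; the event shape and the empty-input behavior are assumptions of this example, not confirmed details of the library.

```python
def accuracy(match_events: list[dict]) -> float:
    """Return the fraction of events whose "correct" flag is true."""
    if not match_events:
        # Choice made for this sketch: no events means the metric is undefined.
        return float("nan")
    num_correct = sum(1 for event in match_events if event["correct"])
    return num_correct / len(match_events)


events = [
    {"correct": True},
    {"correct": False},
    {"correct": True},
    {"correct": True},
]
print(f"Accuracy: {accuracy(events):.2%}")  # 3 of 4 correct
```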

Usage Examples

from evals.elsuite.multiple_choice import MultipleChoice, get_dataset, Sample
from evals.api import CompletionFn
from evals.record import RecorderBase

# Load a dataset manually to inspect samples
samples = get_dataset("hf://hellaswag?split=validation")
print(f"Loaded {len(samples)} samples")
print(f"First question: {samples[0].question}")

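# Run the full eval (typically configured via YAML).
# my_completion_fn (a CompletionFn) and recorder (a RecorderBase) are placeholders
# that must be created beforehand. Depending on the evals version, the parent
# evals.Eval constructor may need further arguments, forwarded via *args / **kwargs.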
mc_eval = MultipleChoice(
    completion_fns=[my_completion_fn],
    dataset="hf://hendrycks_test?split=test&name=abstract_algebra",
    instructions="You are an expert in abstract algebra.",
)
results = mc_eval.run(recorder)
print(f"Accuracy: {results['accuracy']:.2%}")
