Implementation:Openai Evals MultipleChoice

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Question Answering
Last Updated: 2026-02-14 10:00 GMT

Overview

Concrete eval for measuring accuracy on multiple-choice question-answering benchmarks, provided by the evals library.

Description

This module provides a complete pipeline for multiple-choice evaluation. The Sample Pydantic model holds a question, a list of answers, and the integer label index of the correct answer. The get_dataset helper function loads datasets from HuggingFace via a custom hf:// URL scheme, currently supporting hellaswag (mapping ctx and endings fields) and hendrycks_test (mapping question and choices fields) into Sample objects.
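The field mapping described above can be illustrated with a minimal sketch. It is not the library's code: `SampleSketch` is a hypothetical dataclass standing in for the Pydantic `Sample` model, `row_to_sample` is a hypothetical helper, and the names of the fields that carry the correct index (`label` for hellaswag, `answer` for hendrycks_test) are assumptions about the raw datasets rather than something this page states.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SampleSketch:
    """Hypothetical stand-in for the page's Pydantic Sample model."""
    question: str
    answers: list[str]
    label: int


def row_to_sample(dataset_name: str, row: dict[str, Any]) -> SampleSketch:
    """Map one raw dataset row onto the common question/answers/label shape."""
    if dataset_name == "hellaswag":
        # Assumption: the correct-ending index is stored under "label",
        # possibly as a string, so it is converted to int.
        return SampleSketch(row["ctx"], list(row["endings"]), int(row["label"]))
    if dataset_name == "hendrycks_test":
        # Assumption: the correct-choice index is stored under "answer".
        return SampleSketch(row["question"], list(row["choices"]), int(row["answer"]))
    raise ValueError(f"Unsupported dataset: {dataset_name}")


# Made-up row, shaped like the hellaswag fields named above.
example = row_to_sample(
    "hellaswag",
    {"ctx": "A man picks up a guitar and", "endings": ["sings.", "swims."], "label": "0"},
)
```

Normalizing both datasets into one shape is what lets the rest of the eval treat every benchmark identically.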

The MultipleChoice class extends evals.Eval. For each sample, eval_sample uses make_abc to shuffle the answer options and label them alphabetically (A, B, C, ...), builds a prompt that asks the model to respond with the correct letter (preceded by any custom instructions), calls the completion function with temperature=0.0 and max_tokens=1, and records the match result via evals.record_and_check_match. The run method loads the dataset, evaluates all samples, and returns an accuracy metric.
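The shuffle-and-label step can be sketched as follows. This is an illustration of the idea, not the implementation of `make_abc`: the helpers `label_options` and `build_prompt` are hypothetical, and the prompt wording is an assumption.

```python
import random
import string


def label_options(answers, correct_idx, rng):
    """Shuffle answers, label them A, B, C, ..., and return (options_text, correct_letter).

    Supports at most 26 options because labels come from the uppercase alphabet.
    """
    order = list(range(len(answers)))
    rng.shuffle(order)  # the caller-supplied rng makes the order reproducible
    lines = []
    correct_letter = None
    for position, original_idx in enumerate(order):
        letter = string.ascii_uppercase[position]
        lines.append(f"{letter}. {answers[original_idx]}")
        if original_idx == correct_idx:
            correct_letter = letter
    return "\n".join(lines), correct_letter


def build_prompt(question, options_text, instructions=""):
    """Assemble a prompt; the wording is illustrative, not the library's."""
    parts = [instructions] if instructions else []
    parts.append("Please answer with the letter of the correct answer.")
    parts.append(question)
    parts.append(options_text)
    return "\n\n".join(parts)


rng = random.Random(0)
options_text, correct_letter = label_options(["2", "3", "4"], correct_idx=2, rng=rng)
prompt = build_prompt("What is 2 + 2?", options_text, "You are a careful grader.")
```

Shuffling per sample keeps the correct answer from always sitting at the same position, and because the expected output is a single letter, a one-token completion is enough to score it.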

Usage

Import MultipleChoice when you need to evaluate a model on a standard multiple-choice benchmark. The dataset parameter is specified as an hf:// URL with query parameters for split and subset configuration, and is typically set in the eval YAML spec.

Code Reference

Source Location

Signature

class Sample(BaseModel):
    question: str
    answers: list[str]
    label: int

def get_dataset(url: str) -> list[Sample]:
    ...

class MultipleChoice(evals.Eval):
    def __init__(
        self,
        completion_fns: list[CompletionFn],
        dataset: str,
        *args,
        instructions: Optional[str] = "",
        **kwargs,
    ):
        ...

    def eval_sample(self, sample, rng):
        ...

    def run(self, recorder: RecorderBase) -> dict:
        ...

Import

from evals.elsuite.multiple_choice import MultipleChoice

I/O Contract

Inputs

Sample (Pydantic model)

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| question | str | Yes | The question or context stem for the multiple-choice item |
| answers | list[str] | Yes | List of candidate answer strings |
| label | int | Yes | Zero-based index of the correct answer in the answers list |

get_dataset

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| url | str | Yes | HuggingFace dataset URL in the format hf://dataset_name?split=...&name=... |
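To make the URL format concrete, here is a minimal sketch of how such a URL could be split into a dataset name and loader options using only the standard library. `parse_hf_url` is a hypothetical helper, not a function of the evals library. How `get_dataset` actually handles the URL is not shown on this page.

```python
from urllib.parse import parse_qs, urlparse


def parse_hf_url(url):
    """Split an hf:// URL into (dataset_name, options)."""
    parsed = urlparse(url)
    if parsed.scheme != "hf":
        raise ValueError(f"Expected an hf:// URL, got: {url}")
    # The dataset name is the host part plus any path (e.g. "org/dataset").
    dataset_name = parsed.netloc + parsed.path
    # parse_qs returns a list per key; keep the first value of each.
    options = {key: values[0] for key, values in parse_qs(parsed.query).items()}
    return dataset_name, options


name, options = parse_hf_url("hf://hendrycks_test?split=test&name=abstract_algebra")
```

Here `name` is `"hendrycks_test"` and `options` is `{"split": "test", "name": "abstract_algebra"}`. These values would presumably be handed to a dataset loader, with `name` selecting the subset and `split` selecting the split.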

MultipleChoice.__init__

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| completion_fns | list[CompletionFn] | Yes | List containing exactly one completion function to evaluate |
| dataset | str | Yes | HuggingFace dataset URL used by get_dataset to load samples |
| instructions | Optional[str] | No | Custom instructions prepended to each prompt (defaults to empty string) |
| *args | Any | No | Positional arguments forwarded to the parent evals.Eval constructor |
| **kwargs | Any | No | Keyword arguments forwarded to the parent evals.Eval constructor |

eval_sample

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| sample | Sample | Yes | A Sample instance containing question, answers, and correct label |
| rng | Random | Yes | Random number generator used by make_abc to shuffle answer order |

run

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| recorder | RecorderBase | Yes | Recorder instance that collects match events during evaluation |

Outputs

get_dataset

| Name | Type | Description |
| --- | --- | --- |
| samples | list[Sample] | List of Sample objects parsed from the HuggingFace dataset |

run

| Name | Type | Description |
| --- | --- | --- |
| accuracy | float | Fraction of samples where the model selected the correct answer letter |
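The accuracy metric is a simple fraction, sketched below for a list of per-sample match results. The `accuracy` helper is hypothetical and only illustrates the arithmetic. Returning 0.0 for an empty input is an assumption made here to avoid division by zero, not documented library behavior.

```python
def accuracy(matches):
    """Return the fraction of truthy entries in matches."""
    matches = list(matches)
    if not matches:
        # Assumption: report 0.0 when there is nothing to score.
        return 0.0
    return sum(1 for matched in matches if matched) / len(matches)


# Made-up letters picked by a model versus the expected letters.
picked = ["A", "C", "B", "B"]
expected = ["A", "B", "B", "B"]
score = accuracy(p == e for p, e in zip(picked, expected))
```

With three of the four letters matching, `score` is 0.75.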

Usage Examples

from evals.elsuite.multiple_choice import MultipleChoice, get_dataset, Sample
from evals.api import CompletionFn
from evals.record import RecorderBase

# Load a dataset manually to inspect samples
samples = get_dataset("hf://hellaswag?split=validation")
print(f"Loaded {len(samples)} samples")
print(f"First question: {samples[0].question}")

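# Run the full eval (typically configured via YAML).
# my_completion_fn (a CompletionFn) and recorder (a RecorderBase) are
# placeholders that must be created before this point. Depending on the evals
# version, the parent evals.Eval constructor may need further arguments,
# which MultipleChoice forwards via *args and **kwargs.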
mc_eval = MultipleChoice(
    completion_fns=[my_completion_fn],
    dataset="hf://hendrycks_test?split=test&name=abstract_algebra",
    instructions="You are an expert in abstract algebra.",
)
results = mc_eval.run(recorder)
print(f"Accuracy: {results['accuracy']:.2%}")
