Overview
Concrete eval for measuring accuracy on multiple-choice question-answering benchmarks, provided by the evals library.
Description
This module provides a complete pipeline for multiple-choice evaluation. The Sample Pydantic model holds a question, a list of answers, and the integer label index of the correct answer. The get_dataset helper function loads datasets from HuggingFace via a custom hf:// URL scheme, currently supporting hellaswag (mapping ctx and endings fields) and hendrycks_test (mapping question and choices fields) into Sample objects.
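The hf:// URL scheme described above can be illustrated with the standard library. This is a minimal sketch rather than the module's actual implementation: the helper name parse_hf_url is hypothetical, and it only shows how a dataset name and the split/name query parameters could be pulled out of such a URL.

```python
from urllib.parse import parse_qs, urlparse


def parse_hf_url(url: str) -> tuple[str, dict[str, str]]:
    """Split an hf:// URL into a dataset name and its query parameters.

    Hypothetical helper for illustration; not part of the evals library.
    """
    parsed = urlparse(url)
    if parsed.scheme != "hf":
        raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")
    # parse_qs returns a list of values per key; keep the first value of each.
    query = {key: values[0] for key, values in parse_qs(parsed.query).items()}
    # For hf://hellaswag?split=validation, the dataset name lands in netloc.
    return parsed.netloc, query


dataset_name, params = parse_hf_url(
    "hf://hendrycks_test?split=test&name=abstract_algebra"
)
print(dataset_name, params)
```

Under this reading, the part after hf:// selects the dataset (hellaswag or hendrycks_test), and the query parameters would be passed on as the split and subset configuration.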
The MultipleChoice class extends evals.Eval. For each sample, eval_sample performs four steps: it calls make_abc to shuffle the answer options and label them alphabetically (A, B, C, ...); it constructs a prompt that includes any custom instructions and asks the model to respond with the correct letter; it calls the completion function with temperature=0.0 and max_tokens=1, so the model returns a single deterministic token; and it records the match result via evals.record_and_check_match. The run method loads the dataset, evaluates all samples, and returns an accuracy metric.
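The shuffle-and-label step and the prompt assembly can be sketched as follows. This is an illustration under stated assumptions, not the library's code: make_abc_sketch and build_prompt are hypothetical names, and the "A) ..." option format and the exact prompt wording are guesses at what make_abc and eval_sample produce.

```python
import random


def make_abc_sketch(
    answers: list[str], correct_idx: int, rng: random.Random
) -> tuple[str, str]:
    """Shuffle answers, label them A, B, C, ..., and return (options, letter).

    Hypothetical stand-in for make_abc; the real output format may differ.
    """
    order = list(range(len(answers)))
    rng.shuffle(order)  # shuffling with the passed-in rng keeps runs reproducible
    lines = []
    correct_letter = ""
    for position, original_idx in enumerate(order):
        letter = chr(ord("A") + position)
        lines.append(f"{letter}) {answers[original_idx]}")
        if original_idx == correct_idx:
            # Track where the correct answer ended up after the shuffle.
            correct_letter = letter
    return "\n".join(lines), correct_letter


def build_prompt(instructions: str, question: str, options: str) -> str:
    """Join instructions, a letter-answer request, the question, and options.

    The wording of the request line is an assumption.
    """
    return (
        f"{instructions}\nPlease answer with the letter of the correct answer."
        f"\n\n{question}\n{options}"
    )


options, correct = make_abc_sketch(["4", "5", "6"], correct_idx=0, rng=random.Random(0))
prompt = build_prompt("You are a careful student.", "What is 2 + 2?", options)
print(prompt)
print("Expected letter:", correct)
```

Because the expected answer is a single letter, a completion limited to max_tokens=1 is enough to compare against it.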
Usage
Import MultipleChoice when you need to evaluate a model on a standard multiple-choice benchmark. The dataset parameter is specified as an hf:// URL with query parameters for split and subset configuration, and is typically set in the eval YAML spec.
Code Reference
Signature
class Sample(BaseModel):
    question: str
    answers: list[str]
    label: int

def get_dataset(url: str) -> list[Sample]:
    ...

class MultipleChoice(evals.Eval):
    def __init__(
        self,
        completion_fns: list[CompletionFn],
        dataset: str,
        *args,
        instructions: Optional[str] = "",
        **kwargs,
    ):
        ...

    def eval_sample(self, sample, rng):
        ...

    def run(self, recorder: RecorderBase) -> dict:
        ...
Import
from evals.elsuite.multiple_choice import MultipleChoice
I/O Contract
Inputs
Sample (Pydantic model)
| Name | Type | Required | Description |
| question | str | Yes | The question or context stem for the multiple-choice item |
| answers | list[str] | Yes | List of candidate answer strings |
| label | int | Yes | Zero-based index of the correct answer in the answers list |
get_dataset
| Name | Type | Required | Description |
| url | str | Yes | HuggingFace dataset URL in the format hf://dataset_name?split=...&name=... |
MultipleChoice.__init__
| Name | Type | Required | Description |
| completion_fns | list[CompletionFn] | Yes | List containing exactly one completion function to evaluate |
| dataset | str | Yes | HuggingFace dataset URL used by get_dataset to load samples |
| instructions | Optional[str] | No | Custom instructions prepended to each prompt (defaults to empty string) |
| *args | Any | No | Positional arguments forwarded to the parent evals.Eval constructor |
| **kwargs | Any | No | Keyword arguments forwarded to the parent evals.Eval constructor |
eval_sample
| Name | Type | Required | Description |
| sample | Sample | Yes | A Sample instance containing question, answers, and correct label |
| rng | Random | Yes | Random number generator used by make_abc to shuffle answer order |
run
| Name | Type | Required | Description |
| recorder | RecorderBase | Yes | Recorder instance that collects match events during evaluation |
Outputs
get_dataset
| Name | Type | Description |
| samples | list[Sample] | List of Sample objects parsed from the HuggingFace dataset |
run
| Name | Type | Description |
| accuracy | float | Fraction of samples where the model selected the correct answer letter |
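The accuracy value is a plain fraction of matched samples. The sketch below shows that arithmetic on a list of per-sample match results; the function name accuracy_from_matches is hypothetical, and returning 0.0 for an empty list is an assumption of this sketch, not documented behavior of the library.

```python
def accuracy_from_matches(matches: list[bool]) -> float:
    """Return the fraction of True entries in matches.

    Hypothetical helper; the empty-input value of 0.0 is an assumption.
    """
    if not matches:
        return 0.0
    # True counts as 1 and False as 0 when summed.
    return sum(matches) / len(matches)


# Three correct letters out of four samples.
print(accuracy_from_matches([True, False, True, True]))  # 0.75
```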
Usage Examples
from evals.elsuite.multiple_choice import MultipleChoice, get_dataset, Sample
from evals.api import CompletionFn
from evals.record import RecorderBase

# Load a dataset manually to inspect samples
samples = get_dataset("hf://hellaswag?split=validation")
print(f"Loaded {len(samples)} samples")
print(f"First question: {samples[0].question}")

# Run the full eval (typically configured via YAML).
# my_completion_fn (a CompletionFn) and recorder (a RecorderBase) are
# placeholders that you must create yourself. The parent evals.Eval
# constructor may require further arguments, passed through *args/**kwargs.
mc_eval = MultipleChoice(
    completion_fns=[my_completion_fn],
    dataset="hf://hendrycks_test?split=test&name=abstract_algebra",
    instructions="You are an expert in abstract algebra.",
)
results = mc_eval.run(recorder)
print(f"Accuracy: {results['accuracy']:.2%}")