Implementation:Openai Evals ModelGradedSpec
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, LLM_as_Judge |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
A concrete dataclass, provided by the evals modelgraded module, for defining model-graded evaluation specifications.
Description
ModelGradedSpec is a pydantic dataclass that stores the configuration for LLM-as-judge evaluations: the evaluation prompt template, the valid choice strings, the input-to-output field mapping, and optional per-choice scores. Specs are loaded from YAML files in evals/registry/modelgraded/ by the Registry system and consumed by ModelBasedClassify and the classify function.
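The relationship between these fields can be sanity-checked before a spec is used. The following is a minimal sketch using only the standard library: the `spec` dict holds illustrative values, and `template_fields` is a hypothetical helper, not part of evals.

```python
from string import Formatter

# Illustrative values only; field names follow ModelGradedSpec.
spec = {
    "prompt": "Question: {input}\nExpert: {ideal}\nSubmission: {completion}\nIs the submission correct?",
    "choice_strings": ["Yes", "No", "Unsure"],
    "choice_scores": {"Yes": 1.0, "No": 0.0, "Unsure": 0.5},
}


def template_fields(template: str) -> set[str]:
    """Hypothetical helper: names of the {placeholders} in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


fields = template_fields(spec["prompt"])
print(sorted(fields))  # ['completion', 'ideal', 'input']

# Every scored choice should also be a declared choice string.
unknown = set(spec["choice_scores"]) - set(spec["choice_strings"])
print(sorted(unknown))  # []
```

A check like this catches a misspelled placeholder or a score assigned to an undeclared choice before any model call is made.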
Usage
Define a ModelGradedSpec as a YAML file when creating model-graded evaluations. Reference it by its registry name, the top-level key of the YAML entry (typically the same as the filename without extension), in the modelgraded_spec argument of ModelBasedClassify.
Code Reference
Source Location
- Repository: openai/evals
- File: evals/elsuite/modelgraded/base.py (lines 11-26)
Signature
@dataclass
class ModelGradedSpec:
    # Required fields
    prompt: Union[str, OpenAICreateChatPrompt]
    choice_strings: Union[list[str], str]
    input_outputs: dict[str, str]

    # Optional fields
    eval_type: Optional[str] = None
    choice_scores: Optional[Union[dict[str, float], str]] = None
    output_template: Optional[str] = None

    # Registry metadata
    key: Optional[str] = None
    group: Optional[str] = None
Import
from evals.elsuite.modelgraded.base import ModelGradedSpec
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | Union[str, list[dict]] | Yes | Evaluation prompt template with {placeholders} |
| choice_strings | Union[list[str], str] | Yes | Valid answers: list of strings, "from_n", "from_n_abc", or "from_n_ABC" |
| input_outputs | dict[str, str] | Yes | Maps each sample input key to the template variable that receives the completion generated for it |
| eval_type | Optional[str] | No | "classify", "classify_cot", or "cot_classify" |
| choice_scores | Optional[Union[dict, str]] | No | Numeric scores per choice, or "from_strings" |
| output_template | Optional[str] | No | Template for formatting multi-completion output |
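The shorthand values for choice_strings and choice_scores can be read as in the sketch below. This is not the library's code: `expand_choice_strings` and `resolve_choice_scores` are hypothetical helpers, and the expansions shown (1-based digits for "from_n", letters for the "abc" variants, `float()` of each choice for "from_strings") are assumptions about how evals interprets them, with `n` taken to be the number of completions being compared.

```python
import string
from typing import Optional, Union


def expand_choice_strings(
    choice_strings: Union[list[str], str], n: Optional[int] = None
) -> list[str]:
    """Hypothetical helper: turn a shorthand value into an explicit list of choices."""
    if isinstance(choice_strings, list):
        return list(choice_strings)
    if n is None:
        raise ValueError("shorthand choice_strings require n")
    if choice_strings == "from_n":
        return [str(i + 1) for i in range(n)]  # assumed: "1", "2", ..., "n"
    if choice_strings == "from_n_abc":
        return list(string.ascii_lowercase[:n])  # assumed: "a", "b", ...
    if choice_strings == "from_n_ABC":
        return list(string.ascii_uppercase[:n])  # assumed: "A", "B", ...
    raise ValueError(f"unrecognized shorthand: {choice_strings!r}")


def resolve_choice_scores(
    choice_scores: Optional[Union[dict[str, float], str]], choices: list[str]
) -> Optional[dict[str, float]]:
    """Hypothetical helper: make per-choice scores explicit."""
    if choice_scores is None:
        return None
    if choice_scores == "from_strings":
        # Assumed: each choice string is itself numeric, e.g. "1" -> 1.0.
        return {c: float(c) for c in choices}
    return dict(choice_scores)


choices = expand_choice_strings("from_n", n=3)
print(choices)  # ['1', '2', '3']
print(resolve_choice_scores("from_strings", choices))  # {'1': 1.0, '2': 2.0, '3': 3.0}
print(expand_choice_strings("from_n_ABC", n=2))  # ['A', 'B']
```

Under these assumptions, "from_strings" only makes sense when every choice string parses as a number, which is why it pairs naturally with "from_n".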
Outputs
| Name | Type | Description |
|---|---|---|
| ModelGradedSpec instance | ModelGradedSpec | Configured spec ready for use by classify() or ModelBasedClassify |
Usage Examples
YAML Spec for Factual Accuracy
# File: evals/registry/modelgraded/fact.yaml
# The top-level key is the name under which the registry stores the spec.
fact:
  prompt: >
    You are comparing a submitted answer to an expert answer on a given question.
    [Q]: {input}
    [A]: {ideal}
    [Submission]: {completion}
    Compare the submitted answer to the expert answer. Is the submission correct, incorrect, or unsure?
  choice_strings:
    - "Yes"
    - "No"
    - "Unsure"
  input_outputs:
    input: completion
    ideal: expected
  eval_type: cot_classify
  choice_scores:
    "Yes": 1.0
    "No": 0.0
    "Unsure": 0.5
Programmatic Construction
from evals.elsuite.modelgraded.base import ModelGradedSpec
spec = ModelGradedSpec(
    prompt="Is the following answer correct? {input} Answer: {completion} Expected: {ideal}",
    choice_strings=["Yes", "No"],
    input_outputs={"input": "completion", "ideal": "expected"},
    eval_type="classify",
    choice_scores={"Yes": 1.0, "No": 0.0},
)
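To see how the fields fit together, the sketch below fills the prompt template and turns a grader's choice into a score. It uses a stand-in dataclass (`SpecSketch`, hypothetical) with a subset of the same field names so that it runs without evals installed. `render_prompt` and `score_choice` are likewise illustrative helpers; the real classify function may format prompts and parse answers differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecSketch:
    """Stand-in with a subset of ModelGradedSpec's fields (not the evals class)."""

    prompt: str
    choice_strings: list[str]
    input_outputs: dict[str, str]
    eval_type: Optional[str] = None
    choice_scores: Optional[dict[str, float]] = None


def render_prompt(spec: SpecSketch, **values: str) -> str:
    """Fill the {placeholders} in the prompt template."""
    return spec.prompt.format(**values)


def score_choice(spec: SpecSketch, choice: str) -> Optional[float]:
    """Look up the score for a grader choice; None if the spec defines no scores."""
    if choice not in spec.choice_strings:
        raise ValueError(f"{choice!r} is not one of {spec.choice_strings}")
    if spec.choice_scores is None:
        return None
    return spec.choice_scores[choice]


spec = SpecSketch(
    prompt="Is the following answer correct? {input} Answer: {completion} Expected: {ideal}",
    choice_strings=["Yes", "No"],
    input_outputs={"input": "completion", "ideal": "expected"},
    eval_type="classify",
    choice_scores={"Yes": 1.0, "No": 0.0},
)

text = render_prompt(spec, input="What is 2 + 2?", completion="4", ideal="4")
print(text)
print(score_choice(spec, "Yes"))  # 1.0
```

Keeping rendering and scoring as separate steps mirrors the split in the spec itself: the prompt and choice_strings shape what the grader sees, while choice_scores only matters once a choice has been extracted.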