Workflow: OpenAI Evals - Creating a Model-Graded Eval
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Model_Testing, Prompt_Engineering |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
End-to-end process for creating a model-graded evaluation where an LLM judges the quality of its own (or another model's) completions using a custom evaluation prompt.
Description
This workflow covers the creation of model-graded evaluations using the ModelBasedClassify template. Unlike basic evals that check for exact or fuzzy string matches, model-graded evals use a second LLM call to evaluate the quality of the first response. The model's completion to the original prompt is wrapped in an evaluation prompt, and the model produces a judgment (e.g., A/B/C/D or Yes/No) that is parsed into metrics. The framework provides pre-built evaluation prompts for factual accuracy, closed-QA, head-to-head battles, diversity, and humor, but custom evaluation prompts can be created for any use case.
Usage
Execute this workflow when the expected model response has significant variation and cannot be reliably evaluated by simple string matching. This is ideal for open-ended question answering, creative tasks, comparative evaluations, or any scenario where a rubric-based assessment is more appropriate than exact matching.
Execution Steps
Step 1: Select or Design the Evaluation Prompt
Choose from existing model-graded specs (fact, closedqa, battle, diversity, humor, etc.) or design a custom evaluation prompt. The prompt should prime the model to answer in an easily parsable format (multiple choice or yes/no). Place custom specs in a YAML file under evals/registry/modelgraded/. The spec defines the evaluation prompt template, choice strings, optional choice scores, and the input-output mapping.
Key considerations:
- Existing specs cover factual accuracy, QA, pairwise comparison, and more
- The prompt template uses curly braces for variable substitution (e.g., {completion})
- choice_strings defines the valid responses (e.g., "ABCDE" or ["Yes", "No", "Unsure"])
- choice_scores maps each choice to a numeric score for metric aggregation
- eval_type controls the expected response format: cot_classify, classify_cot, or classify
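A custom spec following this structure might look like the sketch below. The file name, prompt wording, and choice values are illustrative, not from the evals repository; only the field names (prompt, choice_strings, choice_scores, input_outputs) follow the ModelBasedClassify spec format described above.

```yaml
# evals/registry/modelgraded/helpfulness.yaml  (hypothetical spec name)
helpfulness:
  prompt: |-
    You are assessing a submitted answer for helpfulness.
    [Question]: {input}
    [Submission]: {completion}
    Is the submission helpful, on-topic, and complete?
    Answer with one of: "Yes", "No", "Unsure".
  choice_strings: ["Yes", "No", "Unsure"]
  choice_scores:
    "Yes": 1.0
    "Unsure": 0.5
    "No": 0.0
  input_outputs:
    input: completion
```

The curly-brace placeholders ({input}, {completion}) are filled per sample at evaluation time, and choice_scores lets the framework aggregate the parsed judgments into a numeric metric.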
Step 2: Prepare the Dataset
Create a JSONL dataset where each sample contains the keys required by the evaluation prompt template. For standard model-graded evals, include "input" (the prompt to the model) and any keys referenced in the evaluation prompt. If creating a meta-eval for quality assurance, also include "choice" labels with human-provided ground truth judgments.
Key considerations:
- Required keys depend on the evaluation prompt's variable placeholders
- A superset of keys can be included to support multiple eval types from one dataset
- Place data at evals/registry/data/<eval_name>/samples.jsonl
- Human-labeled "choice" keys enable meta-eval validation
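Two illustrative JSONL samples are sketched below (the question text and labels are made up). The "input" key holds the chat-format prompt sent to the model under test; the optional "choice" key carries a human ground-truth judgment for meta-eval use.

```jsonl
{"input": [{"role": "user", "content": "Explain why the sky is blue."}], "choice": "Yes"}
{"input": [{"role": "user", "content": "Summarize the plot of Hamlet."}], "choice": "Unsure"}
```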
Step 3: Configure the Eval Type
Set the eval_type parameter to control how the evaluation model formats its response. The recommended default is "cot_classify" (chain-of-thought then classify), where the model reasons about the quality before stating its judgment at the end. Alternatives are "classify_cot" (answer then reason) and "classify" (judgment only). Specifying eval_type in the eval registry YAML automatically appends an appropriate instruction to the evaluation prompt.
Key considerations:
- cot_classify typically provides the most accurate evaluations
- The instruction is auto-appended when eval_type is set in the eval YAML
- If eval_type is set in the modelgraded YAML instead, include the instruction in the prompt manually
- output_template can customize how model outputs are formatted in the evaluation prompt
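In cot_classify mode the judgment appears after the reasoning, so the parser must scan the response for the final occurrence of a valid choice string and fall back to an invalid marker otherwise. This is a simplified sketch of that parsing logic, not the library's actual implementation:

```python
import re

INVALID = "__invalid__"

def parse_choice(response: str, choice_strings: list[str]) -> str:
    """Return the last valid choice string appearing as a whole word
    in the response, or INVALID if none is found."""
    # Build a pattern matching any choice, e.g. r"\b(Yes|No|Unsure)\b".
    pattern = r"\b(" + "|".join(re.escape(c) for c in choice_strings) + r")\b"
    matches = re.findall(pattern, response)
    return matches[-1] if matches else INVALID

# cot_classify: chain-of-thought first, judgment last.
cot_response = "The answer covers all key points and is accurate.\nYes"
print(parse_choice(cot_response, ["Yes", "No", "Unsure"]))        # -> Yes
print(parse_choice("I cannot tell.", ["Yes", "No", "Unsure"]))    # -> __invalid__
```

Taking the last match rather than the first is what makes cot_classify robust: choice words that merely appear inside the reasoning do not override the final stated judgment.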
Step 4: Register the Eval
Create a YAML file in evals/registry/evals/ that references the model-graded class (evals.elsuite.modelgraded.classify:ModelBasedClassify) and points to the evaluation prompt spec and dataset. Include the eval_type in the args section. Optionally register a companion meta-eval that evaluates the quality of the model-graded eval itself by comparing its judgments against human-provided labels.
Key considerations:
- The class path is evals.elsuite.modelgraded.classify:ModelBasedClassify
- Reference the modelgraded spec by name in the eval args
- Meta-evals use labeled data to measure evaluator accuracy (metascore)
- Aim for metascore close to 1.0 for reliable model-graded evals
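A registry entry following this step might be sketched as below. The eval name, version suffix, and data path are hypothetical; the class path is the real ModelBasedClassify class, and the args fields (samples_jsonl, eval_type, modelgraded_spec) follow the registry format described above.

```yaml
# evals/registry/evals/helpfulness.yaml  (hypothetical eval name)
helpfulness:
  id: helpfulness.dev.v0
helpfulness.dev.v0:
  class: evals.elsuite.modelgraded.classify:ModelBasedClassify
  args:
    samples_jsonl: helpfulness/samples.jsonl
    eval_type: cot_classify
    modelgraded_spec: helpfulness
```

A companion meta-eval would register a second entry pointing at the human-labeled dataset so that the evaluator's judgments can be scored against the "choice" labels.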
Step 5: Run and Validate
Execute the eval with oaieval and examine the results. The output includes per-sample judgments and aggregate scores. If a meta-eval was created, run it to check the evaluator's accuracy against human labels. Iterate on the evaluation prompt, choice labels, and eval_type to improve evaluation quality.
Key considerations:
- The evaluation uses two API calls per sample: one for the original completion, one for grading
- Monitor token usage as model-graded evals consume more tokens than basic evals
- Check for __invalid__ judgments indicating the model returned unexpected responses
- Iterate on the evaluation prompt to reduce invalid responses
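The aggregation performed in this step can be illustrated with a short sketch. The judgments, scores, and labels below are made up; the sketch shows how choice_scores turn parsed judgments into a mean score, how the invalid rate is measured, and how a metascore compares judgments against human labels:

```python
# Hypothetical per-sample judgments from a model-graded run.
judgments = ["Yes", "Yes", "No", "Unsure", "__invalid__"]
choice_scores = {"Yes": 1.0, "Unsure": 0.5, "No": 0.0}

# Mean score over valid judgments; invalid responses are excluded
# but tracked so the evaluation prompt can be iterated on.
valid = [j for j in judgments if j in choice_scores]
invalid_rate = 1 - len(valid) / len(judgments)
mean_score = sum(choice_scores[j] for j in valid) / len(valid)

# Metascore for a meta-eval: fraction of evaluator judgments that
# agree with human-provided ground-truth labels.
human_labels = ["Yes", "No", "No", "Unsure", "Yes"]
metascore = sum(j == h for j, h in zip(judgments, human_labels)) / len(judgments)

print(f"mean_score={mean_score} invalid_rate={invalid_rate} metascore={metascore}")
```

A metascore well below 1.0, or a high invalid rate, signals that the evaluation prompt, choice labels, or eval_type needs another iteration before the eval's scores can be trusted.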