Workflow: OpenAI Evals - Creating a Model-Graded Eval
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Model_Testing, Prompt_Engineering |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
End-to-end process for creating a model-graded evaluation where an LLM judges the quality of its own (or another model's) completions using a custom evaluation prompt.
Description
This workflow covers the creation of model-graded evaluations using the ModelBasedClassify template. Unlike basic evals that check for exact or fuzzy string matches, model-graded evals use a second LLM call to evaluate the quality of the first response. The model's completion to the original prompt is wrapped in an evaluation prompt, and the model produces a judgment (e.g., A/B/C/D or Yes/No) that is parsed into metrics. The framework provides pre-built evaluation prompts for factual accuracy, closed-QA, head-to-head battles, diversity, and humor, but custom evaluation prompts can be created for any use case.
Usage
Execute this workflow when the expected model response has significant variation and cannot be reliably evaluated by simple string matching. This is ideal for open-ended question answering, creative tasks, comparative evaluations, or any scenario where a rubric-based assessment is more appropriate than exact matching.
Execution Steps
Step 1: Select or Design the Evaluation Prompt
Choose from existing model-graded specs (fact, closedqa, battle, diversity, humor, etc.) or design a custom evaluation prompt. The prompt should prime the model to answer in an easily parsable format (multiple choice or yes/no). Place custom specs in a YAML file under evals/registry/modelgraded/. The spec defines the evaluation prompt template, choice strings, optional choice scores, and the input-output mapping.
Key considerations:
- Existing specs cover factual accuracy, QA, pairwise comparison, and more
- The prompt template uses curly braces for variable substitution (e.g., {completion})
- choice_strings defines the valid responses (e.g., "ABCDE" or ["Yes", "No", "Unsure"])
- choice_scores maps each choice to a numeric score for metric aggregation
- eval_type controls the expected response format: cot_classify, classify_cot, or classify
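A custom spec following this structure might look like the sketch below. The file name, prompt wording, and choice values are illustrative, not from the evals repository; only the field names (prompt, choice_strings, choice_scores, input_outputs) follow the ModelBasedClassify spec format described above.

```yaml
# evals/registry/modelgraded/helpfulness.yaml  (hypothetical spec name)
helpfulness:
  prompt: |-
    You are assessing a submitted answer for helpfulness.
    [Question]: {input}
    [Submission]: {completion}
    Is the submission helpful, on-topic, and complete?
    Answer with one of: "Yes", "No", "Unsure".
  choice_strings: ["Yes", "No", "Unsure"]
  choice_scores:
    "Yes": 1.0
    "Unsure": 0.5
    "No": 0.0
  input_outputs:
    input: completion
```

The curly-brace placeholders ({input}, {completion}) are filled per sample at evaluation time, and choice_scores lets the framework aggregate the parsed judgments into a numeric metric.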
Step 2: Prepare the Dataset
Create a JSONL dataset where each sample contains the keys required by the evaluation prompt template. For standard model-graded evals, include "input" (the prompt to the model) and any keys referenced in the evaluation prompt. If creating a meta-eval for quality assurance, also include "choice" labels with human-provided ground truth judgments.
Key considerations:
- Required keys depend on the evaluation prompt's variable placeholders
- A superset of keys can be included to support multiple eval types from one dataset
- Place data at evals/registry/data/<eval_name>/samples.jsonl
- Human-labeled "choice" keys enable meta-eval validation
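Two illustrative JSONL samples are sketched below (the question text and labels are made up). The "input" key holds the chat-format prompt sent to the model under test; the optional "choice" key carries a human ground-truth judgment for meta-eval use.

```jsonl
{"input": [{"role": "user", "content": "Explain why the sky is blue."}], "choice": "Yes"}
{"input": [{"role": "user", "content": "Summarize the plot of Hamlet."}], "choice": "Unsure"}
```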
Step 3: Configure the Eval Type
Set the eval_type parameter to control how the evaluation model formats its response. The recommended default is "cot_classify" (chain-of-thought then classify), where the model reasons about the quality before stating its judgment at the end. Alternatives are "classify_cot" (answer then reason) and "classify" (judgment only). Specifying eval_type in the eval registry YAML automatically appends an appropriate instruction to the evaluation prompt.
Key considerations:
- cot_classify typically provides the most accurate evaluations
- The instruction is auto-appended when eval_type is set in the eval YAML
- If eval_type is set in the modelgraded YAML instead, include the instruction in the prompt manually
- output_template can customize how model outputs are formatted in the evaluation prompt
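In cot_classify mode the judgment appears after the reasoning, so the parser must scan the response for the final occurrence of a valid choice string and fall back to an invalid marker otherwise. This is a simplified sketch of that parsing logic, not the library's actual implementation:

```python
import re

INVALID = "__invalid__"

def parse_choice(response: str, choice_strings: list[str]) -> str:
    """Return the last valid choice string appearing as a whole word
    in the response, or INVALID if none is found."""
    # Build a pattern matching any choice, e.g. r"\b(Yes|No|Unsure)\b".
    pattern = r"\b(" + "|".join(re.escape(c) for c in choice_strings) + r")\b"
    matches = re.findall(pattern, response)
    return matches[-1] if matches else INVALID

# cot_classify: chain-of-thought first, judgment last.
cot_response = "The answer covers all key points and is accurate.\nYes"
print(parse_choice(cot_response, ["Yes", "No", "Unsure"]))        # -> Yes
print(parse_choice("I cannot tell.", ["Yes", "No", "Unsure"]))    # -> __invalid__
```

Taking the last match rather than the first is what makes cot_classify robust: choice words that merely appear inside the reasoning do not override the final stated judgment.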
Step 4: Register the Eval
Create a YAML file in evals/registry/evals/ that references the model-graded class (evals.elsuite.modelgraded.classify:ModelBasedClassify) and points to the evaluation prompt spec and dataset. Include the eval_type in the args section. Optionally register a companion meta-eval that evaluates the quality of the model-graded eval itself by comparing its judgments against human-provided labels.
Key considerations:
- The class path is evals.elsuite.modelgraded.classify:ModelBasedClassify
- Reference the modelgraded spec by name in the eval args
- Meta-evals use labeled data to measure evaluator accuracy (metascore)
- Aim for metascore close to 1.0 for reliable model-graded evals
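A registry entry following this step might be sketched as below. The eval name, version suffix, and data path are hypothetical; the class path is the real ModelBasedClassify class, and the args fields (samples_jsonl, eval_type, modelgraded_spec) follow the registry format described above.

```yaml
# evals/registry/evals/helpfulness.yaml  (hypothetical eval name)
helpfulness:
  id: helpfulness.dev.v0
helpfulness.dev.v0:
  class: evals.elsuite.modelgraded.classify:ModelBasedClassify
  args:
    samples_jsonl: helpfulness/samples.jsonl
    eval_type: cot_classify
    modelgraded_spec: helpfulness
```

A companion meta-eval would register a second entry pointing at the human-labeled dataset so that the evaluator's judgments can be scored against the "choice" labels.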
Step 5: Run and Validate
Execute the eval with oaieval and examine the results. The output includes per-sample judgments and aggregate scores. If a meta-eval was created, run it to check the evaluator's accuracy against human labels. Iterate on the evaluation prompt, choice labels, and eval_type to improve evaluation quality.
Key considerations:
- The evaluation uses two API calls per sample: one for the original completion, one for grading
- Monitor token usage as model-graded evals consume more tokens than basic evals
- Check for __invalid__ judgments indicating the model returned unexpected responses
- Iterate on the evaluation prompt to reduce invalid responses
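The aggregation performed in this step can be illustrated with a short sketch. The judgments, scores, and labels below are made up; the sketch shows how choice_scores turn parsed judgments into a mean score, how the invalid rate is measured, and how a metascore compares judgments against human labels:

```python
# Hypothetical per-sample judgments from a model-graded run.
judgments = ["Yes", "Yes", "No", "Unsure", "__invalid__"]
choice_scores = {"Yes": 1.0, "Unsure": 0.5, "No": 0.0}

# Mean score over valid judgments; invalid responses are excluded
# but tracked so the evaluation prompt can be iterated on.
valid = [j for j in judgments if j in choice_scores]
invalid_rate = 1 - len(valid) / len(judgments)
mean_score = sum(choice_scores[j] for j in valid) / len(valid)

# Metascore for a meta-eval: fraction of evaluator judgments that
# agree with human-provided ground-truth labels.
human_labels = ["Yes", "No", "No", "Unsure", "Yes"]
metascore = sum(j == h for j, h in zip(judgments, human_labels)) / len(judgments)

print(f"mean_score={mean_score} invalid_rate={invalid_rate} metascore={metascore}")
```

A metascore well below 1.0, or a high invalid rate, signals that the evaluation prompt, choice labels, or eval_type needs another iteration before the eval's scores can be trusted.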