Heuristic:Openai Evals Model Graded Eval Design
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Optimization |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
Design guidance for choosing between `classify` and `cot_classify` eval types when building model-graded evaluations, balancing accuracy against cost.
Description
The OpenAI Evals framework provides two primary eval types for model-graded evaluations: `classify` (direct classification) and `cot_classify` (chain-of-thought classification). The `classify` type asks the grading model to directly output a classification label, while `cot_classify` first prompts the model to reason step-by-step before giving its classification. The `classify` function in `evals/elsuite/modelgraded/classify_utils.py` implements both modes, with `cot_classify` appending a chain-of-thought instruction to the evaluation prompt.
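The difference between the two modes can be sketched as a prompt-building step. This is a minimal illustration, not the framework's code: the function name `build_grading_prompt` and both instruction strings are hypothetical, and the real templates in `classify_utils.py` may be worded differently.

```python
from typing import Literal

EvalType = Literal["classify", "cot_classify"]

# Hypothetical instruction wording; the framework's actual templates may differ.
DIRECT_INSTRUCTION = "Answer by printing only a single choice from {choices}."
COT_INSTRUCTION = (
    "First, write out your reasoning step by step to be sure your conclusion "
    "is correct. Then, on a new line, print only a single choice from {choices}."
)


def build_grading_prompt(task_prompt: str, choices: list[str], eval_type: EvalType) -> str:
    """Append the mode-specific answer instruction to the grading prompt."""
    if eval_type == "classify":
        instruction = DIRECT_INSTRUCTION
    elif eval_type == "cot_classify":
        instruction = COT_INSTRUCTION
    else:
        raise ValueError(f"unsupported eval_type: {eval_type!r}")
    # Render the choices as: "Y" or "N"
    quoted = " or ".join(f'"{c}"' for c in choices)
    return f"{task_prompt}\n\n{instruction.format(choices=quoted)}"
```

Only the appended instruction changes between modes; the task prompt and the set of valid choices stay the same, which is why switching an existing eval from one mode to the other is usually a small configuration change.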
Usage
Use this heuristic when designing a new model-graded eval and choosing between the `classify` and `cot_classify` eval types, or when tuning the cost-accuracy trade-off of an existing model-graded eval.
The Insight (Rule of Thumb)
- Action: Use `classify` for straightforward grading tasks where the correct label is obvious from the output.
- Value: Lower cost (fewer tokens) and faster execution.
- Trade-off: May produce less accurate grades for nuanced or complex evaluation criteria.
- Action: Use `cot_classify` for complex grading tasks requiring nuanced judgment.
- Value: Higher grading accuracy due to chain-of-thought reasoning.
- Trade-off: Higher cost and slower execution, because the written reasoning adds output tokens. Expect roughly 2-3x the token usage of `classify`.
- Action: Choose the grading model carefully. It should be at least as capable as the model being evaluated, so use a stronger model (e.g., gpt-4) as the grader when evaluating weaker models.
- Value: Grades are more likely to be reliable, since a grader weaker than the evaluated model may misjudge outputs it could not have produced itself.
- Trade-off: Stronger grading models cost more per evaluation sample.
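The cost side of these trade-offs can be estimated with simple arithmetic before committing to a mode. The sketch below is illustrative only: `GraderPricing`, `grading_cost`, the prices, and the token counts are all hypothetical placeholders, not measurements or published rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraderPricing:
    """Hypothetical per-1K-token prices in USD; substitute real rates."""
    input_per_1k: float
    output_per_1k: float


def grading_cost(
    n_samples: int, input_tokens: int, output_tokens: int, pricing: GraderPricing
) -> float:
    """Estimated cost of grading n_samples, given average tokens per sample."""
    per_sample = (
        (input_tokens / 1000) * pricing.input_per_1k
        + (output_tokens / 1000) * pricing.output_per_1k
    )
    return n_samples * per_sample


# Illustrative numbers only: a direct grade emits a couple of tokens,
# while a chain-of-thought grade emits a paragraph of reasoning first.
pricing = GraderPricing(input_per_1k=0.03, output_per_1k=0.06)
direct = grading_cost(1000, input_tokens=200, output_tokens=2, pricing=pricing)
cot = grading_cost(1000, input_tokens=230, output_tokens=280, pricing=pricing)
print(f"classify: ${direct:.2f}  cot_classify: ${cot:.2f}  ratio: {cot / direct:.1f}x")
```

With these assumed figures the token usage grows by about 2.5x, but the cost grows by more, because the extra tokens are output tokens and output is priced higher in this example. Plugging in measured token counts from a small pilot run gives a more trustworthy estimate.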
Reasoning
Chain-of-thought prompting generally improves LLM accuracy on tasks that require multi-step reasoning. In a model-graded evaluation, the grading model must judge the quality of a response. For simple factual checks (e.g., "Did the response contain the correct answer?"), direct classification suffices. For subjective or multi-criteria evaluations (e.g., "Is this response helpful, accurate, and well-structured?"), CoT reasoning lets the grading model weigh each criterion before committing to a label.
The `ModelGradedSpec` dataclass captures the evaluation prompt template, the valid choice strings, and whether to use chain-of-thought. This configuration is stored in YAML files under `evals/registry/modelgraded/`.
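To make the shape of such a configuration concrete, here is a minimal sketch that builds a spec-like object from a mapping of the kind a YAML loader might return. `GradingSpec`, `spec_from_mapping`, and the example entry are hypothetical stand-ins, not the framework's `ModelGradedSpec` or a real registry file, and the sketch assumes list-valued `choice_strings` only.

```python
from dataclasses import dataclass
from typing import Any

# The two eval types discussed in this heuristic.
VALID_EVAL_TYPES = {"classify", "cot_classify"}


@dataclass
class GradingSpec:
    """Hypothetical stand-in holding the three fields described above."""
    prompt: str
    choice_strings: list[str]
    eval_type: str = "classify"

    def __post_init__(self) -> None:
        # Fail early on misconfigured entries instead of at grading time.
        if self.eval_type not in VALID_EVAL_TYPES:
            raise ValueError(f"unsupported eval_type: {self.eval_type!r}")
        if not self.choice_strings:
            raise ValueError("choice_strings must not be empty")


def spec_from_mapping(raw: dict[str, Any]) -> GradingSpec:
    """Build a GradingSpec from a dict such as a parsed YAML entry."""
    choices = raw["choice_strings"]
    if isinstance(choices, str):
        # Simplification: string-valued choice_strings are out of scope here.
        raise TypeError("this sketch only handles list-valued choice_strings")
    return GradingSpec(
        prompt=raw["prompt"],
        choice_strings=list(choices),
        eval_type=raw.get("eval_type", "classify"),
    )


# Hypothetical entry for a nuanced, multi-criteria grading task.
raw_entry = {
    "prompt": "Is the response helpful, accurate, and well-structured?",
    "choice_strings": ["Y", "N"],
    "eval_type": "cot_classify",
}
spec = spec_from_mapping(raw_entry)
```

Validating the eval type when the spec is loaded means a typo in the configuration surfaces immediately rather than after grading has already consumed tokens.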
Code Evidence
ModelGradedSpec definition from `evals/elsuite/modelgraded/base.py:11-26`:
@dataclass
class ModelGradedSpec:
    prompt: Union[str, list[dict[str, str]]]
    choice_strings: Union[list[str], str]
    input_outputs: dict[str, str] = field(default_factory=dict)
    eval_type: str = "classify"
    # ...
Classify function from `evals/elsuite/modelgraded/classify_utils.py:51-87`:
def classify(
    model_spec: ModelSpec,
    prompt: OpenAICreateChatPrompt,
    choice_strings: Union[list[str], str],
    eval_type: str,
    # ...
):
    # eval_type determines whether to use direct classify or cot_classify