Heuristic:Openai Evals Model Graded Eval Design

Knowledge Sources	OpenAI Evals Eval Templates
Domains	LLM_Evaluation, Optimization
Last Updated	2026-02-14 10:00 GMT

Overview

Design guidance for choosing between `classify` and `cot_classify` eval types when building model-graded evaluations, balancing accuracy against cost.

Description

The OpenAI Evals framework provides two primary eval types for model-graded evaluations: `classify` (direct classification) and `cot_classify` (chain-of-thought classification). The `classify` type asks the grading model to directly output a classification label, while `cot_classify` first prompts the model to reason step-by-step before giving its classification. The `classify` function in `evals/elsuite/modelgraded/classify_utils.py` implements both modes, with `cot_classify` appending a chain-of-thought instruction to the evaluation prompt.
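
The difference between the two modes can be sketched as follows. This is a minimal illustration and not the code in `classify_utils.py`: the helper name `build_grading_prompt` and both instruction strings are assumptions made for this example.

```python
# Hypothetical sketch of how the two eval types might differ in prompt construction.
# The instruction texts are illustrative, not the framework's actual strings.

DIRECT_INSTRUCTION = "Answer by printing only a single choice from {choices}."
COT_INSTRUCTION = (
    "First, write out your reasoning step by step. "
    "Then, on a new line, print only a single choice from {choices}."
)


def build_grading_prompt(base_prompt: str, choice_strings: list[str], eval_type: str) -> str:
    """Append the answer instruction that matches the eval type."""
    choices = " or ".join(f'"{c}"' for c in choice_strings)
    if eval_type == "classify":
        instruction = DIRECT_INSTRUCTION
    elif eval_type == "cot_classify":
        instruction = COT_INSTRUCTION
    else:
        raise ValueError(f"unsupported eval_type: {eval_type}")
    return f"{base_prompt}\n\n{instruction.format(choices=choices)}"
```

In this sketch the grading criteria stay identical across both modes; only the trailing instruction changes, which is why switching modes mainly affects the length of the grader's output.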

Usage

Use this heuristic when designing a new model-graded eval and choosing between the `classify` and `cot_classify` eval types, or when tuning the cost-accuracy trade-off of an existing model-graded eval.

The Insight (Rule of Thumb)

Action: Use `classify` for straightforward grading tasks where the correct label is obvious from the output.
Value: Lower cost (fewer tokens) and faster execution.
Trade-off: May produce less accurate grades for nuanced or complex evaluation criteria.

Action: Use `cot_classify` for complex grading tasks requiring nuanced judgment.
Value: Higher grading accuracy due to chain-of-thought reasoning.
Trade-off: Higher cost (more output tokens) and slower execution, at roughly 2-3x the token usage of `classify`.

Action: Choose the grading model carefully. It should be at least as capable as the model being evaluated, so use a stronger model (e.g., gpt-4) as the grader when evaluating weaker models.
Value: More reliable grades, since the grader is less likely to be outmatched by the responses it judges.
Trade-off: Stronger grading models cost more per evaluation sample.
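
The cost side of these trade-offs can be estimated before running an eval. The sketch below is a rough calculator only: `GraderProfile` and `estimate_grading_cost` are hypothetical names, and every token count and price in it is a placeholder assumption, not a measurement or published pricing.

```python
from dataclasses import dataclass


@dataclass
class GraderProfile:
    """Assumed per-sample token usage and prices for one grading setup."""
    prompt_tokens: int            # tokens sent to the grader per sample
    completion_tokens: int        # tokens the grader generates per sample
    usd_per_1k_prompt: float      # placeholder price, not real pricing
    usd_per_1k_completion: float  # placeholder price, not real pricing


def estimate_grading_cost(profile: GraderProfile, num_samples: int) -> float:
    """Return the estimated total grading cost in USD."""
    per_sample = (
        profile.prompt_tokens / 1000 * profile.usd_per_1k_prompt
        + profile.completion_tokens / 1000 * profile.usd_per_1k_completion
    )
    return per_sample * num_samples


# Illustrative numbers only: direct classification emits a short label,
# while chain-of-thought emits reasoning before the label.
direct = GraderProfile(prompt_tokens=200, completion_tokens=2,
                       usd_per_1k_prompt=0.01, usd_per_1k_completion=0.03)
cot = GraderProfile(prompt_tokens=210, completion_tokens=300,
                    usd_per_1k_prompt=0.01, usd_per_1k_completion=0.03)

direct_cost = estimate_grading_cost(direct, num_samples=1000)
cot_cost = estimate_grading_cost(cot, num_samples=1000)
print(f"classify: ${direct_cost:.2f}, cot_classify: ${cot_cost:.2f}")
```

Substituting measured token counts from a small pilot run and the current prices of the chosen grader gives a more realistic comparison than these placeholders.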

Reasoning

Chain-of-thought (CoT) prompting is well established as a way to improve LLM reasoning accuracy. In model-graded evaluations, the grading model must make a judgment call about the quality of a response. For simple factual checks (e.g., "Did the response contain the correct answer?"), direct classification suffices. For subjective or multi-criteria evaluations (e.g., "Is this response helpful, accurate, and well-structured?"), CoT reasoning helps the grading model weigh multiple factors before rendering judgment.

The `ModelGradedSpec` dataclass captures the evaluation prompt template, the valid choice strings, and the eval type (`eval_type`), which selects between direct and chain-of-thought classification. This configuration is stored in YAML files under `evals/registry/modelgraded/`.
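
As an illustration, a spec for a simple correctness check might carry values like the following. The stand-in dataclass `SpecSketch` mirrors only the fields quoted under Code Evidence; the prompt text, choice strings, and `input_outputs` mapping are invented for this example and are not taken from the registry.

```python
from dataclasses import dataclass, field, replace
from typing import Union


@dataclass
class SpecSketch:
    """Stand-in with the same field names as the quoted ModelGradedSpec excerpt."""
    prompt: Union[str, list[dict[str, str]]]
    choice_strings: Union[list[str], str]
    input_outputs: dict[str, str] = field(default_factory=dict)
    eval_type: str = "classify"


# Hypothetical spec values: the kind of data a YAML entry would hold.
fact_check = SpecSketch(
    prompt="Question: {input}\nSubmitted answer: {completion}\nIs the answer correct?",
    choice_strings=["Y", "N"],
    input_outputs={"input": "completion"},
)

# In this sketch, moving to chain-of-thought grading changes a single field.
fact_check_cot = replace(fact_check, eval_type="cot_classify")
```

Keeping the prompt and choices fixed while varying only `eval_type` makes it straightforward to run both variants on the same samples and compare grades against cost.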

Code Evidence

ModelGradedSpec definition from `evals/elsuite/modelgraded/base.py:11-26`:

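# Imports needed to run this excerpt on its own
from dataclasses import dataclass, field
from typing import Union

@dataclass
class ModelGradedSpec:
    prompt: Union[str, list[dict[str, str]]]       # grading prompt template
    choice_strings: Union[list[str], str]          # valid labels the grader may return
    input_outputs: dict[str, str] = field(default_factory=dict)
    eval_type: str = "classify"                    # "classify" or "cot_classify"
    # ...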

Classify function from `evals/elsuite/modelgraded/classify_utils.py:51-87`:

def classify(
    model_spec: ModelSpec,
    prompt: OpenAICreateChatPrompt,
    choice_strings: Union[list[str], str],
    eval_type: str,
    # ...
):
    # eval_type determines whether to use direct classify or cot_classify
    ...  # body elided in this excerpt
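
Once the grader responds, its label has to be extracted from the completion. The following is a hedged sketch of one way to do that and not the logic in `classify_utils.py`: `extract_choice` is a hypothetical helper that assumes a direct classification leads with the label, while a chain-of-thought completion puts the label on its last non-empty line.

```python
from typing import Optional


def extract_choice(completion: str, choice_strings: list[str], eval_type: str) -> Optional[str]:
    """Return the matched choice string, or None if no valid choice is found."""
    text = completion.strip()
    if not text:
        return None
    lines = text.splitlines()
    if eval_type == "cot_classify":
        # Assumption: reasoning comes first and the label is on the last line.
        candidate = lines[-1]
    else:
        # Assumption: a direct classification leads with the label.
        candidate = lines[0]
    # Drop surrounding whitespace, quotes, and trailing punctuation.
    candidate = candidate.strip("\"'.,:; \t")
    return candidate if candidate in choice_strings else None


print(extract_choice("Y", ["Y", "N"], "classify"))
print(extract_choice("The answer matches the reference.\n\nN", ["Y", "N"], "cot_classify"))
```

Returning `None` for unparseable completions, instead of guessing a label, keeps invalid grader outputs visible so they can be counted separately.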
