Implementation:Openai Evals Classify Function

Knowledge Sources	OpenAI Evals
Domains	Evaluation, LLM_as_Judge
Last Updated	2026-02-14 10:00 GMT

Overview

Concrete function in the classify_utils module that runs model-graded classification for one sample, given a ModelGradedSpec and a grading completion function.

Description

The classify function executes a model-graded evaluation for a single sample. It takes a ModelGradedSpec, appends the answer prompt that matches eval_type, fills the prompt template's placeholders from format_kwargs, invokes the grading completion function, parses the response to extract a choice string, and maps that choice to a numeric score. It returns the chosen classification together with a metadata dict holding the score, the sampled text, the prompt, and an invalid_choice flag.
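
The flow above can be pictured with a small standalone sketch that does not import evals. Everything in it is a simplification assumed for illustration: `classify_sketch`, `ANSWER_SUFFIX`, and `fake_grader` are hypothetical names, the answer-prompt wording is invented, the grader is a plain text-to-text function, and a line must equal a choice exactly instead of going through one of the match_fn modes.

```python
from typing import Callable, Optional

INVALID = "__invalid__"  # same sentinel string this page lists for unparsable answers

# Hypothetical answer-prompt wording; the real ANSWER_PROMPTS text differs.
ANSWER_SUFFIX = {
    "classify": "Answer with only one of: {choices}.",
    "cot_classify": "Reason step by step, then give one of {choices} alone on the last line.",
}


def classify_sketch(
    template: str,
    choice_strings: list[str],
    choice_scores: Optional[dict[str, float]],
    grader: Callable[[str], str],
    format_kwargs: dict[str, str],
    eval_type: str = "classify",
) -> tuple[str, dict]:
    # 1. Build the answer prompt for this eval_type and append it to the template.
    suffix = ANSWER_SUFFIX[eval_type].format(choices=" or ".join(choice_strings))
    # 2. Fill the template placeholders (e.g. {input}, {ideal}, {completion}).
    prompt = (template + "\n\n" + suffix).format(**format_kwargs)
    # 3. Ask the grading model (here: any function from prompt text to reply text).
    sampled = grader(prompt)
    # 4. Parse: find a line equal to a choice; cot_classify answers last, so scan bottom-up.
    lines = [line.strip() for line in sampled.strip().splitlines()]
    if eval_type == "cot_classify":
        lines.reverse()
    choice = next((line for line in lines if line in choice_strings), INVALID)
    # 5. Map the choice to a score; None if no scores are defined or the choice is invalid.
    score = choice_scores.get(choice) if choice_scores is not None else None
    return choice, {
        "score": score,
        "sampled": [sampled],
        "prompt": prompt,
        "invalid_choice": choice == INVALID,
    }


def fake_grader(prompt: str) -> str:
    # Stand-in for the grading model: reasoning first, verdict on the last line.
    return "The submission states Paris, matching the reference.\nYes"


choice, info = classify_sketch(
    template="Question: {input}\nReference: {ideal}\nSubmission: {completion}",
    choice_strings=["Yes", "No"],
    choice_scores={"Yes": 1.0, "No": 0.0},
    grader=fake_grader,
    format_kwargs={"input": "What is the capital of France?", "ideal": "Paris", "completion": "Paris."},
    eval_type="cot_classify",
)
print(choice, info["score"], info["invalid_choice"])  # Yes 1.0 False
```

Keeping the grader as an injected function is what makes the pipeline testable offline; the real classify similarly receives its grader as completion_fn.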

Usage

Called internally by ModelBasedClassify.eval_sample for each test sample. Can also be used directly for custom model-graded evaluation logic.
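
When classify is driven directly, the per-sample results still need to be aggregated. The sketch below shows one hypothetical way to do that; `summarize`, `grade_one`, and `fake_grade` are illustrative names, and `grade_one` stands in for a call to classify (anything returning a `(choice, info)` pair with the `score` and `invalid_choice` keys listed under Outputs). It is not taken from ModelBasedClassify.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Iterable, Optional

# A grading function returns (choice, info), like classify does.
GradeFn = Callable[[dict], tuple[str, dict]]


def summarize(samples: Iterable[dict], grade_one: GradeFn) -> dict:
    counts: Counter = Counter()
    scores: list[float] = []
    invalid = 0
    for sample in samples:
        choice, info = grade_one(sample)
        counts[choice] += 1  # distribution of verdicts
        invalid += bool(info["invalid_choice"])
        if info["score"] is not None:  # skip samples without a numeric score
            scores.append(info["score"])
    total = sum(counts.values())
    mean_score: Optional[float] = mean(scores) if scores else None
    return {
        "counts": dict(counts),
        "mean_score": mean_score,
        "invalid_rate": invalid / total if total else 0.0,
    }


def fake_grade(sample: dict) -> tuple[str, dict]:
    # Hypothetical stand-in for classify(mg=..., completion_fn=..., format_kwargs=sample).
    if sample["completion"] == sample["ideal"]:
        return "Yes", {"score": 1.0, "invalid_choice": False}
    return "No", {"score": 0.0, "invalid_choice": False}


samples = [
    {"ideal": "Paris", "completion": "Paris"},
    {"ideal": "Paris", "completion": "Lyon"},
]
print(summarize(samples, fake_grade))
# {'counts': {'Yes': 1, 'No': 1}, 'mean_score': 0.5, 'invalid_rate': 0.0}
```

Tracking the invalid rate separately from the mean score keeps unparsable grader replies visible instead of silently folding them into the average.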

Code Reference

Source Location

Repository: openai/evals
File: evals/elsuite/modelgraded/classify_utils.py (lines 51-87)

Signature

def classify(
    mg: ModelGradedSpec,
    completion_fn: CompletionFn,
    completion_kwargs: Optional[dict[str, Any]] = None,
    format_kwargs: Optional[dict[str, Any]] = None,
    eval_type: Optional[str] = None,
    n: Optional[int] = None,
    match_fn: str = "starts_or_endswith",
) -> tuple[str, dict]:
    """
    Run model-graded classification for a single sample.

    Args:
        mg: ModelGradedSpec with prompt template and choice configuration.
        completion_fn: CompletionFn for the grading model.
        completion_kwargs: Extra kwargs for the completion call (e.g. max_tokens).
        format_kwargs: Values to fill prompt template placeholders.
        eval_type: Override eval_type from spec ("classify", "classify_cot", "cot_classify").
        n: Number of completions (for multi-completion specs).
        match_fn: Override match function ("include", "exact", "endswith", "starts_or_endswith").

    Returns:
        Tuple of (choice_string, info_dict) where info_dict contains
        score, sampled text, prompt, and invalid_choice flag.
    """

Import

from evals.elsuite.modelgraded.classify_utils import classify, ANSWER_PROMPTS, MATCH_FNS
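
MATCH_FNS maps the match_fn names listed in the docstring to predicates. A hypothetical re-statement follows, assuming each predicate receives a cleaned response line and one candidate choice; the exact definitions in classify_utils may differ, so inspect MATCH_FNS itself before relying on edge cases.

```python
# Hypothetical equivalents of the four match_fn modes; not copied from classify_utils.
MATCH_FNS_SKETCH = {
    "include": lambda line, choice: choice in line,
    "exact": lambda line, choice: line == choice,
    "endswith": lambda line, choice: line.endswith(choice),
    "starts_or_endswith": lambda line, choice: line.startswith(choice) or line.endswith(choice),
}

for line in ["Yes", "Yes it is", "The answer is Yes clearly"]:
    hits = [name for name, fn in MATCH_FNS_SKETCH.items() if fn(line, "Yes")]
    print(f"{line!r}: {hits}")
# 'Yes': ['include', 'exact', 'endswith', 'starts_or_endswith']
# 'Yes it is': ['include', 'starts_or_endswith']
# 'The answer is Yes clearly': ['include']
```

In this sketch include is the loosest mode and exact the strictest; starts_or_endswith, the documented default, accepts a verdict at either edge of a line.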

I/O Contract

Inputs

Name	Type	Required	Description
mg	ModelGradedSpec	Yes	Evaluation specification with prompt, choices, scoring
completion_fn	CompletionFn	Yes	Grading model completion function
completion_kwargs	dict	No	Extra kwargs for completion call
format_kwargs	dict	No	Values for template placeholders
eval_type	str	No	Classification strategy override
n	int	No	Number of completions (for multi-completion specs)
match_fn	str	No	Match function override (default "starts_or_endswith")
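
The n parameter in the signature matters for specs that compare several completions at once, where the natural choices are the candidate indices. Assuming, hypothetically, that such a spec marks its choices with a placeholder (written here as "from_n") instead of listing them, resolving the choices could look like this; both the placeholder and the `resolve_choices` helper are illustrative, not confirmed by this page.

```python
from typing import Optional, Union


def resolve_choices(spec_choices: Union[list[str], str], n: Optional[int] = None) -> list[str]:
    # Hypothetical placeholder: number the n candidate completions "1".."n".
    if spec_choices == "from_n":
        if n is None:
            raise ValueError('n is required when choices are "from_n"')
        return [str(i + 1) for i in range(n)]
    # Otherwise the spec lists its choices explicitly
    # (a string such as "ABC" becomes one choice per character).
    return list(spec_choices)


print(resolve_choices("from_n", n=3))  # ['1', '2', '3']
print(resolve_choices(["Yes", "No"]))  # ['Yes', 'No']
```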

Outputs

Name	Type	Description
choice	str	Selected choice string or "__invalid__" if parsing failed
info	dict	Contains: score (float or None), sampled (list[str]), prompt, invalid_choice (bool)
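
The score field comes from mapping the chosen string through the spec's scores. A minimal sketch, assuming the scores are a plain dict from choice string to float and, as a labeled assumption, that an unparsable answer receives the lowest available score; `score_choice` and `build_info` are hypothetical names.

```python
from typing import Any, Optional

INVALID_STR = "__invalid__"


def score_choice(choice: str, choice_scores: Optional[dict[str, float]]) -> Optional[float]:
    # No score mapping: the eval only classifies, so score is None.
    if choice_scores is None:
        return None
    # Assumption: an invalid answer is scored as the worst available choice.
    if choice == INVALID_STR:
        return min(choice_scores.values())
    return choice_scores[choice]


def build_info(choice: str, sampled: str, prompt: Any,
               choice_scores: Optional[dict[str, float]]) -> dict[str, Any]:
    # Same keys as the Outputs table: score, sampled, prompt, invalid_choice.
    return {
        "score": score_choice(choice, choice_scores),
        "sampled": [sampled],
        "prompt": prompt,
        "invalid_choice": choice == INVALID_STR,
    }


scores = {"Yes": 1.0, "No": 0.0}
print(build_info("Yes", "Yes", "Is it Paris?", scores)["score"])  # 1.0
print(build_info(INVALID_STR, "Unsure", "Is it Paris?", scores))
# {'score': 0.0, 'sampled': ['Unsure'], 'prompt': 'Is it Paris?', 'invalid_choice': True}
```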

Usage Examples

Direct Classification

from evals.elsuite.modelgraded.classify_utils import classify
from evals.registry import Registry

registry = Registry()
mg = registry.get_modelgraded_spec("fact")
completion_fn = registry.make_completion_fn("gpt-4")

choice, info = classify(
    mg=mg,
    completion_fn=completion_fn,
    format_kwargs={
        "input": "What is the capital of France?",
        "ideal": "Paris",
        "completion": "The capital of France is Paris.",
    },
    eval_type="cot_classify",
)

print(f"Choice: {choice}")  # "Yes"
print(f"Score: {info['score']}")  # 1.0

Related Pages

Implements Principle

Principle:Openai_Evals_Eval_Type_Configuration

Uses Heuristic

Heuristic:Openai_Evals_Model_Graded_Eval_Design
