Implementation:Openai Evals ModelBasedClassify
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, LLM_as_Judge |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
Concrete eval class for running LLM-as-judge evaluations using configurable model-graded specifications provided by the evals modelgraded module.
Description
The ModelBasedClassify class extends Eval to implement model-graded evaluation. It loads a ModelGradedSpec from the registry, runs the subject model to generate completions, then uses a grading model (the last completion_fn if multiple are provided) to classify each completion. Per-sample results include the choice string and numeric score. The run method aggregates results into per-choice counts and an average score. Optional metaeval mode compares grading decisions against ground-truth labels.
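The aggregation step described above can be sketched as follows. This is a minimal illustration, not the library's code: the `aggregate` function and the per-sample record layout (`"choice"` and `"score"` keys) are assumptions made for the example, and only the `counts/<choice>` and `score` output keys are taken from this page.

```python
from collections import Counter


def aggregate(results: list[dict]) -> dict:
    """Summarise per-sample grading results (hypothetical record layout)."""
    # Count how often the grader picked each choice string.
    counts = Counter(r["choice"] for r in results)
    report = {f"counts/{choice}": n for choice, n in counts.items()}

    # Average the numeric scores, skipping samples that have none.
    scores = [r["score"] for r in results if r.get("score") is not None]
    if scores:
        report["score"] = sum(scores) / len(scores)
    return report


# Four illustrative graded samples.
results = [
    {"choice": "Yes", "score": 1.0},
    {"choice": "No", "score": 0.0},
    {"choice": "Yes", "score": 1.0},
    {"choice": "Yes", "score": 1.0},
]
report = aggregate(results)
```

With these four samples the report holds three "Yes" counts, one "No" count, and an average score of 0.75.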
Usage
Use ModelBasedClassify when registering model-graded evals in YAML. Reference it in the registry entry's class field as "evals.elsuite.modelgraded.classify:ModelBasedClassify" (module path, then a colon, then the class name), and pass a modelgraded_spec argument naming the spec to load.
Code Reference
Source Location
- Repository: openai/evals
- File: evals/elsuite/modelgraded/classify.py (lines 14-127)
Signature
    class ModelBasedClassify(evals.Eval):
        def __init__(
            self,
            modelgraded_spec: str,
            *args,
            modelgraded_spec_args: Optional[dict[str, dict[str, str]]] = None,
            sample_kwargs: Optional[dict[str, Any]] = None,
            eval_kwargs: Optional[dict[str, Any]] = None,
            multicomp_n: Union[int, str] = 1,
            eval_type: Optional[str] = None,
            match_fn: Optional[str] = None,
            metaeval: bool = False,
            **kwargs,
        ):
            """
            Args:
                modelgraded_spec: Name of spec YAML in evals/registry/modelgraded/ (e.g. "fact").
                modelgraded_spec_args: Extra args merged into format_kwargs.
                sample_kwargs: Kwargs for subject model completion (default max_tokens=1024).
                eval_kwargs: Kwargs for grading model completion (default max_tokens=1024).
                multicomp_n: Number of completions to compare (1 or "from_models").
                eval_type: Override eval_type from spec.
                match_fn: Override match function.
                metaeval: Enable meta-evaluation against ground-truth labels.
            """

        def eval_sample(self, test_sample: dict, rng: Random) -> str:
            """Evaluate single sample: generate completion, classify, record metrics."""

        def run(self, recorder) -> dict:
            """Run all samples and aggregate choice counts and average score."""
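The docstring gives `max_tokens=1024` as the default for both `sample_kwargs` and `eval_kwargs`. One plausible reading, sketched below, is that caller-supplied values are layered on top of that default. The helper name `with_defaults` and the override-wins behaviour are assumptions for illustration and are not confirmed by this page.

```python
from typing import Any, Optional

# Default taken from the docstring above; everything else is hypothetical.
DEFAULT_KWARGS: dict[str, Any] = {"max_tokens": 1024}


def with_defaults(overrides: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Return completion kwargs with caller overrides applied over the default."""
    merged = dict(DEFAULT_KWARGS)   # copy so the shared default is never mutated
    merged.update(overrides or {})  # assumed: caller-supplied values win
    return merged


sample_kwargs = with_defaults({"temperature": 0})
eval_kwargs = with_defaults({"max_tokens": 256})
```

Under this assumption, `sample_kwargs` keeps the 1024-token limit and gains `temperature`, while `eval_kwargs` replaces the limit with 256.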
Import
from evals.elsuite.modelgraded.classify import ModelBasedClassify
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| modelgraded_spec | str | Yes | Name of model-graded spec (e.g. "fact", "closedqa") |
| completion_fns | list[CompletionFn] | Yes | Subject model(s); last one used as grading model if multiple |
| samples_jsonl | str | Yes (via args) | Path to JSONL dataset |
| eval_type | str | No | Classification strategy override |
| multicomp_n | Union[int, str] | No | Number of completions (default 1) |
| metaeval | bool | No | Enable meta-evaluation (default False) |
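The completion_fns row says the last entry doubles as the grading model when several are given. A minimal sketch of that selection rule follows. The function `split_completion_fns` is hypothetical, and plain strings stand in for real CompletionFn objects.

```python
def split_completion_fns(completion_fns: list) -> tuple[list, object]:
    """Split a list of completion functions into (subjects, grader)."""
    if not completion_fns:
        raise ValueError("at least one completion function is required")

    # The last entry always acts as the grader.
    grader = completion_fns[-1]

    # With several entries, the earlier ones are the subject models;
    # with a single entry, it serves as both subject and grader.
    if len(completion_fns) > 1:
        subjects = list(completion_fns[:-1])
    else:
        subjects = list(completion_fns)
    return subjects, grader


single = split_completion_fns(["gpt-3.5-turbo"])
pair = split_completion_fns(["gpt-3.5-turbo", "gpt-4"])
```

Here `single` uses gpt-3.5-turbo in both roles, while `pair` grades gpt-3.5-turbo completions with gpt-4.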
Outputs
| Name | Type | Description |
|---|---|---|
| run() returns | dict | {"score": float, "counts/Yes": int, "counts/No": int, ...} and optionally "metascore" |
Usage Examples
YAML Registration
    # In evals/registry/evals/my_graded.yaml
    my-fact-eval:
      id: my-fact-eval.dev.v0
      metrics: [score]
    my-fact-eval.dev.v0:
      class: evals.elsuite.modelgraded.classify:ModelBasedClassify
      args:
        samples_jsonl: my_data/facts.jsonl
        modelgraded_spec: fact
        eval_type: cot_classify
Running via CLI
# Subject model is gpt-3.5-turbo, grading model defaults to same
oaieval gpt-3.5-turbo my-fact-eval
# Use different subject and grading models (comma-separated)
oaieval gpt-3.5-turbo,gpt-4 my-fact-eval
# With meta-evaluation
oaieval gpt-3.5-turbo my-fact-eval --extra_eval_params "metaeval=True"