Implementation:Openai Evals ModelBasedClassify

From Leeroopedia
Revision as of 13:34, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Openai_Evals_ModelBasedClassify.md)
Knowledge Sources
Domains: Evaluation, LLM_as_Judge
Last Updated: 2026-02-14 10:00 GMT

Overview

ModelBasedClassify is a concrete eval class that runs LLM-as-judge evaluations driven by configurable model-graded specifications from the evals modelgraded module.

Description

The ModelBasedClassify class extends Eval to implement model-graded evaluation. It loads a ModelGradedSpec from the registry, runs the subject model to generate completions, then uses a grading model (the last completion_fn if multiple are provided) to classify each completion. Per-sample results include the choice string and numeric score. The run method aggregates results into per-choice counts and an average score. Optional metaeval mode compares grading decisions against ground-truth labels.
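The division of roles described above can be illustrated with a minimal, library-free sketch. It is not the code in classify.py: split_roles, grade_sample, the prompt wording, and the sample keys "input" and "choice" are hypothetical names chosen for illustration, and a completion function is modeled as a plain callable from prompt to text.

```python
from typing import Callable, Optional, Sequence

# Hypothetical stand-in for a completion function: prompt text in, completion text out.
CompletionFn = Callable[[str], str]


def split_roles(completion_fns: Sequence[CompletionFn]):
    """Return (subject_fns, grader_fn); the grader is always the last entry."""
    if not completion_fns:
        raise ValueError("at least one completion function is required")
    grader = completion_fns[-1]
    # With a single function, the same model both answers and grades.
    subjects = list(completion_fns[:-1]) or [grader]
    return subjects, grader


def grade_sample(
    sample: dict,
    subject: CompletionFn,
    grader: CompletionFn,
    choice_scores: dict,
    metaeval: bool = False,
) -> dict:
    """Generate a completion, ask the grader to classify it, return per-sample metrics."""
    completion = subject(sample["input"])
    # Assumed grading prompt; real specs define their own templates.
    verdict = grader(f"Question: {sample['input']}\nAnswer: {completion}\nGrade:")
    choice = verdict.strip()
    score: Optional[float] = choice_scores.get(choice)
    metrics = {"choice": choice, "score": score}
    if metaeval:
        # Meta-evaluation: does the grader agree with the ground-truth label?
        metrics["metascore"] = choice == sample["choice"]
    return metrics


if __name__ == "__main__":
    subject_fn = lambda prompt: "Paris"
    grader_fn = lambda prompt: " Yes "
    subjects, judge = split_roles([subject_fn, grader_fn])
    print(grade_sample(
        {"input": "What is the capital of France?", "choice": "Yes"},
        subjects[0], judge, {"Yes": 1.0, "No": 0.0}, metaeval=True,
    ))
```

Treating the grader as simply "the last function" keeps the single-model case and the separate-grader case on the same code path.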

Usage

Use ModelBasedClassify when registering model-graded evals in YAML. Set the eval's class field to "evals.elsuite.modelgraded.classify.ModelBasedClassify" and pass a modelgraded_spec argument that names the spec to load.

Code Reference

Source Location

  • Repository: openai/evals
  • File: evals/elsuite/modelgraded/classify.py (lines 14-127)

Signature

from random import Random
from typing import Any, Optional, Union

import evals


class ModelBasedClassify(evals.Eval):
    def __init__(
        self,
        modelgraded_spec: str,
        *args,
        modelgraded_spec_args: Optional[dict[str, dict[str, str]]] = None,
        sample_kwargs: Optional[dict[str, Any]] = None,
        eval_kwargs: Optional[dict[str, Any]] = None,
        multicomp_n: Union[int, str] = 1,
        eval_type: Optional[str] = None,
        match_fn: Optional[str] = None,
        metaeval: bool = False,
        **kwargs,
    ):
        """
        Args:
            modelgraded_spec: Name of spec YAML in evals/registry/modelgraded/ (e.g. "fact").
            modelgraded_spec_args: Extra args merged into format_kwargs.
            sample_kwargs: Kwargs for subject model completion (default max_tokens=1024).
            eval_kwargs: Kwargs for grading model completion (default max_tokens=1024).
            multicomp_n: Number of completions to compare (an int, default 1, or "from_models").
            eval_type: Override eval_type from spec.
            match_fn: Override match function.
            metaeval: Enable meta-evaluation against ground-truth labels.
        """

    def eval_sample(self, test_sample: dict, rng: Random) -> str:
        """Evaluate single sample: generate completion, classify, record metrics."""

    def run(self, recorder) -> dict:
        """Run all samples and aggregate choice counts and average score."""

Import

from evals.elsuite.modelgraded.classify import ModelBasedClassify

I/O Contract

Inputs

Name | Type | Required | Description
modelgraded_spec | str | Yes | Name of model-graded spec (e.g. "fact", "closedqa")
completion_fns | list[CompletionFn] | Yes | Subject model(s); last one used as grading model if multiple
samples_jsonl | str | Yes (via args) | Path to JSONL dataset
eval_type | str | No | Classification strategy override
multicomp_n | Union[int, str] | No | Number of completions (default 1)
metaeval | bool | No | Enable meta-evaluation (default False)
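As a rough illustration of the samples_jsonl input, the sketch below writes and reads a small JSONL file with one JSON object per line. The field names are assumptions: "input" and "ideal" are used here as illustrative keys, and "choice" as a ground-truth label for meta-evaluation. The keys a real dataset needs depend on the prompt template of the chosen spec.

```python
import json
import os
import tempfile

# Hypothetical samples. The keys are assumptions for illustration only;
# the chosen model-graded spec determines which fields are actually required.
samples = [
    {"input": "What is the capital of France?", "ideal": "Paris", "choice": "Yes"},
    {"input": "What is 2 + 2?", "ideal": "4", "choice": "Yes"},
]


def write_jsonl(path: str, rows: list) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> list:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip through a temporary file to check the format.
path = os.path.join(tempfile.mkdtemp(), "facts.jsonl")
write_jsonl(path, samples)
loaded = read_jsonl(path)
print(len(loaded), "samples written to", path)
```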

Outputs

Name | Type | Description
run() returns | dict | {"score": float, "counts/Yes": int, "counts/No": int, ...} and optionally "metascore"
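The shape of this dict can be reproduced with a short aggregation sketch. This is an approximation under stated assumptions, not the library's run method: aggregate is a hypothetical helper, and each per-sample record is assumed to carry "choice", "score", and optionally "metascore".

```python
from collections import Counter


def aggregate(sample_metrics: list) -> dict:
    """Fold per-sample metrics into per-choice counts and averaged scores."""
    report: dict = {}
    if not sample_metrics:
        return report

    # One "counts/<choice>" entry per distinct grading choice.
    counts = Counter(m["choice"] for m in sample_metrics)
    report.update({f"counts/{choice}": n for choice, n in counts.items()})

    # Average only the samples that actually received a numeric score.
    scores = [m["score"] for m in sample_metrics if m.get("score") is not None]
    if scores:
        report["score"] = sum(scores) / len(scores)

    # "metascore" appears only when meta-evaluation recorded it.
    metascores = [m["metascore"] for m in sample_metrics if "metascore" in m]
    if metascores:
        report["metascore"] = sum(metascores) / len(metascores)
    return report


example = aggregate([
    {"choice": "Yes", "score": 1.0, "metascore": True},
    {"choice": "No", "score": 0.0, "metascore": False},
    {"choice": "Yes", "score": 1.0, "metascore": True},
])
print(example)
```

In this sketch, samples without a numeric score still contribute to the counts but are left out of the average, so an unscored choice does not pull the score toward zero.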

Usage Examples

YAML Registration

# In evals/registry/evals/my_graded.yaml
my-fact-eval:
  id: my-fact-eval.dev.v0
  metrics: [score]

my-fact-eval.dev.v0:
  class: evals.elsuite.modelgraded.classify.ModelBasedClassify
  args:
    samples_jsonl: my_data/facts.jsonl
    modelgraded_spec: fact
    eval_type: cot_classify

Running via CLI

# Subject model is gpt-3.5-turbo; with a single model, it also acts as the grader
oaieval gpt-3.5-turbo my-fact-eval

# Use different subject and grading models (comma-separated; the last one grades)
oaieval gpt-3.5-turbo,gpt-4 my-fact-eval

# With meta-evaluation
oaieval gpt-3.5-turbo my-fact-eval --extra_eval_params "metaeval=True"
