Implementation:Openai Evals Eval Base Class

From Leeroopedia
Knowledge Sources
Domains Evaluation, Software_Architecture
Last Updated 2026-02-14 10:00 GMT

Overview

Abstract base class, provided by the evals framework, for defining custom evaluations.

Description

The Eval class is the abstract base class that all evaluation implementations extend. It provides the infrastructure for parallel sample evaluation, data loading, recorder integration, and completion-function management. Subclasses must implement two abstract methods: eval_sample (per-sample logic) and run (overall orchestration). Samples are evaluated in parallel on a ThreadPool whose thread count is set by the EVALS_THREADS environment variable (default: 10). The companion SolverEval class extends Eval for stateful, solver-based evaluations.
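
How samples are dispatched can be pictured with the following simplified sketch. This is not the framework's actual implementation (the real eval_all_samples also handles progress display, raw-sample recording, and error handling), but it captures the core pattern: a ThreadPool sized by EVALS_THREADS, with a per-sample RNG derived from the eval seed so results stay deterministic regardless of thread scheduling.

import os
import random
from multiprocessing.pool import ThreadPool

def eval_all_samples_sketch(eval_obj, samples, seed=20220722):
    # Thread count comes from the environment, defaulting to 10.
    threads = int(os.environ.get("EVALS_THREADS", "10"))

    def eval_one(indexed_sample):
        idx, sample = indexed_sample
        # Seed each sample's RNG from the eval seed and the sample index,
        # so any shuffling or sampling inside eval_sample is reproducible.
        rng = random.Random(f"{seed}:{idx}")
        return eval_obj.eval_sample(sample, rng)

    with ThreadPool(threads) as pool:
        return list(pool.map(eval_one, list(enumerate(samples))))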

Usage

Subclass Eval when creating any custom evaluation, overriding the eval_sample and run methods. Use self.completion_fn to access the model, and the recorder helper functions to log results.
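
For solver-based evaluations, subclass SolverEval instead. Its per-sample hook receives the solver as an explicit argument, since solvers may carry state across turns. A minimal sketch of the subclass shape, assuming SolverEval's solver-first hook signature (the body is left as a placeholder):

import random
from typing import Any

from evals.eval import SolverEval
from evals.solvers.solver import Solver

class MySolverEval(SolverEval):
    def eval_sample(self, solver: Solver, sample: Any, rng: random.Random):
        # Build a task state from the sample, run the solver on it, and
        # log the outcome through the recorder helpers.
        ...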

Code Reference

Source Location

evals/eval.py in the openai/evals repository (matching the import path shown below).

Signature

class Eval(abc.ABC):
    def __init__(
        self,
        completion_fns: list[Union[CompletionFn, Solver]],
        eval_registry_path: Path,
        seed: int = 20220722,
        name: str = "no_name_eval.default",
        registry: Optional[Registry] = None,
        samples_jsonl: Optional[str] = None,
    ):
        """
        Args:
            completion_fns: List of CompletionFn or Solver instances to evaluate.
            eval_registry_path: Path to the registry directory for data lookup.
            seed: Random seed for deterministic shuffling (default: 20220722).
            name: Eval name in format "base_eval.split" (e.g. "my-eval.dev").
            registry: Optional Registry instance for spec lookups.
            samples_jsonl: Optional path to default JSONL dataset.
        """

    @abc.abstractmethod
    def eval_sample(self, sample: Any, rng: random.Random):
        """Evaluate a single sample. Must be implemented by subclasses."""

    @abc.abstractmethod
    def run(self, recorder: RecorderBase) -> Dict[str, float]:
        """Run the evaluation. Must be implemented by subclasses."""

    def eval_all_samples(
        self,
        recorder: RecorderBase,
        samples,
        show_progress=True,
        record_raw_sample=True,
        **_kwargs: Any,
    ):
        """Evaluate all samples in parallel using ThreadPool."""

    def get_samples(self) -> list[dict]:
        """Load samples from self.samples_jsonl."""

    @property
    def completion_fn(self) -> CompletionFn:
        """Helper for ergonomic access to a single CompletionFn."""

Import

from evals.eval import Eval, SolverEval
from evals.record import RecorderBase

I/O Contract

Inputs

Name Type Required Description
completion_fns list[Union[CompletionFn, Solver]] Yes Model(s) or solver(s) to evaluate
eval_registry_path Path Yes Registry path for resolving data file paths
seed int No Random seed (default 20220722)
name str No Eval name in "base_eval.split" format
registry Registry No Optional Registry instance for spec lookups
samples_jsonl str No Path to JSONL dataset
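
Tying the inputs together, a direct instantiation might look like the sketch below. MyEval is a hypothetical subclass standing in for any concrete eval; in normal use the oaieval CLI constructs the object from its registry YAML entry rather than by hand.

from pathlib import Path
from evals.registry import Registry

registry = Registry()
completion_fn = registry.make_completion_fn("gpt-3.5-turbo")

# MyEval is a hypothetical Eval subclass used only for illustration.
my_eval = MyEval(
    completion_fns=[completion_fn],
    eval_registry_path=Path("evals/registry"),
    name="my-eval.dev",
    samples_jsonl="my_data/samples.jsonl",
)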

Outputs

Name Type Description
run() return value Dict[str, float] Aggregated metrics (e.g. {"accuracy": 0.85, "bootstrap_std": 0.02})
Recorded events RecorderBase events Match, sampling, and metric events logged during execution

Usage Examples

Minimal Custom Eval

import evals
import evals.metrics
from evals.eval import Eval
from evals.record import RecorderBase

class ArithmeticEval(Eval):
    def __init__(self, completion_fns, samples_jsonl, *args, **kwargs):
        super().__init__(completion_fns, *args, **kwargs)
        self.samples_jsonl = samples_jsonl

    def eval_sample(self, sample, rng):
        # Query the model once per sample and record whether the
        # completion matches the expected answer.
        prompt = sample["input"]
        result = self.completion_fn(prompt=prompt, temperature=0.0)
        sampled = result.get_completions()[0]
        evals.record_and_check_match(
            prompt=prompt,
            sampled=sampled,
            expected=sample["ideal"],
        )

    def run(self, recorder: RecorderBase):
        # Evaluate every sample in parallel, then aggregate the recorded
        # match events into a single accuracy metric.
        samples = self.get_samples()
        self.eval_all_samples(recorder, samples)
        events = recorder.get_events("match")
        return {
            "accuracy": evals.metrics.get_accuracy(events),
        }
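
Given the fields eval_sample reads, each dataset line is a JSON object with input and ideal keys. A plausible my_data/arithmetic.jsonl follows; whether input is a plain string or a chat-format message list depends on the completion function in use.

{"input": [{"role": "user", "content": "What is 2 + 2?"}], "ideal": "4"}
{"input": [{"role": "user", "content": "What is 48 / 6?"}], "ideal": "8"}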

Registration for Custom Eval

# In evals/registry/evals/my_eval.yaml
arithmetic:
  id: arithmetic.dev.v0
  metrics: [accuracy]

arithmetic.dev.v0:
  class: my_evals.arithmetic:ArithmeticEval
  args:
    samples_jsonl: my_data/arithmetic.jsonl
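
With the registry entry in place, the eval can be run by name through the oaieval CLI, e.g. oaieval gpt-3.5-turbo arithmetic; if the YAML lives outside the default registry directory, point the CLI at it with --registry_path.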

Related Pages

Implements Principle

Uses Heuristic
