Principle: OpenAI Evals Custom Eval Implementation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Software_Architecture |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
An abstract class pattern that defines the contract for implementing custom evaluation logic through method overriding.
Description
Custom Eval Implementation is the process of creating a new evaluation by subclassing the Eval abstract base class and implementing two required methods: eval_sample, which evaluates a single test case, and run, which orchestrates the full evaluation (data loading, sample iteration, and result aggregation). The base class provides built-in support for parallel execution via eval_all_samples (backed by a ThreadPool), deterministic shuffling, sample limiting, and integration with the recording system. A second base class, SolverEval, extends this pattern to evaluations that require stateful, multi-turn interactions via the Solver interface, as sketched below.
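For the multi-turn case, the subclass implements eval_sample with an extra solver argument; the framework hands each sample its own copy of the Solver so conversation state cannot leak across samples. A minimal sketch, assuming the TaskState and Message interfaces of the solvers framework (the class name, task description, and input/ideal field names are illustrative; run is written the same way as in the Eval sketch under Theoretical Basis):

```python
import random
from typing import Any

import evals.record
from evals.eval import SolverEval
from evals.solvers.solver import Solver
from evals.task_state import Message, TaskState


class MultiTurnEval(SolverEval):
    def eval_sample(self, solver: Solver, sample: Any, rng: random.Random):
        # Build the initial task state for this sample's conversation.
        task_state = TaskState(
            task_description="Answer the question in one word.",
            messages=[Message(role="user", content=sample["input"])],
        )
        # The solver may take multiple internal turns before returning.
        result = solver(task_state)
        # Record a per-sample metric for run() to aggregate later.
        evals.record.record_metrics(
            correct=(result.output.strip() == sample["ideal"])
        )
```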
Usage
Implement a custom eval when none of the built-in templates (Match, Includes, FuzzyMatch) fits the evaluation requirements. Common cases include custom scoring logic, multi-step reasoning, and specialized prompt construction.
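A custom eval must also be registered before the oaieval CLI can run it by name. A minimal sketch of the registry YAML, with illustrative eval name, module path, and samples path:

```yaml
# Registered under evals/registry/evals/ (file name is illustrative)
my-custom-eval:
  id: my-custom-eval.dev.v0
  metrics: [accuracy]

my-custom-eval.dev.v0:
  class: evals.elsuite.my_custom_eval:CustomEval
  args:
    samples_jsonl: my_custom_eval/samples.jsonl
```

With this entry in place, the eval can be run with, for example, `oaieval gpt-3.5-turbo my-custom-eval`.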
Theoretical Basis
The evaluation pattern follows the Template Method design pattern:
- The Eval base class defines the skeleton algorithm in eval_all_samples
- Subclasses implement eval_sample to define per-sample evaluation logic
- Subclasses implement run to define the overall orchestration (typically: load data, call eval_all_samples, aggregate metrics)
- The base class handles threading, progress reporting, and recorder integration
```python
# Abstract evaluation contract, fleshed out into a runnable sketch
# (sample fields "input"/"ideal" follow the built-in Match template).
import random
from typing import Any

import evals
import evals.metrics


class CustomEval(evals.Eval):
    def eval_sample(self, sample: Any, rng: random.Random):
        prompt = sample["input"]                    # 1. Extract prompt from sample
        result = self.completion_fn(prompt=prompt)  # 2. Get model output
        sampled = result.get_completions()[0]
        evals.record_and_check_match(               # 3./4. Compare and record
            prompt=prompt, sampled=sampled, expected=sample["ideal"]
        )

    def run(self, recorder):
        samples = self.get_samples()              # 1. Load samples_jsonl
        self.eval_all_samples(recorder, samples)  # 2. Parallel evaluation
        events = recorder.get_events("match")     # 3. Collect match events
        return {"accuracy": evals.metrics.get_accuracy(events)}  # 4. Metrics dict
```
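The metrics dict returned by run becomes the run's final report: the oaieval runner passes it to the active recorder, which writes it alongside the per-sample events in the run log.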
Key helpers available to implementations:
- eval_all_samples — base-class method for parallel sample evaluation via a ThreadPool (thread count is read from the EVALS_THREADS environment variable)
- get_samples — base-class method that loads samples from the samples_jsonl path
- completion_fn — base-class property exposing the single configured CompletionFn
- record_and_check_match — module-level utility (evals.record_and_check_match) for standard match recording
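For reference, one line of the JSONL data that get_samples loads for the sketch above might look like the following; the input/ideal field names are the Match-template convention rather than a requirement of the base class:

```json
{"input": [{"role": "user", "content": "What is 2 + 2?"}], "ideal": "4"}
```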