
Principle:Openai Evals Custom Eval Implementation

From Leeroopedia
Domains: Evaluation, Software_Architecture
Last Updated: 2026-02-14 10:00 GMT

Overview

An abstract class pattern that defines the contract for implementing custom evaluation logic through method overriding.

Description

Custom Eval Implementation is the process of creating a new evaluation by subclassing the Eval abstract base class and implementing two required methods: eval_sample, which evaluates a single test case, and run, which orchestrates the full evaluation (data loading, sample iteration, and result aggregation). The base class provides built-in support for parallel execution via eval_all_samples (backed by a ThreadPool), deterministic shuffling, sample limiting, and integration with the recording system. A secondary base class, SolverEval, extends this pattern for evaluations that require stateful, multi-turn interactions via the Solver interface.

Usage

Implement a custom eval when none of the built-in templates (Match, Includes, FuzzyMatch) fit the evaluation requirements. Common cases include custom scoring logic, multi-step reasoning evaluation, and evaluations that require specialized prompt construction.
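For instance, grading a numeric answer within a relative tolerance is scoring logic that none of the built-in templates cover. The sketch below is illustrative only: the function name, tolerance, and parsing rules are hypothetical and not part of the evals API; this is the kind of per-sample check an eval_sample override would apply to the model's output.

```python
# Hypothetical custom scorer (not part of the evals API): accept a
# numeric answer if it parses as a float within a relative tolerance.
def score_numeric(sampled: str, expected: float, rel_tol: float = 0.01) -> bool:
    """Return True if the model output parses as a number within tolerance."""
    try:
        # Tolerate surrounding whitespace and a trailing period
        value = float(sampled.strip().rstrip("."))
    except ValueError:
        return False
    return abs(value - expected) <= rel_tol * abs(expected)
```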

Theoretical Basis

The evaluation pattern follows the Template Method design pattern:

  1. The Eval base class defines the skeleton algorithm in eval_all_samples
  2. Subclasses implement eval_sample to define per-sample evaluation logic
  3. Subclasses implement run to define the overall orchestration (typically: load data, call eval_all_samples, aggregate metrics)
  4. The base class handles threading, progress reporting, and recorder integration
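The steps above can be sketched as a simplified skeleton. Method names mirror openai/evals, but sample limiting, progress reporting, and recorder integration are reduced to their essentials, so treat this as an assumption-laden illustration of the Template Method shape rather than the library's actual implementation.

```python
# Simplified Template Method skeleton; the real base class also handles
# sample limiting, progress reporting, and recorder integration.
import random
from multiprocessing.pool import ThreadPool

class Eval:
    def __init__(self, seed=20220722, threads=4):
        self.seed = seed
        self.threads = threads

    def eval_sample(self, sample, rng):
        raise NotImplementedError  # subclass hook: per-sample logic

    def run(self, recorder):
        raise NotImplementedError  # subclass hook: overall orchestration

    def eval_all_samples(self, recorder, samples):
        # Deterministic shuffle keeps parallel runs reproducible
        rng = random.Random(self.seed)
        samples = list(samples)
        rng.shuffle(samples)

        def work(sample):
            # Each sample gets its own seeded RNG
            return self.eval_sample(sample, random.Random(self.seed))

        with ThreadPool(self.threads) as pool:
            return list(pool.map(work, samples))
```

Subclasses only fill in the two hooks; the base class owns shuffling and threading, which is exactly the Template Method division of labor described above.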
# Sketch of the abstract evaluation contract; call names follow the
# openai/evals API but details (e.g. sample field names) may vary per eval.
class CustomEval(Eval):
    def eval_sample(self, sample, rng):
        # 1. Extract the prompt from the sample
        prompt = sample["input"]
        # 2. Call the completion function to get the model output
        result = self.completion_fn(prompt)
        sampled = result.get_completions()[0]
        # 3. Compare the output to the expected answer and
        # 4. record the result via the recording system
        evals.record_and_check_match(
            prompt=prompt, sampled=sampled, expected=sample["ideal"]
        )

    def run(self, recorder):
        # 1. Load samples from the configured samples_jsonl path
        samples = self.get_samples()
        # 2. Evaluate all samples in parallel via the base class
        self.eval_all_samples(recorder, samples)
        # 3-4. Aggregate recorded match events into a metrics dict
        events = recorder.get_events("match")
        return {"accuracy": evals.metrics.get_accuracy(events)}

Key helper methods provided by the base class:

  • eval_all_samples — Parallel sample evaluation with ThreadPool
  • get_samples — Load samples from samples_jsonl path
  • completion_fn — Property giving convenient access when the eval uses a single CompletionFn
  • record_and_check_match — Utility for standard match recording
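get_samples parses one JSON object per line from the samples_jsonl file. The self-contained sketch below shows that loading step; the field names "input" and "ideal" follow the common evals sample convention, and the helper name load_samples is illustrative, not the library's.

```python
# Sketch of JSONL sample loading as performed by get_samples.
# "input" and "ideal" follow the usual evals sample convention.
import json

def load_samples(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

raw = [
    '{"input": [{"role": "user", "content": "What is 2+2?"}], "ideal": "4"}',
    '{"input": [{"role": "user", "content": "Name a prime."}], "ideal": ["2", "3", "5", "7"]}',
]
samples = load_samples(raw)
```

Note that "ideal" may be a single string or a list of acceptable answers, which is why per-sample comparison logic lives in eval_sample rather than in the loader.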
