Principle: OpenAI Evals Custom Eval Implementation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Software_Architecture |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
An abstract class pattern that defines the contract for implementing custom evaluation logic through method overriding.
Description
Custom Eval Implementation is the process of creating a new evaluation by subclassing the Eval abstract base class and implementing two required methods: eval_sample, which evaluates a single test case, and run, which orchestrates the full evaluation (data loading, sample iteration, and result aggregation). The base class provides built-in support for parallel execution via eval_all_samples (backed by a ThreadPool), deterministic shuffling, sample limiting, and integration with the recording system. A second base class, SolverEval, extends this pattern to evaluations that require stateful, multi-turn interactions via the Solver interface, as sketched below.
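For the multi-turn case, the subclass implements eval_sample with an extra solver argument; the framework hands each sample its own copy of the Solver so conversation state cannot leak across samples. A minimal sketch, assuming the TaskState and Message interfaces of the solvers framework (the class name, task description, and input/ideal field names are illustrative; run is written the same way as in the Eval sketch under Theoretical Basis):

```python
import random
from typing import Any

import evals.record
from evals.eval import SolverEval
from evals.solvers.solver import Solver
from evals.task_state import Message, TaskState


class MultiTurnEval(SolverEval):
    def eval_sample(self, solver: Solver, sample: Any, rng: random.Random):
        # Build the initial task state for this sample's conversation.
        task_state = TaskState(
            task_description="Answer the question in one word.",
            messages=[Message(role="user", content=sample["input"])],
        )
        # The solver may take multiple internal turns before returning.
        result = solver(task_state)
        # Record a per-sample metric for run() to aggregate later.
        evals.record.record_metrics(
            correct=(result.output.strip() == sample["ideal"])
        )
```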
Usage
Implement a custom eval when none of the built-in templates (Match, Includes, FuzzyMatch) fits the evaluation requirements. Common cases include custom scoring logic, multi-step reasoning, and specialized prompt construction.
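A custom eval must also be registered before the oaieval CLI can run it by name. A minimal sketch of the registry YAML, with illustrative eval name, module path, and samples path:

```yaml
# Registered under evals/registry/evals/ (file name is illustrative)
my-custom-eval:
  id: my-custom-eval.dev.v0
  metrics: [accuracy]

my-custom-eval.dev.v0:
  class: evals.elsuite.my_custom_eval:CustomEval
  args:
    samples_jsonl: my_custom_eval/samples.jsonl
```

With this entry in place, the eval can be run with, for example, `oaieval gpt-3.5-turbo my-custom-eval`.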
Theoretical Basis
The evaluation pattern follows the Template Method design pattern:
- The Eval base class defines the skeleton algorithm in eval_all_samples
- Subclasses implement eval_sample to define per-sample evaluation logic
- Subclasses implement run to define the overall orchestration (typically: load data, call eval_all_samples, aggregate metrics)
- The base class handles threading, progress reporting, and recorder integration
```python
# Abstract evaluation contract, fleshed out into a runnable sketch
# (sample fields "input"/"ideal" follow the built-in Match template).
import random
from typing import Any

import evals
import evals.metrics


class CustomEval(evals.Eval):
    def eval_sample(self, sample: Any, rng: random.Random):
        prompt = sample["input"]                    # 1. Extract prompt from sample
        result = self.completion_fn(prompt=prompt)  # 2. Get model output
        sampled = result.get_completions()[0]
        evals.record_and_check_match(               # 3./4. Compare and record
            prompt=prompt, sampled=sampled, expected=sample["ideal"]
        )

    def run(self, recorder):
        samples = self.get_samples()              # 1. Load samples_jsonl
        self.eval_all_samples(recorder, samples)  # 2. Parallel evaluation
        events = recorder.get_events("match")     # 3. Collect match events
        return {"accuracy": evals.metrics.get_accuracy(events)}  # 4. Metrics dict
```
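The metrics dict returned by run becomes the run's final report: the oaieval runner passes it to the active recorder, which writes it alongside the per-sample events in the run log.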
Key helpers available to implementations:
- eval_all_samples — base-class method for parallel sample evaluation via a ThreadPool (thread count is read from the EVALS_THREADS environment variable)
- get_samples — base-class method that loads samples from the samples_jsonl path
- completion_fn — base-class property exposing the single configured CompletionFn
- record_and_check_match — module-level utility (evals.record_and_check_match) for standard match recording
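For reference, one line of the JSONL data that get_samples loads for the sketch above might look like the following; the input/ideal field names are the Match-template convention rather than a requirement of the base class:

```json
{"input": [{"role": "user", "content": "What is 2 + 2?"}], "ideal": "4"}
```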