Workflow: OpenAI Evals - Building a Custom Eval
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Data_Engineering, Model_Testing |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
End-to-end process for creating a new evaluation task, from formatting datasets through writing an Eval class to registering and running it via the CLI.
Description
This workflow guides the creation of a custom evaluation for the OpenAI Evals framework. It covers formatting evaluation data into the required JSONL format, choosing between built-in eval templates (Match, Includes, FuzzyMatch, JsonMatch) or writing a custom Eval subclass with bespoke evaluation logic, registering the eval in the YAML registry, and running it against a target model. The output is a new eval that can be executed repeatedly against different models to benchmark capabilities.
Usage
Execute this workflow when you need to evaluate a model capability not covered by the 358 existing eval tasks, or when you require custom scoring logic beyond simple string matching. You should have a dataset of input-output pairs for your evaluation domain and understand which eval pattern (basic template or custom class) best fits your needs.
Execution Steps
Step 1: Prepare the Dataset
Convert evaluation data into JSONL format where each line is a JSON object representing one test sample. Every sample must include an "input" key containing the prompt (preferably in chat format as a list of message objects with role and content fields). For basic eval templates, include an "ideal" key with the expected answer(s). Place the file at evals/registry/data/<eval_name>/samples.jsonl.
Key considerations:
- Use chat format (list of message dicts) for prompts even when evaluating non-chat models
- The framework handles conversion between chat and raw string formats
- For model-graded evals, the required keys depend on the evaluation prompt template
- Data can also be loaded from cloud storage using path-style URLs
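The samples file can be produced with a few lines of Python. The eval name, prompts, and answers below are invented purely for illustration; the shape of each line (an "input" list of chat messages plus an "ideal" answer) follows the convention described above:

```python
import json

# Illustrative samples for a hypothetical "arithmetic" eval. In a real
# project this file would live at evals/registry/data/arithmetic/samples.jsonl.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with the number only."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "ideal": "4",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with the number only."},
            {"role": "user", "content": "What is 7 * 6?"},
        ],
        # "ideal" may also be a list when several answers are acceptable.
        "ideal": ["42", "forty-two"],
    },
]

# Write one JSON object per line (JSONL).
with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```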
Step 2: Choose an Eval Template or Write Custom Logic
Decide whether to use a built-in eval template or implement custom evaluation logic. The framework provides four basic templates: Match (prefix matching), Includes (substring matching), FuzzyMatch (bidirectional containment), and JsonMatch (JSON structural comparison). For more complex evaluations, create a Python class that inherits from the Eval base class and overrides the eval_sample and run methods.
Key considerations:
- Basic templates require no coding, only dataset and YAML configuration
- Custom evals allow arbitrary scoring logic and metrics
- The Eval base class provides eval_all_samples for parallel execution
- SolverEval supports stateful multi-turn evaluation scenarios
- Inspect model completions to decide which template fits best
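To make the template semantics concrete, here are simplified stand-ins for three of the four checks. These are not the framework's implementations (which also handle answer lists, case options, and other normalization), just a sketch of the matching rules each template applies:

```python
def match(completion: str, ideal: str) -> bool:
    # Match-style check: the sampled text must begin with the ideal answer.
    return completion.startswith(ideal)

def includes(completion: str, ideal: str) -> bool:
    # Includes-style check: the ideal answer appears anywhere in the sample.
    return ideal in completion

def fuzzy_match(completion: str, ideal: str) -> bool:
    # FuzzyMatch-style check: bidirectional containment after light
    # normalization (either string may contain the other).
    a, b = completion.strip().lower(), ideal.strip().lower()
    return a in b or b in a
```

For example, `match("The answer is 4", "4")` fails while `includes("The answer is 4", "4")` passes, which is why inspecting actual model completions before picking a template matters.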
Step 3: Implement the Eval Class (Custom Only)
For custom evals, create a Python file under evals/elsuite/ containing a class that inherits from evals.Eval. Implement __init__ to accept dataset paths and parameters, eval_sample to process a single test sample (generate prompts, call the completion function, check results), and run to orchestrate the evaluation (load data, call eval_all_samples, aggregate metrics). Use utilities from evals.api, evals.record, and evals.metrics.
Key considerations:
- eval_sample uses the default recorder implicitly; no need to pass it as an argument
- Use record_and_check_match for standard prompt-response-check patterns
- The completion_fn attribute provides access to the model under test
- Use evals.metrics.get_accuracy and similar functions for metric aggregation
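The shape of a custom eval can be sketched as follows. This is a self-contained structural analogue, not runnable framework code: a real eval would inherit from `evals.Eval`, record results via `evals.record` / `record_and_check_match`, fan out through `self.eval_all_samples`, and aggregate with `evals.metrics.get_accuracy`. The class and helper names here are illustrative:

```python
import random

class ArithmeticEval:
    """Structural sketch of a custom eval class. A real implementation
    would subclass evals.Eval; this standalone version only mirrors the
    __init__ / eval_sample / run division of labor."""

    def __init__(self, completion_fn, samples):
        # completion_fn stands in for the model under test.
        self.completion_fn = completion_fn
        self.samples = samples

    def eval_sample(self, sample, rng: random.Random) -> bool:
        # Generate the prompt, call the completion function, check the result.
        completion = self.completion_fn(sample["input"])
        return completion.strip().startswith(sample["ideal"])

    def run(self):
        # A real run() would load JSONL data and call self.eval_all_samples
        # for parallel execution; here we loop sequentially and aggregate.
        rng = random.Random(0)
        results = [self.eval_sample(s, rng) for s in self.samples]
        return {"accuracy": sum(results) / len(results)}

# Stubbed completion function so the sketch is self-contained.
def fake_completion_fn(messages):
    return "4" if "2 + 2" in messages[-1]["content"] else "0"

samples = [
    {"input": [{"role": "user", "content": "What is 2 + 2?"}], "ideal": "4"},
    {"input": [{"role": "user", "content": "What is 3 + 3?"}], "ideal": "6"},
]
report = ArithmeticEval(fake_completion_fn, samples).run()
```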
Step 4: Register the Eval
Create a YAML file at evals/registry/evals/<eval_name>.yaml that defines the eval entry. The file contains a base eval entry (with id, description, and metrics fields) and the concrete eval spec (with class path and args). The class path uses dotted module notation with a colon separator for the class name. The args should match the constructor parameters of the eval class.
Key considerations:
- Follow the naming convention: eval_name.split.version (e.g., arithmetic.dev.match-v1)
- The base eval entry acts as an alias that dereferences to the full spec
- Bump the version when changing the eval to maintain reproducibility
- For basic templates, specify the class as evals.elsuite.basic.match:Match (or similar)
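A registry entry for a basic Match template might look like the fragment below. The eval name, description, and data path are illustrative; note the base entry aliasing the versioned spec, and the `module:Class` notation:

```
arithmetic:
  id: arithmetic.dev.match-v1
  description: Evaluate basic arithmetic on small integers.
  metrics: [accuracy]

arithmetic.dev.match-v1:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: arithmetic/samples.jsonl
```

The `args` keys must line up with the eval class's constructor parameters; for the basic templates this is primarily the path to the samples file.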
Step 5: Run and Iterate
Execute the eval using the oaieval CLI, inspect the results and per-sample logs, and iterate on the dataset, prompts, or scoring logic. Verify that the eval produces meaningful differentiation between models or model versions. Check that reference answers are correct and that the eval template matches the expected response format.
Key considerations:
- Use --max_samples for quick iteration during development
- Inspect JSONL logs to understand per-sample behavior
- Ensure the eval is thematically consistent and challenging
- Verify that a human expert could reasonably solve the eval tasks
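Assuming the illustrative eval name registered above, a development loop with the oaieval CLI might look like this:

```
# Run the full eval against a model (names are illustrative).
oaieval gpt-3.5-turbo arithmetic.dev.match-v1

# Quick iteration: limit the run to a handful of samples.
oaieval gpt-3.5-turbo arithmetic.dev.match-v1 --max_samples 20
```

After each run, inspect the JSONL log it produces to see per-sample prompts, completions, and match results before adjusting the dataset or scoring logic.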