Workflow: OpenAI Evals - Building a Custom Eval
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Data_Engineering, Model_Testing |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
End-to-end process for creating a new evaluation task, from formatting datasets through writing an Eval class to registering and running it via the CLI.
Description
This workflow guides the creation of a custom evaluation for the OpenAI Evals framework. It covers formatting evaluation data into the required JSONL format, choosing between built-in eval templates (Match, Includes, FuzzyMatch, JsonMatch) or writing a custom Eval subclass with bespoke evaluation logic, registering the eval in the YAML registry, and running it against a target model. The output is a new eval that can be executed repeatedly against different models to benchmark capabilities.
Usage
Execute this workflow when you need to evaluate a model capability not covered by the 358 existing eval tasks, or when you require custom scoring logic beyond simple string matching. You should have a dataset of input-output pairs for your evaluation domain and understand which eval pattern (basic template or custom class) best fits your needs.
Execution Steps
Step 1: Prepare the Dataset
Convert evaluation data into JSONL format where each line is a JSON object representing one test sample. Every sample must include an "input" key containing the prompt (preferably in chat format as a list of message objects with role and content fields). For basic eval templates, include an "ideal" key with the expected answer(s). Place the file at evals/registry/data/<eval_name>/samples.jsonl.
Key considerations:
- Use chat format (list of message dicts) for prompts even when evaluating non-chat models
- The framework handles conversion between chat and raw string formats
- For model-graded evals, the required keys depend on the evaluation prompt template
- Data can also be loaded from cloud storage using path-style URLs
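The samples file can be produced with a few lines of Python. The eval name, prompts, and answers below are invented purely for illustration; the shape of each line (an "input" list of chat messages plus an "ideal" answer) follows the convention described above:

```python
import json

# Illustrative samples for a hypothetical "arithmetic" eval. In a real
# project this file would live at evals/registry/data/arithmetic/samples.jsonl.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with the number only."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "ideal": "4",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with the number only."},
            {"role": "user", "content": "What is 7 * 6?"},
        ],
        # "ideal" may also be a list when several answers are acceptable.
        "ideal": ["42", "forty-two"],
    },
]

# Write one JSON object per line (JSONL).
with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```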
Step 2: Choose an Eval Template or Write Custom Logic
Decide whether to use a built-in eval template or implement custom evaluation logic. The framework provides four basic templates: Match (prefix matching), Includes (substring matching), FuzzyMatch (bidirectional containment), and JsonMatch (JSON structural comparison). For more complex evaluations, create a Python class that inherits from the Eval base class and overrides the eval_sample and run methods.
Key considerations:
- Basic templates require no coding, only dataset and YAML configuration
- Custom evals allow arbitrary scoring logic and metrics
- The Eval base class provides eval_all_samples for parallel execution
- SolverEval supports stateful multi-turn evaluation scenarios
- Inspect model completions to decide which template fits best
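To make the template semantics concrete, here are simplified stand-ins for three of the four checks. These are not the framework's implementations (which also handle answer lists, case options, and other normalization), just a sketch of the matching rules each template applies:

```python
def match(completion: str, ideal: str) -> bool:
    # Match-style check: the sampled text must begin with the ideal answer.
    return completion.startswith(ideal)

def includes(completion: str, ideal: str) -> bool:
    # Includes-style check: the ideal answer appears anywhere in the sample.
    return ideal in completion

def fuzzy_match(completion: str, ideal: str) -> bool:
    # FuzzyMatch-style check: bidirectional containment after light
    # normalization (either string may contain the other).
    a, b = completion.strip().lower(), ideal.strip().lower()
    return a in b or b in a
```

For example, `match("The answer is 4", "4")` fails while `includes("The answer is 4", "4")` passes, which is why inspecting actual model completions before picking a template matters.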
Step 3: Implement the Eval Class (Custom Only)
For custom evals, create a Python file under evals/elsuite/ containing a class that inherits from evals.Eval. Implement __init__ to accept dataset paths and parameters, eval_sample to process a single test sample (generate prompts, call the completion function, check results), and run to orchestrate the evaluation (load data, call eval_all_samples, aggregate metrics). Use utilities from evals.api, evals.record, and evals.metrics.
Key considerations:
- eval_sample uses the default recorder implicitly; no need to pass it as an argument
- Use record_and_check_match for standard prompt-response-check patterns
- The completion_fn attribute provides access to the model under test
- Use evals.metrics.get_accuracy and similar functions for metric aggregation
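The shape of a custom eval can be sketched as follows. This is a self-contained structural analogue, not runnable framework code: a real eval would inherit from `evals.Eval`, record results via `evals.record` / `record_and_check_match`, fan out through `self.eval_all_samples`, and aggregate with `evals.metrics.get_accuracy`. The class and helper names here are illustrative:

```python
import random

class ArithmeticEval:
    """Structural sketch of a custom eval class. A real implementation
    would subclass evals.Eval; this standalone version only mirrors the
    __init__ / eval_sample / run division of labor."""

    def __init__(self, completion_fn, samples):
        # completion_fn stands in for the model under test.
        self.completion_fn = completion_fn
        self.samples = samples

    def eval_sample(self, sample, rng: random.Random) -> bool:
        # Generate the prompt, call the completion function, check the result.
        completion = self.completion_fn(sample["input"])
        return completion.strip().startswith(sample["ideal"])

    def run(self):
        # A real run() would load JSONL data and call self.eval_all_samples
        # for parallel execution; here we loop sequentially and aggregate.
        rng = random.Random(0)
        results = [self.eval_sample(s, rng) for s in self.samples]
        return {"accuracy": sum(results) / len(results)}

# Stubbed completion function so the sketch is self-contained.
def fake_completion_fn(messages):
    return "4" if "2 + 2" in messages[-1]["content"] else "0"

samples = [
    {"input": [{"role": "user", "content": "What is 2 + 2?"}], "ideal": "4"},
    {"input": [{"role": "user", "content": "What is 3 + 3?"}], "ideal": "6"},
]
report = ArithmeticEval(fake_completion_fn, samples).run()
```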
Step 4: Register the Eval
Create a YAML file at evals/registry/evals/<eval_name>.yaml that defines the eval entry. The file contains a base eval entry (with id, description, and metrics fields) and the concrete eval spec (with class path and args). The class path uses dotted module notation with a colon separator for the class name. The args should match the constructor parameters of the eval class.
Key considerations:
- Follow the naming convention: eval_name.split.version (e.g., arithmetic.dev.match-v1)
- The base eval entry acts as an alias that dereferences to the full spec
- Bump the version when changing the eval to maintain reproducibility
- For basic templates, specify the class as evals.elsuite.basic.match:Match (or similar)
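A registry entry for a basic Match template might look like the fragment below. The eval name, description, and data path are illustrative; note the base entry aliasing the versioned spec, and the `module:Class` notation:

```
arithmetic:
  id: arithmetic.dev.match-v1
  description: Evaluate basic arithmetic on small integers.
  metrics: [accuracy]

arithmetic.dev.match-v1:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: arithmetic/samples.jsonl
```

The `args` keys must line up with the eval class's constructor parameters; for the basic templates this is primarily the path to the samples file.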
Step 5: Run and Iterate
Execute the eval using the oaieval CLI, inspect the results and per-sample logs, and iterate on the dataset, prompts, or scoring logic. Verify that the eval produces meaningful differentiation between models or model versions. Check that reference answers are correct and that the eval template matches the expected response format.
Key considerations:
- Use --max_samples for quick iteration during development
- Inspect JSONL logs to understand per-sample behavior
- Ensure the eval is thematically consistent and challenging
- Verify that a human expert could reasonably solve the eval tasks
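Assuming the illustrative eval name registered above, a development loop with the oaieval CLI might look like this:

```
# Run the full eval against a model (names are illustrative).
oaieval gpt-3.5-turbo arithmetic.dev.match-v1

# Quick iteration: limit the run to a handful of samples.
oaieval gpt-3.5-turbo arithmetic.dev.match-v1 --max_samples 20
```

After each run, inspect the JSONL log it produces to see per-sample prompts, completions, and match results before adjusting the dataset or scoring logic.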