Workflow:Openai Evals Running a single eval

Knowledge Sources	OpenAI Evals, OpenAI Evals Run Guide, OpenAI Eval Templates
Domains	LLM_Evaluation, Model_Testing
Last Updated	2026-02-14 10:00 GMT

Overview

End-to-end process for running a single language model evaluation against a registered eval using the oaieval CLI tool.

Description

This workflow covers the standard procedure for executing a single model evaluation from the command line. The oaieval CLI parses its arguments to identify the completion function (a model or a custom function) and the eval name, then resolves both through the YAML-driven registry. It instantiates the appropriate Eval subclass, runs all samples in parallel using a thread pool, records events via the recorder infrastructure, and outputs a final metrics report as JSON lines. This is the primary "happy path" for anyone who wants to evaluate an LLM against one of the 358 community-contributed evaluation tasks or a custom eval.

Usage

Execute this workflow when you want to evaluate a specific model (e.g., gpt-3.5-turbo, gpt-4) or a custom completion function against a single evaluation task from the registry. You should have an OpenAI API key configured and the evals package installed. Use this when you need a quick, focused assessment of model performance on a specific capability (e.g., math, translation, logic).

Execution Steps

Step 1: Environment Setup

Ensure the Python environment has the evals package installed and the OPENAI_API_KEY environment variable is set. The package requires Python 3.9 or later. If evaluating against data stored with Git-LFS, ensure that the eval data files have been fetched and pulled.

Key considerations:

Install via `pip install evals` for running only, or `pip install -e .` for development
Set OPENAI_API_KEY as an environment variable
Fetch Git-LFS data if evaluating against registry data files
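
The checks above can be sketched as a small preflight script. The helper name `preflight_problems` is hypothetical and not part of the evals package. It only inspects the interpreter version, the environment, and whether a module named `evals` is importable, and it does not verify Git-LFS data.

```python
import importlib.util
import os
import sys


def preflight_problems(env=None, version=None, module="evals"):
    """Return a list of human-readable setup problems (empty if none found)."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    # The workflow states Python 3.9 or later is required.
    if version < (3, 9):
        problems.append(f"Python 3.9+ required, found {version[0]}.{version[1]}")
    # An empty or missing key is treated as "not set".
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module) is None:
        problems.append(f"package '{module}' is not importable")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print("WARNING:", problem)
```

Running the script before an eval prints one warning per missing prerequisite and nothing when the environment looks ready.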

Step 2: Identify the Completion Function

Select the model or completion function to evaluate. This can be a direct OpenAI model name (e.g., gpt-3.5-turbo), which is dynamically instantiated as a CompletionFn, or a registered completion function key from the YAML files in the registry. Multiple completion functions can be specified as a comma-separated list.

Key considerations:

Chat models (gpt-3.5-turbo, gpt-4) are auto-detected and wrapped in OpenAIChatCompletionFn
Non-chat models use OpenAICompletionFn
Custom completion functions must be registered in evals/registry/completion_fns/
Solver-based completion functions are also supported via the registry
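
As a rough illustration of the selection logic described above, the sketch below splits a comma-separated argument and guesses which wrapper applies to each name. The prefix list, the check order, and both function names are assumptions made for this example; the actual detection rule inside the evals package may differ.

```python
# Hypothetical prefixes used only for this illustration; the real
# detection rule inside the evals package may be different.
CHAT_PREFIXES = ("gpt-3.5-turbo", "gpt-4")


def split_completion_fns(spec):
    """Split a comma-separated completion-function argument into names."""
    return [name.strip() for name in spec.split(",") if name.strip()]


def wrapper_for(name, registered=()):
    """Guess which wrapper the text above describes for a given name."""
    if name.startswith(CHAT_PREFIXES):
        return "OpenAIChatCompletionFn"
    if name in registered:
        return "registry"  # key defined in a registry YAML file
    return "OpenAICompletionFn"


if __name__ == "__main__":
    for fn in split_completion_fns("gpt-4, my_custom_fn"):
        print(fn, "->", wrapper_for(fn, registered={"my_custom_fn"}))
```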

Step 3: Select the Eval

Choose the evaluation task to run by its registered name. Eval names follow the convention eval_name.split.version (e.g., arithmetic.dev.match-v1). The registry resolves the name through YAML alias chains to find the concrete eval spec, which identifies the Eval class and its arguments (including the dataset path).

Key considerations:

Valid eval names are defined in YAML files under evals/registry/evals/
Eval aliases allow shorthand names that dereference to full specs
The eval spec contains the class path, dataset location, and evaluation parameters
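
The naming convention and alias dereferencing can be illustrated with a toy in-memory registry. The entries below are illustrative and not copied from the real YAML files, and the key names `id`, `class`, and `args` are assumptions about the spec layout. `split_eval_name` and `resolve` are hypothetical helpers, not library functions.

```python
# Toy registry mirroring the alias -> concrete spec shape described above.
TOY_REGISTRY = {
    "arithmetic": {"id": "arithmetic.dev.match-v1"},  # alias entry
    "arithmetic.dev.match-v1": {                      # concrete spec
        "class": "evals.elsuite.basic.match:Match",
        "args": {"samples_jsonl": "arithmetic/samples.jsonl"},
    },
}


def split_eval_name(name):
    """Split 'eval_name.split.version' into its three parts."""
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected eval_name.split.version, got {name!r}")
    return tuple(parts)


def resolve(name, registry, max_hops=10):
    """Follow alias entries until a concrete spec (one with 'class') is found."""
    for _ in range(max_hops):
        entry = registry[name]
        if "class" in entry:
            return name, entry
        name = entry["id"]  # dereference the alias and try again
    raise RuntimeError("alias chain too long or cyclic")


if __name__ == "__main__":
    full_name, spec = resolve("arithmetic", TOY_REGISTRY)
    print(full_name, split_eval_name(full_name), spec["class"])
```

The hop limit guards against a cyclic alias chain, which would otherwise loop forever.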

Step 4: Execute the Eval via CLI

Run the oaieval command with the chosen completion function and eval name. The CLI builds a run configuration, instantiates a Recorder (local JSONL by default), creates the Eval class instance, and calls its run method. The eval loads dataset samples and dispatches them to eval_sample in parallel via a thread pool (default 10 threads).
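
The parallel dispatch described above can be approximated with the standard library. This is a minimal sketch, not the library's implementation: `run_samples` is a hypothetical helper, `concurrent.futures` stands in for whatever pool the package uses internally, and the worker count is read from EVALS_THREADS with a fallback of 10.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_samples(samples, eval_sample, default_threads=10):
    """Apply eval_sample to every sample using a pool of worker threads."""
    threads = int(os.environ.get("EVALS_THREADS", default_threads))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves input order even though work runs concurrently.
        return list(pool.map(eval_sample, samples))


if __name__ == "__main__":
    # Stand-in for a real eval_sample, which would call the completion function.
    print(run_samples(["2+2", "3*3"], lambda prompt: len(prompt)))
```

Threads suit this workload because each sample spends most of its time waiting on a network call rather than on CPU-bound work.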

Key considerations:

Use --max_samples to limit the number of samples for quick testing
Use --record_path to specify a custom output location
Threading is configurable via the EVALS_THREADS environment variable
The --dry-run flag replaces the recorder with a dummy one, so results are not written to the record path; do not rely on it to avoid API calls
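
A small sketch of assembling the invocation from Python follows. `build_oaieval_command` is a hypothetical helper that only builds the argument list using the flags named above; it does not run anything, and the eval name is the example from Step 3.

```python
import shlex


def build_oaieval_command(completion_fn, eval_name, max_samples=None, record_path=None):
    """Return the argv list for an oaieval invocation (does not run it)."""
    argv = ["oaieval", completion_fn, eval_name]
    if max_samples is not None:
        argv += ["--max_samples", str(max_samples)]
    if record_path is not None:
        argv += ["--record_path", record_path]
    return argv


if __name__ == "__main__":
    cmd = build_oaieval_command("gpt-3.5-turbo", "arithmetic.dev.match-v1", max_samples=5)
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, optionally with EVALS_THREADS set in the `env` argument to change the thread count.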

Step 5: Review Results

After execution completes, the CLI prints a final report with metrics (e.g., accuracy). Detailed event logs are written to a JSONL file at the record path (default /tmp/evallogs/). Each line in the log represents an event: sampling calls, match results, and the final summary. Token usage statistics are aggregated from sampling events when available.

Key considerations:

The final report shows the metrics defined by the eval, such as accuracy, precision, or recall
JSONL logs contain per-sample details for deep analysis
Third-party tools can be used to visualize the logs
Token usage is reported when the API returns usage data
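
For a quick look at a log without extra tooling, the JSONL file can be scanned line by line. The sketch below assumes that match events carry a `type` of "match" and a boolean `data.correct`, and that the summary line has a `final_report` key. These field names are assumptions and should be checked against an actual log file. `summarize_log` is a hypothetical helper.

```python
import json


def summarize_log(lines):
    """Count match events and return (accuracy, final_report) from JSONL lines."""
    correct = total = 0
    final_report = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if "final_report" in record:
            final_report = record["final_report"]
        elif record.get("type") == "match":
            total += 1
            correct += bool(record.get("data", {}).get("correct"))
    accuracy = correct / total if total else None
    return accuracy, final_report


if __name__ == "__main__":
    sample = [
        '{"final_report": {"accuracy": 0.5}}',
        '{"type": "match", "data": {"correct": true}}',
        '{"type": "match", "data": {"correct": false}}',
    ]
    print(summarize_log(sample))
```

To use it on a real run, pass an open file handle, for example `summarize_log(open(path))`, where `path` points at a file under the record path.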
