Workflow:Openai Evals Running an eval set

Knowledge Sources	OpenAI Evals Run Evals Guide
Domains	LLM_Evaluation, Model_Testing, Batch_Processing
Last Updated	2026-02-14 10:00 GMT

Overview

End-to-end process for running a batch of evaluations against a model using the oaievalset CLI tool, with built-in progress tracking and resume capability.

Description

This workflow covers executing multiple evaluations as a batch using the oaievalset CLI. Eval sets are defined in YAML files under evals/registry/eval_sets/ and group related evaluation tasks together (e.g., all Chinese number conversions, all logical reasoning tasks, or a comprehensive test suite). The CLI resolves each eval in the set, generates individual oaieval commands, tracks progress to a file, and supports resuming interrupted runs. This enables systematic benchmarking across multiple capabilities in a single session.

Usage

Execute this workflow when you need to benchmark a model across a suite of related evaluations rather than running them individually. This is ideal for comprehensive model assessments, regression testing across model versions, or evaluating a model across an entire capability domain (e.g., all multilingual evals, all logic evals).

Execution Steps

Step 1: Environment Setup

Ensure the evals package is installed and the OPENAI_API_KEY environment variable is set. Large eval sets can issue many API requests, so estimate the cost and check your rate limits before starting. If needed, adjust the thread count and per-thread timeout through the EVALS_THREADS and EVALS_THREAD_TIMEOUT environment variables.

Key considerations:

Default is 10 threads per eval; each thread times out after 40 seconds
Increase thread count for faster execution, but respect rate limits
Increase timeout for evals with long prompts or responses
Be aware of API costs when running large eval sets
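
The settings above can be sketched in Python as a small helper that prepares the environment for a run. This is a minimal sketch, not part of the evals package: the helper name build_eval_env and the example values are assumptions made for illustration, and only the two variable names come from the text above.

```python
import os


def build_eval_env(threads=10, thread_timeout=40, base_env=None):
    """Return a copy of an environment mapping with the evals threading
    variables set. Hypothetical helper; the defaults mirror the values
    quoted above (10 threads, 40 second timeout)."""
    env = dict(os.environ if base_env is None else base_env)
    env["EVALS_THREADS"] = str(threads)
    env["EVALS_THREAD_TIMEOUT"] = str(thread_timeout)
    return env


# Example: more threads and a longer timeout for evals with long responses.
env = build_eval_env(threads=20, thread_timeout=120)
if "OPENAI_API_KEY" not in env:
    print("OPENAI_API_KEY is not set; API calls would fail.")
print(env["EVALS_THREADS"], env["EVALS_THREAD_TIMEOUT"])
```

The returned mapping could be passed as the env argument of subprocess.run when launching the CLI, which keeps the overrides local to that run instead of changing the parent shell.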

Step 2: Select the Eval Set

Choose an eval set from the registry. Eval sets are defined in YAML files under evals/registry/eval_sets/ and contain a list of eval names (potentially with wildcard patterns). The framework ships with 18 pre-defined eval sets covering topics from Chinese numbers to maze solving to stock options analysis.

Key considerations:

Eval set names map to YAML files in evals/registry/eval_sets/
Each set groups semantically related evaluation tasks
Custom eval sets can be created by adding new YAML files
Wildcard patterns in eval sets can match multiple eval variants
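
To illustrate how a wildcard entry in an eval set can fan out to several evals, here is a rough sketch using shell-style matching from the Python standard library. It is an approximation for intuition only: the registry's actual matching rules may differ, and the eval names below are hypothetical.

```python
from fnmatch import fnmatchcase


def expand_patterns(patterns, known_evals):
    """Return the names in known_evals that match at least one pattern,
    preserving the order of known_evals."""
    return [
        name
        for name in known_evals
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical eval names, for illustration only.
known = ["demo-task.dev.v0", "demo-task.test.v0", "other-task.dev.v0"]
print(expand_patterns(["demo-task*"], known))
```

A pattern without a wildcard matches only the eval with exactly that name, so an eval set can mix explicit names and patterns.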

Step 3: Execute the Eval Set

Run the oaievalset command with the model name and eval set name. The CLI resolves all evals in the set, generates oaieval sub-commands for each, and executes them sequentially. Progress is tracked in a file at /tmp/oaievalset/{model}.{eval_set}.progress.txt, enabling automatic resume if the run is interrupted.

Key considerations:

Resume works at the granularity of whole evals: an eval interrupted partway through restarts from its beginning
Progress tracking allows re-running the command to pick up where it left off
Delete the progress file to restart from the beginning
Additional oaieval arguments can be passed through to individual runs
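
The invocation described in this step can be sketched as follows. The helper names are hypothetical, the model name and eval set name are placeholders, and the --max_samples pass-through flag is an assumption about what oaieval accepts; check oaieval --help before relying on it. The progress path follows the pattern given above.

```python
import shlex


def build_evalset_command(model, eval_set, extra_args=()):
    """Build the argument list for an oaievalset run. Extra arguments are
    appended so they can be passed through to the individual oaieval runs."""
    return ["oaievalset", model, eval_set, *extra_args]


def progress_file_path(model, eval_set):
    """Path of the progress file, following the pattern
    /tmp/oaievalset/{model}.{eval_set}.progress.txt."""
    return f"/tmp/oaievalset/{model}.{eval_set}.progress.txt"


# Placeholder model and eval set names; --max_samples is an assumed flag.
cmd = build_evalset_command("gpt-3.5-turbo", "test", ["--max_samples", "10"])
print(shlex.join(cmd))
print(progress_file_path("gpt-3.5-turbo", "test"))
# To launch the run for real: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when the list is handed to subprocess.run.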

Step 4: Monitor and Resume

Monitor the sequential execution of each eval in the set. If the run is interrupted (crash, manual stop, or error), simply re-run the same oaievalset command to resume from the last completed eval. The progress file records completed eval commands, and subsequent runs skip them automatically.

Key considerations:

Use --no-exit-on-error to continue the set even if individual evals fail
The progress file is a plain text file listing the completed commands, one per line
Each eval produces its own JSONL log file in /tmp/evallogs/
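
The resume behavior can be pictured with a short sketch that compares the planned commands against the contents of a progress file. It assumes the file holds one completed command per line; the helper names and the commands in the example are hypothetical, and the CLI's own bookkeeping may differ in detail.

```python
def completed_commands(progress_text):
    """Return the non-empty lines of a progress file, one command per line
    (an assumption about the file layout)."""
    return [line.strip() for line in progress_text.splitlines() if line.strip()]


def remaining_commands(planned, progress_text):
    """Return the planned commands that are not yet recorded as completed."""
    done = set(completed_commands(progress_text))
    return [command for command in planned if command not in done]


# Hypothetical commands, for illustration only.
planned = [
    "oaieval demo-model demo-task.dev.v0",
    "oaieval demo-model demo-task.test.v0",
]
progress_text = "oaieval demo-model demo-task.dev.v0\n"
print(remaining_commands(planned, progress_text))
```

In practice, progress_text would come from reading the progress file for the model and eval set in question; a missing file simply means nothing has completed yet.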

Step 5: Aggregate and Analyze Results

After all evals complete, review the individual result logs in /tmp/evallogs/. Each eval run produces a separate JSONL file with its final report metrics. Aggregate these results to build a comprehensive picture of model performance across the eval set.

Key considerations:

Results are spread across multiple JSONL files, one per eval
Analyze the logs programmatically (for example, by parsing each JSONL file) or load them into third-party visualization tools
Compare results across model versions for regression detection
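
As one possible approach to aggregation, the sketch below collects the final report from each log and flags metrics that dropped between two runs. It assumes each log contains a JSON line with a spec object holding eval_name and a JSON line with a final_report object; verify this against your own logs. The helper names, the sample records, and the accuracy metric are illustrative.

```python
import json


def summarize_log(lines):
    """Return (eval_name, final_report) from the lines of one JSONL log.
    Either element is None if the corresponding record is absent."""
    eval_name, report = None, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "spec" in record:
            eval_name = record["spec"].get("eval_name")
        elif "final_report" in record:
            report = record["final_report"]
    return eval_name, report


def aggregate(logs):
    """Map eval name to final report for every log that has both."""
    results = {}
    for lines in logs:
        name, report = summarize_log(lines)
        if name is not None and report is not None:
            results[name] = report
    return results


def regressions(baseline, candidate, metric="accuracy"):
    """Return evals whose metric is lower in candidate than in baseline."""
    return {
        name: (baseline[name][metric], candidate[name][metric])
        for name in baseline
        if name in candidate
        and metric in baseline[name]
        and metric in candidate[name]
        and candidate[name][metric] < baseline[name][metric]
    }


# Made-up records standing in for the contents of one log file.
sample_log = [
    '{"spec": {"eval_name": "demo-task.dev.v0"}}',
    '{"final_report": {"accuracy": 0.75}}',
]
print(aggregate([sample_log]))
```

To run this over real output, each element of logs could be an open file handle for a path returned by glob.glob("/tmp/evallogs/*.jsonl"), since iterating a file yields its lines.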
