Workflow:Openai Evals Running an eval set
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Model_Testing, Batch_Processing |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
End-to-end process for running a batch of evaluations against a model using the oaievalset CLI tool, with built-in progress tracking and resume capability.
Description
This workflow covers executing multiple evaluations as a batch using the oaievalset CLI. Eval sets are defined in YAML files under evals/registry/eval_sets/ and group related evaluation tasks together (e.g., all Chinese number conversions, all logical reasoning tasks, or a comprehensive test suite). The CLI resolves each eval in the set, generates individual oaieval commands, tracks progress to a file, and supports resuming interrupted runs. This enables systematic benchmarking across multiple capabilities in a single session.
Usage
Execute this workflow when you need to benchmark a model across a suite of related evaluations rather than running them individually. This is ideal for comprehensive model assessments, regression testing across model versions, or evaluating a model across an entire capability domain (e.g., all multilingual evals, all logic evals).
Execution Steps
Step 1: Environment Setup
Ensure the evals package is installed and the OPENAI_API_KEY environment variable is set. Large eval sets can issue many API requests, so check the expected cost and your rate limits before starting. If needed, adjust the thread count and per-thread timeout with the EVALS_THREADS and EVALS_THREAD_TIMEOUT environment variables.
Key considerations:
- Defaults: 10 threads per eval (EVALS_THREADS) and a 40-second timeout per thread (EVALS_THREAD_TIMEOUT)
- Increase thread count for faster execution, but respect rate limits
- Increase timeout for evals with long prompts or responses
- Be aware of API costs when running large eval sets
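The environment checks above can be sketched in Python as follows. This is a minimal illustration, not part of the evals package: the helper name `build_eval_env` and its parameters are hypothetical, and only the three environment variable names come from this workflow.

```python
import os


def build_eval_env(threads=10, thread_timeout=40, base_env=None):
    """Return a copy of the environment with the evals threading variables set.

    `build_eval_env` is a hypothetical helper for illustration; the defaults
    mirror the documented values (10 threads, 40-second timeout).
    """
    env = dict(os.environ if base_env is None else base_env)
    # Fail early instead of discovering a missing key midway through a long run.
    if not env.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Environment variable values must be strings.
    env["EVALS_THREADS"] = str(threads)
    env["EVALS_THREAD_TIMEOUT"] = str(thread_timeout)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the CLI, which keeps the settings scoped to that run rather than changing the parent shell.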
Step 2: Select the Eval Set
Choose an eval set from the registry. Eval sets are defined in YAML files under evals/registry/eval_sets/ and contain a list of eval names (potentially with wildcard patterns). The framework ships with 18 pre-defined eval sets covering topics from Chinese numbers to maze solving to stock options analysis.
Key considerations:
- Eval set names map to YAML files in evals/registry/eval_sets/
- Each set groups semantically related evaluation tasks
- Custom eval sets can be created by adding new YAML files
- Wildcard patterns in eval sets can match multiple eval variants
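To see which eval set files are available and to get a rough idea of what a wildcard pattern would select, something like the sketch below may help. Both helpers are hypothetical. `list_eval_set_files` returns file stems, which may differ from the set names defined inside each YAML file, and `expand_patterns` uses shell-style matching from the standard library as an approximation; the framework's own pattern resolution may behave differently.

```python
import fnmatch
from pathlib import Path


def list_eval_set_files(registry_dir="evals/registry/eval_sets"):
    """Return the sorted stems of the YAML files in the eval set directory."""
    return sorted(p.stem for p in Path(registry_dir).glob("*.yaml"))


def expand_patterns(patterns, known_evals):
    """Return the names in `known_evals` that match any pattern, in input order.

    Assumption: shell-style wildcards (fnmatch) approximate the registry's
    matching; this is not the framework's own resolver.
    """
    matched = []
    for name in known_evals:
        if name in matched:
            continue
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns):
            matched.append(name)
    return matched
```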
Step 3: Execute the Eval Set
Run the oaievalset command with the model name and eval set name. The CLI resolves all evals in the set, generates oaieval sub-commands for each, and executes them sequentially. Progress is tracked in a file at /tmp/oaievalset/{model}.{eval_set}.progress.txt, enabling automatic resume if the run is interrupted.
Key considerations:
- Resume works at the granularity of whole evals: an eval interrupted partway through restarts from its beginning
- Progress tracking allows re-running the command to pick up where it left off
- Delete the progress file to restart from the beginning
- Additional oaieval arguments can be passed through to individual runs
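The invocation and the progress file location described in this step can be sketched as below. The helper names are hypothetical; the command shape (`oaievalset`, then model, then eval set, then any pass-through arguments) and the path template come from this workflow.

```python
import subprocess
from pathlib import Path


def progress_file(model, eval_set, root="/tmp/oaievalset"):
    """Return the progress file path for a model and eval set pair."""
    return Path(root) / f"{model}.{eval_set}.progress.txt"


def build_command(model, eval_set, extra_args=()):
    """Build the oaievalset argument list; extra_args are passed through unchanged."""
    return ["oaievalset", model, eval_set, *extra_args]


def run_eval_set(model, eval_set, extra_args=(), restart=False):
    """Run the eval set and return the CLI exit code.

    With restart=True the progress file is deleted first, so the set
    starts again from its first eval instead of resuming.
    """
    path = progress_file(model, eval_set)
    if restart and path.exists():
        path.unlink()
    result = subprocess.run(build_command(model, eval_set, extra_args), check=False)
    return result.returncode
```

For example, `run_eval_set("gpt-3.5-turbo", "test")` would launch the set named `test`, and calling it again after an interruption would rely on the existing progress file to skip completed evals.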
Step 4: Monitor and Resume
Monitor the sequential execution of each eval in the set. If the run is interrupted (crash, manual stop, or error), re-run the same oaievalset command: it resumes with the first eval that has not yet completed. The progress file records the eval commands that finished, and subsequent runs skip them automatically.
Key considerations:
- Use --no-exit-on-error to continue the set even if individual evals fail
- The progress file is plain-text JSONL, with one completed command per line
- Each eval produces its own JSONL log file in /tmp/evallogs/
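To check how far a run has progressed, the progress file can be read line by line. The sketch below assumes only that each non-empty line is a valid JSON value describing one completed command; the exact shape of that value is not specified here, so the hypothetical helper `load_completed` returns the decoded entries as-is.

```python
import json
from pathlib import Path


def load_completed(progress_path):
    """Return the decoded entries of a JSONL progress file, or [] if it is absent."""
    path = Path(progress_path)
    if not path.exists():
        # No progress file means nothing has completed yet (or it was deleted).
        return []
    completed = []
    with path.open() as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                completed.append(json.loads(line))
    return completed
```

Comparing `len(load_completed(...))` with the number of evals in the set gives a quick estimate of remaining work.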
Step 5: Aggregate and Analyze Results
After all evals complete, review the individual result logs in /tmp/evallogs/. Each eval run produces a separate JSONL file with its final report metrics. Aggregate these results to build a comprehensive picture of model performance across the eval set.
Key considerations:
- Results are spread across multiple JSONL files, one per eval
- Use programmatic analysis or third-party visualization tools
- Compare results across model versions for regression detection
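A minimal aggregation sketch is shown below. It assumes a particular log layout that this workflow does not spell out: that each JSONL log contains one record with a `spec` object (holding an `eval_name`) and one record with a `final_report` object. Treat those key names, and the helper `collect_final_reports`, as assumptions to verify against your own log files.

```python
import json
from pathlib import Path


def collect_final_reports(log_dir="/tmp/evallogs"):
    """Map each log file name to its eval name and final report, if present."""
    reports = {}
    for log_file in sorted(Path(log_dir).glob("*.jsonl")):
        eval_name = None
        final_report = None
        with log_file.open() as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                if not isinstance(record, dict):
                    continue
                # Assumed keys: "spec" -> {"eval_name": ...}, "final_report" -> {...}
                if isinstance(record.get("spec"), dict):
                    eval_name = record["spec"].get("eval_name")
                if "final_report" in record:
                    final_report = record["final_report"]
        if final_report is not None:
            # Key by file name so repeated runs of one eval do not overwrite each other.
            reports[log_file.name] = {"eval": eval_name, "report": final_report}
    return reports
```

Running this once per model version and diffing the resulting dictionaries is one simple way to spot regressions.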