Workflow:Openai Evals Running an eval set

From Leeroopedia
Knowledge Sources
Domains: LLM_Evaluation, Model_Testing, Batch_Processing
Last Updated: 2026-02-14 10:00 GMT

Overview

End-to-end process for running a batch of evaluations against a model using the oaievalset CLI tool, with built-in progress tracking and resume capability.

Description

This workflow covers executing multiple evaluations as a batch using the oaievalset CLI. Eval sets are defined in YAML files under evals/registry/eval_sets/ and group related evaluation tasks together (e.g., all Chinese number conversions, all logical reasoning tasks, or a comprehensive test suite). The CLI resolves each eval in the set, generates individual oaieval commands, tracks progress to a file, and supports resuming interrupted runs. This enables systematic benchmarking across multiple capabilities in a single session.

Usage

Execute this workflow when you need to benchmark a model across a suite of related evaluations rather than running them individually. This is ideal for comprehensive model assessments, regression testing across model versions, or evaluating a model across an entire capability domain (e.g., all multilingual evals, all logic evals).

Execution Steps

Step 1: Environment Setup

Ensure the evals package is installed and the OPENAI_API_KEY environment variable is set. Large eval sets can issue many API requests, so consider the cost implications and your rate limits before starting. If needed, configure the thread count and per-thread timeout via the EVALS_THREADS and EVALS_THREAD_TIMEOUT environment variables.

Key considerations:

  • Default is 10 threads per eval; each thread times out after 40 seconds
  • Increase thread count for faster execution, but respect rate limits
  • Increase timeout for evals with long prompts or responses
  • Be aware of API costs when running large eval sets
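The setup above can be sketched in Python as a small helper that prepares the environment for a later subprocess call. This is a minimal sketch, not part of the evals package: the function name build_eval_env and its parameters are hypothetical, and only the three environment variable names and the defaults (10 threads, 40 seconds) come from the text above.

```python
import os


def build_eval_env(threads=10, thread_timeout=40, base_env=None):
    """Return a copy of the environment with the evals threading settings applied.

    Hypothetical helper: defaults mirror the documented 10 threads / 40 seconds.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Fail early rather than letting every eval in the set fail on a missing key.
    if not env.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Environment variables are strings, so convert the numeric settings.
    env["EVALS_THREADS"] = str(threads)
    env["EVALS_THREAD_TIMEOUT"] = str(thread_timeout)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the CLI, which keeps the settings scoped to that run instead of changing the parent shell.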

Step 2: Select the Eval Set

Choose an eval set from the registry. Eval sets are defined in YAML files under evals/registry/eval_sets/ and contain a list of eval names (potentially with wildcard patterns). The framework ships with 18 pre-defined eval sets covering topics from Chinese numbers to maze solving to stock options analysis.

Key considerations:

  • Eval set names map to YAML files in evals/registry/eval_sets/
  • Each set groups semantically related evaluation tasks
  • Custom eval sets can be created by adding new YAML files
  • Wildcard patterns in eval sets can match multiple eval variants
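To illustrate how a wildcard entry in an eval set can expand to several evals, here is a minimal sketch of glob-style matching. It assumes that "*" matches any run of characters and that every other character is literal; the registry's actual resolution logic may differ. The function names and the eval names in the usage example are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Compile an eval-name pattern where '*' matches any run of characters."""
    # Escape everything first, then turn the escaped '*' back into '.*'.
    return re.compile("^" + re.escape(pattern).replace(r"\*", ".*") + "$")


def expand_patterns(patterns, registered_names):
    """Return the registered names matched by the patterns, in order, without duplicates."""
    matched = []
    for pattern in patterns:
        regex = pattern_to_regex(pattern)
        for name in registered_names:
            if regex.match(name) and name not in matched:
                matched.append(name)
    return matched


# Hypothetical eval names, for illustration only.
names = ["logic-a.dev.v0", "logic-b.dev.v0", "maze.dev.v0"]
print(expand_patterns(["logic-*"], names))
```

A pattern without "*" matches only the identically named eval, so literal entries and wildcard entries can be handled by the same code path.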

Step 3: Execute the Eval Set

Run the oaievalset command with the model name and eval set name. The CLI resolves all evals in the set, generates oaieval sub-commands for each, and executes them sequentially. Progress is tracked in a file at /tmp/oaievalset/{model}.{eval_set}.progress.txt, enabling automatic resume if the run is interrupted.

Key considerations:

  • An individual eval cannot be resumed from the middle; an eval interrupted partway restarts from its beginning
  • Progress tracking allows re-running the command to pick up where it left off
  • Delete the progress file to restart from the beginning
  • Additional oaieval arguments can be passed through to individual runs
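The invocation and the location of its progress file can be sketched as follows. This is an illustrative sketch: the helper names are hypothetical, the model name and eval set name are placeholders, and the pass-through flag --max_samples is shown on the assumption that it is accepted by oaieval in your installed version.

```python
import shlex
from pathlib import Path


def progress_file(model, eval_set):
    """Path of the progress file, following the pattern described above."""
    return Path("/tmp/oaievalset") / f"{model}.{eval_set}.progress.txt"


def build_evalset_command(model, eval_set, extra_args=()):
    """Argument list for oaievalset; extra_args are passed through to each oaieval run."""
    return ["oaievalset", model, eval_set, *extra_args]


# Placeholder model and eval set names; --max_samples is an assumed oaieval flag.
cmd = build_evalset_command("gpt-3.5-turbo", "test", ["--max_samples", "10"])
print(shlex.join(cmd))
print(progress_file("gpt-3.5-turbo", "test").as_posix())
```

Building the command as a list rather than a single string avoids shell quoting issues when it is handed to subprocess.run. To force a fresh start, delete the file returned by progress_file before launching.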

Step 4: Monitor and Resume

Monitor the sequential execution of each eval in the set. If the run is interrupted (crash, manual stop, or error), simply re-run the same oaievalset command to resume from the last completed eval. The progress file records completed eval commands, and subsequent runs skip them automatically.

Key considerations:

  • Use --no-exit-on-error to continue the set even if individual evals fail
  • The progress file is plain-text JSONL, with one completed command recorded per line
  • Each eval produces its own JSONL log file in /tmp/evallogs/
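The resume behavior described above can be approximated with a short sketch that reads the progress file and filters out commands that have already completed. It assumes each line of the progress file is a JSON-encoded list of command arguments; the function names are hypothetical and this is not the CLI's own implementation.

```python
import json
from pathlib import Path


def parse_progress_lines(lines):
    """Decode one completed command per non-empty JSONL line."""
    completed = []
    for line in lines:
        line = line.strip()
        if line:
            completed.append(json.loads(line))
    return completed


def load_completed(progress_path):
    """Return completed commands, or an empty list if no progress file exists yet."""
    path = Path(progress_path)
    if not path.exists():
        return []
    with path.open() as f:
        return parse_progress_lines(f)


def remaining_commands(planned, completed):
    """Commands still to run: everything planned that is not recorded as completed."""
    return [cmd for cmd in planned if cmd not in completed]
```

Inspecting the file this way is also a quick check of how far an interrupted run got before deciding whether to resume or to delete the file and start over.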

Step 5: Aggregate and Analyze Results

After all evals complete, review the individual result logs in /tmp/evallogs/. Each eval run produces a separate JSONL file with its final report metrics. Aggregate these results to build a comprehensive picture of model performance across the eval set.

Key considerations:

  • Results are spread across multiple JSONL files, one per eval
  • Analyze the logs programmatically, or load them into a third-party visualization tool
  • Compare results across model versions for regression detection
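Aggregation across the per-eval logs can be sketched as below. The sketch assumes that each log contains one record with a "spec" object holding an "eval_name" field and one record with a "final_report" object holding the metrics; these key names are assumptions about the log layout and should be checked against your own files. The function names are hypothetical.

```python
import json
from pathlib import Path


def extract_report(lines):
    """Return (eval_name, final_report) found in one log's lines; either may be None."""
    eval_name, report = None, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not isinstance(record, dict):
            continue
        # Assumed keys: "spec" -> {"eval_name": ...}, "final_report" -> {metric: value}
        if isinstance(record.get("spec"), dict):
            eval_name = record["spec"].get("eval_name")
        if "final_report" in record:
            report = record["final_report"]
    return eval_name, report


def collect_reports(log_dir="/tmp/evallogs"):
    """Map each eval name (or file name as fallback) to its final report."""
    results = {}
    for path in sorted(Path(log_dir).glob("*.jsonl")):
        with path.open() as f:
            name, report = extract_report(f)
        if report is not None:
            results[name or path.name] = report
    return results
```

Running collect_reports once per model version yields two dictionaries that can be compared key by key for regression detection. Logs without a final report (for example, from an interrupted eval) are skipped rather than reported as zeros.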
