Workflow:Openai Evals Running a single eval
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Model_Testing |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
End-to-end process for running one registered eval against a language model or completion function with the oaieval CLI tool.
Description
This workflow covers the standard procedure for executing a single model evaluation from the command line. The oaieval CLI parses its arguments to identify the completion function (a model or a custom function) and the eval name, then resolves both through the YAML-driven registry system. It instantiates the appropriate Eval subclass, runs all samples in parallel on a thread pool, records events through the recorder infrastructure, and outputs a final metrics report, with the event log written as JSON Lines. This is the primary "happy path" for anyone who wants to evaluate an LLM against one of the 358 community-contributed evaluation tasks or against a custom eval.
Usage
Execute this workflow when you want to evaluate a specific model (e.g., gpt-3.5-turbo, gpt-4) or a custom completion function against a single evaluation task from the registry. You should have an OpenAI API key configured and the evals package installed. Use this when you need a quick, focused assessment of model performance on a specific capability (e.g., math, translation, logic).
Execution Steps
Step 1: Environment Setup
Ensure the Python environment has the evals package installed and the OPENAI_API_KEY environment variable is set. The package requires Python 3.9 or later. If evaluating against data stored with Git-LFS, ensure that the eval data files have been fetched and pulled.
Key considerations:
- Install via `pip install evals` for running only, or `pip install -e .` for development
- Set OPENAI_API_KEY as an environment variable
- Fetch Git-LFS data if evaluating against registry data files
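
The prerequisites above can be checked before launching a run. The following is a minimal sketch, not part of the evals package: `check_environment` is a hypothetical helper name, and it only verifies the Python version, the presence of OPENAI_API_KEY, and that a package named evals is importable.

```python
import importlib.util
import os
import sys


def check_environment(env=None, min_version=(3, 9)):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # The workflow states that Python 3.9 or later is required.
    if sys.version_info < min_version:
        problems.append("Python %d.%d or later is required" % min_version)
    # An empty or missing key is treated as "not set".
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # find_spec returns None when the top-level package cannot be located.
    if importlib.util.find_spec("evals") is None:
        problems.append("the evals package is not importable")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("missing:", problem)
```

The check does not cover Git-LFS data; whether the registry data files have been pulled still has to be confirmed separately.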
Step 2: Identify the Completion Function
Select the model or completion function to evaluate. This can be a direct OpenAI model name (e.g., gpt-3.5-turbo), which is dynamically instantiated as a CompletionFn, or a registered completion function key from the YAML files in the registry. Multiple completion functions can be specified as a comma-separated list.
Key considerations:
- Chat models (gpt-3.5-turbo, gpt-4) are auto-detected and wrapped in OpenAIChatCompletionFn
- Non-chat models use OpenAICompletionFn
- Custom completion functions must be registered in evals/registry/completion_fns/
- Solver-based completion functions are also supported via the registry
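
To make the selection rules above concrete, here is an illustrative sketch of how a comma-separated argument could be split and each name classified. It is not the registry's actual code: `CHAT_MODEL_PREFIXES`, `split_completion_fns`, `classify_completion_fn`, the registry key `my-solver`, and the order of the checks are all assumptions made for the example.

```python
# Hypothetical prefixes for illustration; the evals package has its own
# chat-model detection, which may use a different rule.
CHAT_MODEL_PREFIXES = ("gpt-3.5-turbo", "gpt-4")


def split_completion_fns(arg):
    """Split the comma-separated CLI argument into individual names."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def classify_completion_fn(name, registered_keys=frozenset()):
    """Guess which kind of completion function a name would map to."""
    # Assumption: registered keys take precedence over model-name detection.
    if name in registered_keys:
        return "registry entry"
    if name.startswith(CHAT_MODEL_PREFIXES):
        return "OpenAIChatCompletionFn"
    # Simplification: every other name is treated as a non-chat model here.
    return "OpenAICompletionFn"


if __name__ == "__main__":
    # "my-solver" stands in for a key registered under
    # evals/registry/completion_fns/; it is not a real entry.
    for name in split_completion_fns("gpt-4, my-solver, text-davinci-003"):
        print(name, "->", classify_completion_fn(name, {"my-solver"}))
```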
Step 3: Select the Eval
Choose the evaluation task to run by its registered name. Eval names follow the convention eval_name.split.version (e.g., arithmetic.dev.match-v1, where arithmetic is the eval name, dev is the split, and match-v1 is the version). The registry resolves the name through YAML alias chains to find the concrete eval spec, which identifies the Eval class and its arguments (including the dataset path).
Key considerations:
- Valid eval names are defined in YAML files under evals/registry/evals/
- Eval aliases allow shorthand names that dereference to full specs
- The eval spec contains the class path, dataset location, and evaluation parameters
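
The alias dereferencing described above can be pictured with a small sketch. The dictionary below only imitates the shape of the YAML files under evals/registry/evals/; its entries, the dataset path, and the helper names `split_eval_name` and `resolve_eval` are hypothetical, and the real registry code may behave differently.

```python
# Hypothetical registry contents for illustration only. An alias entry
# points at another name through "id"; a concrete spec carries "class".
REGISTRY = {
    "arithmetic": {"id": "arithmetic.dev.match-v1", "metrics": ["accuracy"]},
    "arithmetic.dev.match-v1": {
        "class": "evals.elsuite.basic.match:Match",
        "args": {"samples_jsonl": "arithmetic/samples.jsonl"},
    },
}


def split_eval_name(full_name):
    """Split eval_name.split.version into its three parts."""
    eval_name, split, version = full_name.split(".", 2)
    return eval_name, split, version


def resolve_eval(name, registry):
    """Follow alias entries until an entry with a class path is found."""
    seen = set()
    while True:
        if name in seen:
            raise ValueError(f"alias cycle at {name!r}")
        seen.add(name)
        entry = registry[name]  # raises KeyError for an unknown eval name
        if "class" in entry:
            return name, entry
        name = entry["id"]


if __name__ == "__main__":
    full_name, spec = resolve_eval("arithmetic", REGISTRY)
    print(full_name, spec["class"], spec["args"])
    print(split_eval_name(full_name))
```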
Step 4: Execute the Eval via CLI
Run the oaieval command with the chosen completion function and eval name. The CLI builds a run configuration, instantiates a Recorder (local JSONL by default), creates the Eval class instance, and calls its run method. The eval loads dataset samples and dispatches them to eval_sample in parallel via a thread pool (default 10 threads).
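
The parallel dispatch described above follows a common pattern, sketched below with the standard library. This is an illustration rather than the package's implementation: `run_samples` is a hypothetical name, the `eval_sample` argument is any callable taking one sample, and the only details taken from this workflow are the EVALS_THREADS variable and the default of 10 threads.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_samples(samples, eval_sample, threads=None):
    """Apply eval_sample to every sample on a thread pool, keeping order."""
    if threads is None:
        # Fall back to 10 threads when EVALS_THREADS is not set.
        threads = int(os.environ.get("EVALS_THREADS", "10"))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Executor.map yields results in input order, even when the
        # underlying calls finish out of order.
        return list(pool.map(eval_sample, samples))


if __name__ == "__main__":
    # A stand-in for a real eval_sample that would call the model.
    print(run_samples(["2+2", "3*3"], lambda prompt: len(prompt), threads=2))
```

Threads suit this workload because each sample mostly waits on a network call, so more threads generally raise throughput until API rate limits become the bottleneck.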
Key considerations:
- Use --max_samples to limit the number of samples for quick testing
- Use --record_path to specify a custom output location
- Threading is configurable via the EVALS_THREADS environment variable
- The --dry-run flag runs without making API calls
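
A small wrapper can assemble the command line from the options listed above. This is a sketch under assumptions: `build_oaieval_command` and `build_env` are hypothetical helpers, the record path is an example value, and the command is only printed here, not executed.

```python
import os
import shlex


def build_oaieval_command(completion_fn, eval_name, max_samples=None,
                          record_path=None):
    """Build the argv list for an oaieval run."""
    argv = ["oaieval", completion_fn, eval_name]
    if max_samples is not None:
        argv += ["--max_samples", str(max_samples)]
    if record_path is not None:
        argv += ["--record_path", record_path]
    return argv


def build_env(threads=None, base=None):
    """Copy the environment, optionally overriding EVALS_THREADS."""
    env = dict(os.environ if base is None else base)
    if threads is not None:
        env["EVALS_THREADS"] = str(threads)
    return env


if __name__ == "__main__":
    argv = build_oaieval_command(
        "gpt-3.5-turbo",
        "arithmetic.dev.match-v1",
        max_samples=20,
        record_path="/tmp/evallogs/arithmetic-smoke.jsonl",  # example path
    )
    print(shlex.join(argv))
    # To launch the run instead of printing it:
    #   import subprocess
    #   subprocess.run(argv, env=build_env(threads=4), check=True)
```

Capping the sample count for a first run keeps cost and latency low while confirming that the eval name, completion function, and credentials all resolve.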
Step 5: Review Results
After execution completes, the CLI prints a final report with metrics (e.g., accuracy). Detailed event logs are written to a JSONL file at the record path (default /tmp/evallogs/). Each line in the log represents an event: sampling calls, match results, and the final summary. Token usage statistics are aggregated from sampling events when available.
Key considerations:
- The final report shows the metrics defined by the eval (e.g., accuracy, precision, or recall)
- JSONL logs contain per-sample details for deep analysis
- Third-party tools can be used to visualize the logs
- Token usage is reported when the API returns usage data
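
For a first pass over a log file, a short script can recompute accuracy from the match events and pick out the final report. The sketch below assumes a particular record shape: one JSON object per line, a `final_report` key on the summary record, and match events carrying `"type": "match"` with a boolean `data.correct`. Check these field names against an actual log before relying on them; `summarize_log` is a hypothetical helper.

```python
import json


def summarize_log(lines):
    """Summarize an iterable of JSONL lines from an eval run."""
    final_report = None
    matches = 0
    correct = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if "final_report" in record:
            final_report = record["final_report"]
        elif record.get("type") == "match":
            matches += 1
            if record.get("data", {}).get("correct"):
                correct += 1
    # Accuracy is undefined when the log holds no match events.
    accuracy = correct / matches if matches else None
    return {
        "final_report": final_report,
        "matches": matches,
        "correct": correct,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py /tmp/evallogs/<run file>.jsonl
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as handle:
            print(summarize_log(handle))
```

Comparing the recomputed accuracy with the value in the final report is a quick consistency check, especially after a run limited with --max_samples.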