# Heuristic: OpenAI Evals Eval Resumption Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Debugging |
| Last Updated | 2026-02-14 10:00 GMT |
## Overview

Strategy for resuming interrupted eval set runs using the built-in progress-tracking file stored under `/tmp/oaievalset/`.
## Description
The `oaievalset` CLI tracks completed evaluations in a progress file stored at `/tmp/oaievalset/{model}.{eval_set}.progress.txt`. When an eval set run is interrupted (crash, user stop, network failure), rerunning the same `oaievalset` command will automatically skip already-completed evals and resume from where it left off. This is a built-in feature of the `Progress` class in `evals/cli/oaievalset.py`. However, individual evals within the set cannot be resumed mid-execution; they must restart from the beginning.
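Before rerunning, you can inspect the progress file to see which evals the framework will skip. The sketch below is illustrative only: `progress_path` and `list_completed` are hypothetical helpers, not part of the `evals` package, and they assume nothing beyond the path pattern and one-eval-per-line format described above.

```python
import os


def progress_path(model: str, eval_set: str, root: str = "/tmp/oaievalset") -> str:
    # Path pattern used by oaievalset's progress tracking:
    # {root}/{model}.{eval_set}.progress.txt
    return os.path.join(root, f"{model}.{eval_set}.progress.txt")


def list_completed(model: str, eval_set: str, root: str = "/tmp/oaievalset") -> set[str]:
    # Returns the evals already recorded as finished
    # (empty set if no run has been started or completed anything yet).
    path = progress_path(model, eval_set, root)
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}
```

Anything returned by `list_completed` will be skipped on the next run of the same `oaievalset` command.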
## Usage
Use this heuristic when running large eval sets that may take extended periods, when running evals on unreliable infrastructure, or when you need to stop and restart eval runs without losing progress.
## The Insight (Rule of Thumb)
- Action: Simply re-run the same `oaievalset` command after an interruption.
- Value: The framework automatically detects completed evals via the progress file.
- Trade-off: Individual evals that were in-progress when the interruption occurred will restart from the beginning. Keep individual evals quick to minimize wasted work.
- Action: To force a fresh start, delete the progress file at `/tmp/oaievalset/{model}.{eval_set}.progress.txt`.
- Value: N/A (boolean decision).
- Trade-off: All completed evals will be re-run, consuming additional time and API credits.
- Action: Keep individual evals short so that re-running a single interrupted eval has minimal cost.
- Value: The documentation explicitly advises: "try to keep your individual evals quick to run."
- Trade-off: Short evals mean fewer samples per eval, potentially requiring more eval configs to maintain coverage.
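The fresh-start action above can be scripted rather than done by hand. This is a minimal sketch; `reset_progress` is a hypothetical helper (not an `evals` API) that assumes only the documented progress-file path pattern.

```python
import os


def reset_progress(model: str, eval_set: str, root: str = "/tmp/oaievalset") -> bool:
    # Deletes the progress file so the next oaievalset run starts from scratch.
    # Returns True if a file was removed, False if there was nothing to reset.
    path = os.path.join(root, f"{model}.{eval_set}.progress.txt")
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False
```

Returning a boolean instead of raising makes the helper idempotent, so it is safe to call even when no prior run exists.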
## Reasoning
Eval sets can contain dozens or hundreds of individual evaluations, each involving multiple API calls. Running an entire eval set against a large model can take hours. Without resumption capability, any interruption would require restarting the entire set from scratch, wasting significant time and API costs. The `Progress` class solves this at the eval-set level by recording which evals completed successfully. However, the framework does not checkpoint within a single eval execution, so keeping individual evals short minimizes the blast radius of interruptions.
## Code Evidence
Progress tracking from `evals/cli/oaievalset.py:17-40`:
```python
class Progress:
    def __init__(self, file: str):
        self.file = file
        self.completed: set[str] = set()
        if os.path.exists(file):
            with open(file, "r") as f:
                for line in f:
                    self.completed.add(line.strip())

    def add(self, item: str):
        self.completed.add(item)
        with open(self.file, "a") as f:
            f.write(item + "\n")
```
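To see the resumption mechanics in isolation, the `Progress` pattern above can be exercised directly: a second instance pointed at the same file rehydrates the completed set from disk, which is how a rerun knows which evals to skip. The runner loop below is an assumption for illustration, not the actual `oaievalset` code.

```python
import os
import tempfile


class Progress:
    """Self-contained mirror of the Progress class shown above."""

    def __init__(self, file: str):
        self.file = file
        self.completed: set[str] = set()
        if os.path.exists(file):
            with open(file, "r") as f:
                for line in f:
                    self.completed.add(line.strip())

    def add(self, item: str):
        self.completed.add(item)
        with open(self.file, "a") as f:
            f.write(item + "\n")


# First run: two evals finish, then the run is interrupted.
path = os.path.join(tempfile.mkdtemp(), "gpt-4.demo.progress.txt")
first = Progress(path)
first.add("eval-a")
first.add("eval-b")

# Rerun: a fresh Progress instance reloads the file, so already-completed
# evals are filtered out and only the remaining work is executed.
resumed = Progress(path)
todo = [e for e in ["eval-a", "eval-b", "eval-c"] if e not in resumed.completed]
print(todo)  # ['eval-c']
```

Because `add` appends one line per completed eval, interrupting the process between evals loses at most the eval that was in flight, which is exactly the trade-off described above.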
Documentation from `docs/run-evals.md:38-40`:
> If you have to stop your run or your run crashes, we've got you covered! oaievalset records the evals that finished in `/tmp/oaievalset/{model}.{eval_set}.progress.txt`. You can simply rerun the command to pick up where you left off. Unfortunately, you can't resume a single eval from the middle. You'll have to restart from the beginning, so try to keep your individual evals quick to run.