
Heuristic:OpenAI Evals Eval Resumption Strategy

From Leeroopedia
Knowledge Sources
Domains: Optimization, Debugging
Last Updated: 2026-02-14 10:00 GMT

Overview

Strategy for resuming interrupted eval set runs using the built-in progress tracking file at `/tmp/oaievalset/`.

Description

The `oaievalset` CLI tracks completed evaluations in a progress file stored at `/tmp/oaievalset/{model}.{eval_set}.progress.txt`. When an eval set run is interrupted (crash, user stop, network failure), rerunning the same `oaievalset` command will automatically skip already-completed evals and resume from where it left off. This is a built-in feature of the `Progress` class in `evals/cli/oaievalset.py`. However, individual evals within the set cannot be resumed mid-execution; they must restart from the beginning.
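Before rerunning, you can inspect the progress file yourself to see how much work will be skipped. A minimal sketch, assuming only the path convention described above (the model and eval-set names here are placeholders, and `load_completed` is a hypothetical helper, not part of the evals library):

```python
import os

def load_completed(progress_path: str) -> set[str]:
    """Read the set of finished eval names from an oaievalset progress file.

    Returns an empty set when no progress file exists (i.e., a fresh run).
    """
    if not os.path.exists(progress_path):
        return set()
    with open(progress_path) as f:
        return {line.strip() for line in f if line.strip()}

# Placeholder model/eval-set names following the documented path pattern.
path = "/tmp/oaievalset/gpt-4.test-all.progress.txt"
completed = load_completed(path)
```

Each line in the file is one completed eval; rerunning the same `oaievalset` command skips exactly this set.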

Usage

Use this heuristic when running large eval sets that may take extended periods, when running evals on unreliable infrastructure, or when you need to stop and restart eval runs without losing progress.

The Insight (Rule of Thumb)

  • Action: Simply re-run the same `oaievalset` command after an interruption.
  • Value: The framework automatically detects completed evals via the progress file.
  • Trade-off: Individual evals that were in-progress when the interruption occurred will restart from the beginning. Keep individual evals quick to minimize wasted work.

  • Action: To force a fresh start, delete the progress file at `/tmp/oaievalset/{model}.{eval_set}.progress.txt`.
  • Value: N/A (boolean decision).
  • Trade-off: All completed evals will be re-run, consuming additional time and API credits.

  • Action: Keep individual evals short so that re-running a single interrupted eval has minimal cost.
  • Value: The documentation explicitly advises: "try to keep your individual evals quick to run."
  • Trade-off: Short evals mean fewer samples per eval, potentially requiring more eval configs to maintain coverage.
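The "force a fresh start" action above amounts to deleting one file. As a small sketch (the `reset_progress` helper is hypothetical; the `base_dir` parameter exists only to make the example testable, the real location is `/tmp/oaievalset`):

```python
import os

def reset_progress(model: str, eval_set: str,
                   base_dir: str = "/tmp/oaievalset") -> bool:
    """Delete the oaievalset progress file so the next run starts fresh.

    Returns True if a progress file was removed, False if there was
    nothing to reset (no prior run, or already reset).
    """
    path = os.path.join(base_dir, f"{model}.{eval_set}.progress.txt")
    if os.path.exists(path):
        os.remove(path)
        return True
    return False
```

Equivalently, `rm /tmp/oaievalset/{model}.{eval_set}.progress.txt` from a shell. Remember that this discards all completion records, so every eval in the set will run again.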

Reasoning

Eval sets can contain dozens or hundreds of individual evaluations, each involving multiple API calls. Running an entire eval set against a large model can take hours. Without resumption capability, any interruption would require restarting the entire set from scratch, wasting significant time and API costs. The `Progress` class solves this at the eval-set level by recording which evals completed successfully. However, the framework does not checkpoint within a single eval execution, so keeping individual evals short minimizes the blast radius of interruptions.

Code Evidence

Progress tracking from `evals/cli/oaievalset.py:17-40`:

import os  # imported near the top of the module; included so the snippet stands alone

class Progress:
    def __init__(self, file: str):
        self.file = file
        self.completed: set[str] = set()
        if os.path.exists(file):
            with open(file, "r") as f:
                for line in f:
                    self.completed.add(line.strip())

    def add(self, item: str):
        self.completed.add(item)
        with open(self.file, "a") as f:
            f.write(item + "\n")
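To make the resume behavior concrete, here is a self-contained sketch of how a driver loop can consume this class. The `Progress` class is restated from the snippet above; `run_eval_set` and the skip/record logic are an illustration, not the actual `oaievalset` runner:

```python
import os

class Progress:
    """Restated from evals/cli/oaievalset.py so this sketch is self-contained."""
    def __init__(self, file: str):
        self.file = file
        self.completed: set[str] = set()
        if os.path.exists(file):
            with open(file, "r") as f:
                for line in f:
                    self.completed.add(line.strip())

    def add(self, item: str):
        self.completed.add(item)
        with open(self.file, "a") as f:
            f.write(item + "\n")

def run_eval_set(progress: Progress, eval_names: list[str], run_eval) -> list[str]:
    """Hypothetical driver: run each eval, skipping those already recorded.

    Returns the names of evals actually executed in this invocation.
    """
    ran = []
    for name in eval_names:
        if name in progress.completed:
            continue  # finished in a previous run; skipped on resume
        run_eval(name)
        progress.add(name)  # recorded only after the eval completes
        ran.append(name)
    return ran
```

Because `add` appends to the file immediately after each eval finishes, a crash mid-set loses at most the one eval that was in flight, which is exactly why the documentation advises keeping individual evals quick.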

Documentation from `docs/run-evals.md:38-40`:

If you have to stop your run or your run crashes, we've got you covered!
oaievalset records the evals that finished in
/tmp/oaievalset/{model}.{eval_set}.progress.txt.
You can simply rerun the command to pick up where you left off.
Unfortunately, you can't resume a single eval from the middle.
You'll have to restart from the beginning, so try to keep your
individual evals quick to run.
