
Heuristic:OpenAI Evals Thread Tuning

From Leeroopedia
Domains Optimization, LLM_Evaluation
Last Updated 2026-02-14 10:00 GMT

Overview

A performance-tuning technique that uses the `EVALS_THREADS`, `EVALS_THREAD_TIMEOUT`, and `EVALS_SEQUENTIAL` environment variables to balance eval speed against API rate limits and debugging needs.

Description

The OpenAI Evals framework uses a `ThreadPool` from Python's `multiprocessing.pool` to execute eval samples in parallel. Three environment variables control this behavior: `EVALS_THREADS` sets the number of concurrent threads (default 10), `EVALS_THREAD_TIMEOUT` sets the per-thread timeout in seconds (default 40), and `EVALS_SEQUENTIAL` forces sequential execution when set to `1`. Tuning these values is critical for balancing throughput against API rate limits, debugging needs, and cost management.
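The parsing rules described above can be reproduced in a few lines. This is a minimal sketch of how the framework reads the three variables (the actual reads live in `evals/eval.py` and `evals/utils/api_utils.py`, quoted in Code Evidence below):

```python
import os

# Parse the three tuning variables with the framework's documented defaults.
threads = int(os.environ.get("EVALS_THREADS", "10"))                  # concurrent worker threads
thread_timeout = float(os.environ.get("EVALS_THREAD_TIMEOUT", "40"))  # seconds per sample
sequential = os.environ.get("EVALS_SEQUENTIAL", "0") in {"1", "true", "yes"}

print(threads, thread_timeout, sequential)
```

Note that `EVALS_THREADS` is parsed as an int and `EVALS_THREAD_TIMEOUT` as a float, so fractional timeouts are accepted but fractional thread counts are not.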

Usage

Use this heuristic when you need to speed up eval execution, when eval runs hit API rate limit errors or timeout failures, or when you need deterministic execution for debugging. It is particularly relevant for large-scale eval sets run via `oaievalset`.

The Insight (Rule of Thumb)

  • Action: Tune `EVALS_THREADS` based on your OpenAI API rate limit tier.
  • Value: Default is `10` threads. Increase to `20-42` for higher rate limit tiers; decrease to `1-5` for lower tiers or when debugging.
  • Trade-off: More threads = faster execution but higher risk of rate limit errors and increased API costs.
  • Action: Increase `EVALS_THREAD_TIMEOUT` for evals with long prompts or responses.
  • Value: Default is `40` seconds. Set to `120-600` for complex evals.
  • Trade-off: Higher timeout = fewer false failures but slower detection of actual stuck threads.
  • Action: Set `EVALS_SEQUENTIAL=1` when debugging or using interactive solvers.
  • Value: `1`, `true`, or `yes` to enable; default `0` for parallel mode.
  • Trade-off: Sequential mode is dramatically slower but essential for debugging, human-in-the-loop evals, and providers with known threading issues (e.g., the Google Gemini solver).
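The throughput trade-off behind these rules can be seen with a toy benchmark. Here `fake_api_call` is a hypothetical stand-in for an eval sample's API latency; the two code paths mirror the sequential (`map`) and threaded (`imap_unordered`) branches in the framework:

```python
import time
from multiprocessing.pool import ThreadPool

def fake_api_call(sample):
    """Stand-in for an eval sample's API call (~50 ms of simulated latency)."""
    time.sleep(0.05)
    return sample

samples = list(range(10))

# EVALS_SEQUENTIAL=1 behaviour: one sample at a time.
start = time.time()
seq_results = list(map(fake_api_call, samples))
sequential_s = time.time() - start

# Default behaviour: EVALS_THREADS workers in parallel.
start = time.time()
with ThreadPool(10) as pool:
    threaded_results = list(pool.imap_unordered(fake_api_call, samples))
threaded_s = time.time() - start

print(f"sequential: {sequential_s:.2f}s, 10 threads: {threaded_s:.2f}s")
```

With 10 threads and 10 samples the threaded run finishes in roughly one sample's latency instead of ten, which is the speedup the real framework trades against rate-limit pressure.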

Reasoning

The eval framework's `ThreadPool` dispatches each eval sample as an independent unit of work. Each sample typically involves one or more API calls. At scale (hundreds or thousands of samples), the default 10 threads provide a 10x speedup over sequential execution, but this can overwhelm API rate limits. The official documentation explicitly warns: "Running with more threads will make the eval faster, though keep in mind the costs and your rate limits." The exponential backoff retry mechanism (`evals/utils/api_utils.py`) handles transient failures, but sustained rate limit violations require reducing thread count.
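The retry behaviour referenced above can be sketched generically. This is illustrative only, not the code in `evals/utils/api_utils.py`, and `request_with_backoff` is a hypothetical name:

```python
import random
import time

def request_with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` with exponential backoff and jitter on transient errors.

    Illustrative sketch of the pattern; the framework's actual retry
    logic lives in evals/utils/api_utils.py.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # sustained failures surface to the caller
            # Double the wait each attempt, plus jitter to avoid
            # all threads retrying in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

The key limitation this sketch shares with the real mechanism: backoff absorbs transient rate-limit spikes, but if every thread is persistently throttled, the only fix is lowering `EVALS_THREADS`.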

The timeout mechanism is necessary because individual API calls can hang indefinitely. The 40-second default is tuned for typical short-prompt evals. Long-context evaluations (e.g., document summarization, multi-turn conversations) may legitimately need more time per sample.
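The timeout mechanism can be illustrated with `multiprocessing.pool`'s own primitives: `IMapIterator.next(timeout=...)` raises `multiprocessing.TimeoutError` when a result does not arrive in time. This is a sketch of the idea, not the framework's actual restart logic:

```python
import multiprocessing
import time
from multiprocessing.pool import ThreadPool

def slow_sample(seconds):
    """Stand-in for an eval sample whose API call takes `seconds`."""
    time.sleep(seconds)
    return seconds

results, timed_out = [], 0
with ThreadPool(2) as pool:
    it = pool.imap_unordered(slow_sample, [0.01, 2.0])
    for _ in range(2):
        try:
            # Analogous to EVALS_THREAD_TIMEOUT: wait at most 0.5 s per result.
            results.append(it.next(timeout=0.5))
        except multiprocessing.TimeoutError:
            timed_out += 1  # the fast sample arrives; the slow one times out

print(results, timed_out)
```

A timeout set too low misclassifies legitimately slow samples as stuck, which is why long-context evals warrant raising `EVALS_THREAD_TIMEOUT` toward the `120-600` range.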

Sequential mode is mandatory for:

  • Human CLI solvers (bluff eval enforces this with a `ValueError`)
  • Google Gemini solver (known threading issue, forced in tests)
  • Debugging (deterministic execution order, clearer log output)

Code Evidence

Thread configuration from `evals/eval.py:124-146`:

threads = int(os.environ.get("EVALS_THREADS", "10"))
show_progress = bool(os.environ.get("EVALS_SHOW_EVAL_PROGRESS", show_progress))
# ...
with ThreadPool(threads) as pool:
    if os.environ.get("EVALS_SEQUENTIAL", "0") in {"1", "true", "yes"}:
        logger.info("Running in sequential mode!")
        iter = map(eval_sample, work_items)
    else:
        logger.info(f"Running in threaded mode with {threads} threads!")
        iter = pool.imap_unordered(eval_sample, work_items)

Thread timeout from `evals/utils/api_utils.py:6`:

EVALS_THREAD_TIMEOUT = float(os.environ.get("EVALS_THREAD_TIMEOUT", "40"))

Documentation excerpt from `docs/run-evals.md:31-36`:

By default we run with 10 threads, and each thread times out and restarts
after 40 seconds. You can configure this, e.g.,
EVALS_THREADS=42 EVALS_THREAD_TIMEOUT=600 oaievalset gpt-3.5-turbo test
Running with more threads will make the eval faster, though keep in mind
the costs and your rate limits.

Mandatory sequential mode for human CLI from `evals/elsuite/bluff/eval.py:182-183`:

if os.environ.get("EVALS_SEQUENTIAL") != "1":
    raise ValueError("human_cli player is available only with EVALS_SEQUENTIAL=1")
