Heuristic:Run llama Llama index Batch Eval Retry Strategy

From Leeroopedia
Knowledge Sources
Domains Evaluation, Optimization
Last Updated 2026-02-11 19:00 GMT

Overview

A retry strategy with exponential backoff that lets batch evaluation workers recover from transient API failures during LLM-as-judge evaluation.

Description

The `BatchEvalRunner` wraps all evaluation and query workers with tenacity's `@retry` decorator using exponential backoff. Each worker makes up to 3 attempts (the initial call plus at most 2 retries), waiting between 4 and 10 seconds before each retry. Combined with semaphore-based concurrency limiting (default 2 workers), this makes the evaluation pipeline resilient to rate limits and transient API errors.

Usage

This heuristic is built into the `BatchEvalRunner` and applies automatically. It is worth understanding when:

  • Debugging evaluation failures or timeouts
  • Tuning the number of concurrent evaluation workers
  • Experiencing frequent API rate limit errors during batch evaluation

The Insight (Rule of Thumb)

  • Action: The retry decorator is pre-configured on all three worker types: `eval_response_worker`, `eval_worker`, and `response_worker`.
  • Value: 3 attempts, exponential backoff with multiplier=1, min=4s, max=10s wait.
  • Workers: Default 2 concurrent workers (via semaphore), lower than the general async default of 4.
  • Trade-off: Retries add latency when API errors occur but prevent complete evaluation failure. Conservative worker count (2) reduces rate limit risk but slows total evaluation time.
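
The settings listed above can be illustrated without tenacity. The following is a minimal standard-library sketch, not the `BatchEvalRunner` implementation: `retry_async`, `fake_sleep`, and `flaky_judge_call` are hypothetical names, and the wait formula (multiplier times 2^(attempt-1), clamped to the min/max bounds) is an assumption about how the backoff is computed. The sleep function is injectable so the example runs instantly.

```python
import asyncio
import functools


def retry_async(attempts=3, multiplier=1.0, min_wait=4.0, max_wait=10.0,
                sleep=asyncio.sleep):
    """Hypothetical stand-in for a tenacity-style retry decorator."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        # Mirrors reraise=True: surface the original error.
                        raise
                    # Assumed backoff formula, clamped to [min_wait, max_wait].
                    raw = multiplier * 2 ** (attempt - 1)
                    await sleep(max(min_wait, min(raw, max_wait)))
        return wrapper
    return decorator


waits = []  # records requested waits instead of actually sleeping


async def fake_sleep(seconds):
    waits.append(seconds)


calls = {"n": 0}


@retry_async(sleep=fake_sleep)
async def flaky_judge_call():
    """Fails twice with a transient error, then succeeds on the third attempt."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = asyncio.run(flaky_judge_call())
print(result, calls["n"], waits)
```

In this sketch the third attempt succeeds, so the caller never sees the two earlier failures; had the third attempt also failed, the `ConnectionError` would propagate to the caller.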

Reasoning

Why 3 attempts: Most transient API errors (rate limits, timeouts, 5xx responses) resolve within 1-2 retries. Three attempts (one initial call plus two retries) provide good reliability without excessive waiting.

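Why exponential backoff (4-10s): The 4-second minimum gives rate limit windows some time to reset, and the 10-second cap prevents unbounded waiting. With a multiplier of 1, the raw exponential values for the first retries are still at or below the 4-second minimum, so with only 3 attempts both waits are clamped to about 4s. The longer waits (8s, then 10s at the cap) would only come into play if the attempt limit were raised.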
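
The arithmetic behind the wait schedule can be checked with a few lines of Python. This is a sketch under an assumption: it models the wait after failed attempt n as multiplier × 2^(n-1), clamped to the min/max bounds, which may differ slightly between tenacity versions. `clamped_wait` and `wait_schedule` are hypothetical helper names, not library functions.

```python
def clamped_wait(attempt, multiplier=1, min_wait=4, max_wait=10):
    """Seconds to wait after the given failed attempt (1-based), assumed formula."""
    raw = multiplier * 2 ** (attempt - 1)
    return max(min_wait, min(raw, max_wait))


def wait_schedule(max_attempts, **kwargs):
    """Waits between attempts; nothing is waited after the final attempt."""
    return [clamped_wait(a, **kwargs) for a in range(1, max_attempts)]


print(wait_schedule(3))       # [4, 4]
print(wait_schedule(6))       # [4, 4, 4, 8, 10]
print(sum(wait_schedule(3)))  # 8
```

Under this model, a worker that fails all 3 attempts spends about 8 seconds waiting in total, in addition to the time taken by the three calls themselves.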

Why 2 workers for eval (not 4): Each evaluation call involves a full LLM inference request. With GPT-4 evaluators, each call can cost significant tokens and time. Two concurrent workers balance throughput with rate limit safety, especially since evaluations are often run against expensive frontier models.
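
The effect of the worker limit can be seen in a small standard-library sketch. It is an illustration of semaphore-based limiting in general, not the runner's actual code: `run_with_limit` and `worker` are hypothetical names, and the short sleep stands in for an LLM evaluation call.

```python
import asyncio


async def run_with_limit(num_tasks, workers=2):
    """Run num_tasks coroutines while allowing at most `workers` at once."""
    semaphore = asyncio.Semaphore(workers)
    active = 0  # coroutines currently inside the semaphore
    peak = 0    # highest concurrency observed

    async def worker(i):
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for an evaluation request
            active -= 1
            return i

    results = await asyncio.gather(*(worker(i) for i in range(num_tasks)))
    return results, peak


results, peak = asyncio.run(run_with_limit(num_tasks=6, workers=2))
print(results, peak)
```

All six tasks are scheduled immediately, but only two hold the semaphore at any moment, so raising `workers` trades rate limit headroom for throughput.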

Code evidence from `evaluation/batch_runner.py:11-14`:

@retry(
    reraise=True,
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
)

Semaphore initialization from `evaluation/batch_runner.py:90-95`:

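def __init__(
    self,
    evaluators: Dict[str, BaseEvaluator],
    workers: int = 2,
    show_progress: bool = False,
):
    # Excerpt: other attribute assignments are omitted here.
    self.workers = workers
    # One shared semaphore caps how many worker coroutines run at once.
    self.semaphore = asyncio.Semaphore(self.workers)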
