Heuristic:Run llama Llama index Batch Eval Retry Strategy
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Optimization |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
A retry strategy with exponential backoff that lets batch evaluation workers recover from transient API failures during LLM-as-judge evaluation.
Description
The `BatchEvalRunner` wraps all evaluation and query workers with tenacity's `@retry` decorator using exponential backoff. Each worker makes up to 3 attempts (the initial call plus at most 2 retries), with each wait bounded between 4 and 10 seconds. Combined with semaphore-based concurrency limiting (default 2 workers), this makes the evaluation pipeline resilient to rate limits and transient API errors.
Usage
This heuristic is built into the `BatchEvalRunner` and applies automatically. Understand it when:
- Debugging evaluation failures or timeouts
- Tuning the number of concurrent evaluation workers
- Experiencing frequent API rate limit errors during batch evaluation
The Insight (Rule of Thumb)
- Action: The retry decorator is pre-configured on all three worker types: `eval_response_worker`, `eval_worker`, and `response_worker`.
- Value: 3 attempts, exponential backoff with multiplier=1, min=4s, max=10s wait.
- Workers: Default 2 concurrent workers (via semaphore), lower than the general async default of 4.
- Trade-off: Retries add latency when API errors occur but prevent complete evaluation failure. Conservative worker count (2) reduces rate limit risk but slows total evaluation time.
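The retry behaviour summarised above can be sketched with the standard library alone. This is a hand-rolled stand-in, not tenacity and not the `BatchEvalRunner` code: `TransientError`, `call_with_retry`, and `make_flaky_judge` are hypothetical names, and the wait is a fixed value rather than an exponential schedule. It shows the two outcomes that matter: a call that recovers within 3 attempts, and a call that exhausts its attempts and re-raises the original error.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit, timeout, or 5xx error."""


async def call_with_retry(fn, max_attempts=3, wait_seconds=4.0):
    """Await fn(), allowing max_attempts total tries with a fixed wait between them."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # like reraise=True: surface the last original error
            await asyncio.sleep(wait_seconds)


def make_flaky_judge(failures_before_success):
    """Build a fake judge call that fails a set number of times, then succeeds."""
    state = {"calls": 0}

    async def judge():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("simulated rate limit")
        return "pass"

    return judge, state


async def demo():
    # Fails twice, succeeds on the third and final attempt.
    recovering, recovering_state = make_flaky_judge(failures_before_success=2)
    result = await call_with_retry(recovering, wait_seconds=0.01)

    # Fails more often than the attempt budget allows, so the error propagates.
    hopeless, hopeless_state = make_flaky_judge(failures_before_success=5)
    try:
        await call_with_retry(hopeless, wait_seconds=0.01)
        exhausted = False
    except TransientError:
        exhausted = True

    return result, recovering_state["calls"], exhausted, hopeless_state["calls"]


result, recovered_calls, exhausted, hopeless_calls = asyncio.run(demo())
print(result, recovered_calls, exhausted, hopeless_calls)  # prints: pass 3 True 3
```

One difference worth noting: tenacity retries on any exception when no `retry=` condition is given, whereas this sketch only catches the hypothetical `TransientError`.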
Reasoning
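Why 3 attempts: Most transient API errors (rate limits, timeouts, 5xx responses) resolve within 1-2 retries. Three attempts (one initial call plus two retries) provide good reliability without excessive waiting.
Why exponential backoff (4-10s): The 4-second minimum wait gives rate limit windows time to reset, and the 10-second cap prevents unbounded waiting. With a multiplier of 1, the raw exponential value stays at or below the 4-second minimum for the first two failures, so the two waits that can occur within 3 attempts are both about 4s (roughly 8s of added wait in the worst case). The 10-second cap would only come into play if more attempts were allowed.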
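To check the wait schedule by hand, the sketch below re-implements the clamp that `wait_exponential` is understood to apply. The formula is an assumption about tenacity's behaviour, not a copy of its code, and the function and parameter names are illustrative. Some tenacity versions use the attempt number itself as the exponent rather than the attempt number minus one; for the configuration here both variants clamp the first two waits to the 4-second minimum.

```python
def wait_exponential(attempt_number, multiplier=1, min_wait=4, max_wait=10, exp_base=2):
    """Wait in seconds after the given 1-based failed attempt.

    Assumed formula: clamp(multiplier * exp_base ** (attempt_number - 1),
    min_wait, max_wait).
    """
    raw = multiplier * exp_base ** (attempt_number - 1)
    return max(min_wait, min(raw, max_wait))


def wait_schedule(max_attempts=3, **kwargs):
    """Waits between attempts: N attempts leave room for only N - 1 waits."""
    return [wait_exponential(n, **kwargs) for n in range(1, max_attempts)]


schedule = wait_schedule(max_attempts=3)
print(schedule, sum(schedule))  # prints: [4, 4] 8

# With a larger (hypothetical) attempt budget the growth and the cap become visible.
print(wait_schedule(max_attempts=6))  # prints: [4, 4, 4, 8, 10]
```

Under these assumptions, the 3-attempt configuration behaves like a fixed 4-second delay between tries, and the exponential growth only shows up from the fourth wait onward.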
Why 2 workers for eval (not 4): Each evaluation call involves a full LLM inference request. With GPT-4 evaluators, each call can cost significant tokens and time. Two concurrent workers balance throughput with rate limit safety, especially since evaluations are often run against expensive frontier models.
Code evidence from `evaluation/batch_runner.py:11-14`:
@retry(
    reraise=True,
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
)
Semaphore initialization from `evaluation/batch_runner.py:90-95`:
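def __init__(
    self,
    evaluators: Dict[str, BaseEvaluator],
    workers: int = 2,
    show_progress: bool = False,
):
    # Other attribute assignments are omitted from this excerpt.
    self.workers = workers
    # The semaphore caps how many workers run at the same time.
    self.semaphore = asyncio.Semaphore(self.workers)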
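As an illustration of what the semaphore buys, here is a minimal standalone sketch (not the `BatchEvalRunner` implementation) that runs six simulated calls through a semaphore of size 2 and records the peak number in flight. `run_bounded`, `call_seconds`, and the doubling "work" are hypothetical placeholders for real evaluator calls.

```python
import asyncio


async def run_bounded(items, workers=2, call_seconds=0.02):
    """Process items concurrently while never exceeding `workers` calls in flight."""
    semaphore = asyncio.Semaphore(workers)
    in_flight = 0
    peak = 0

    async def worker(item):
        nonlocal in_flight, peak
        async with semaphore:  # at most `workers` coroutines are inside this block
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(call_seconds)  # stands in for the LLM call
            in_flight -= 1
            return item * 2

    # gather() returns results in the order the items were submitted.
    results = await asyncio.gather(*(worker(item) for item in items))
    return results, peak


results, peak = asyncio.run(run_bounded(list(range(6)), workers=2))
print(results, peak)  # prints: [0, 2, 4, 6, 8, 10] 2
```

All six tasks are created at once, but only two ever hold the semaphore, which is why raising `workers` increases throughput and rate limit pressure together.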