Heuristic: Promptfoo Adaptive Concurrency Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Rate_Limiting |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
Adaptive concurrency management that halves concurrency on rate limits (0.5x backoff) and recovers at 1.5x after 5 consecutive successes; fully recovering from minimum (1) to initial (10) concurrency takes 25 successful requests.
Description
Promptfoo's scheduler uses an adaptive concurrency algorithm to dynamically adjust the number of parallel API requests based on rate limit feedback from LLM providers. When a 429 (rate limit) response is received, concurrency is immediately halved. Recovery is conservative: concurrency only increases after 5 consecutive successful requests, and it increases by 50% each time (capped at the initial value). This prevents thundering herd problems where all clients simultaneously resume full speed after a rate limit clears.
The system also supports proactive reduction at 10% remaining capacity (detected via rate limit headers) to avoid hitting hard limits.
Usage
Apply this heuristic when running large-scale evaluations against rate-limited LLM APIs. The adaptive scheduler is enabled by default. Disable it with `PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER=true` if you manage rate limiting externally. Tune minimum concurrency with `PROMPTFOO_MIN_CONCURRENCY`.
The Insight (Rule of Thumb)
- Action: Let the adaptive scheduler manage concurrency automatically. Do not set `maxConcurrency` higher than your API tier allows.
- Value: Default initial concurrency of 4 (`DEFAULT_MAX_CONCURRENCY`). Backoff factor 0.5x, recovery factor 1.5x, recovery threshold 5.
- Trade-off: Conservative recovery means slower ramp-up after rate limits, but prevents repeated 429 cascades.
- Recovery Path: With min=1 and initial=10 (the values used in the source comment; the default initial is 4), full recovery takes 25 successful requests: 1 -> 2 -> 3 -> 5 -> 8 -> 10.
- Proactive Reduction: At 10% remaining capacity, concurrency is reduced before a hard 429 occurs.
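The 25-request recovery arithmetic above can be reproduced with a short simulation. The function is illustrative (not Promptfoo code); the constants it encodes — 1.5x growth with ceiling rounding, 5 successes per step, cap at the initial value — are taken from this document.

```typescript
// Simulate recovery from minimum concurrency back to the initial value.
// Illustrative only; constants come from the heuristic described above.
function recoveryPath(min: number, initial: number): { path: number[]; requests: number } {
  const path = [min];
  let current = min;
  let requests = 0;
  while (current < initial) {
    requests += 5; // 5 consecutive successes are required per increase
    current = Math.min(initial, Math.ceil(current * 1.5));
    path.push(current);
  }
  return { path, requests };
}
```

`recoveryPath(1, 10)` yields the path `[1, 2, 3, 5, 8, 10]` and 25 requests, matching the recovery path stated above.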
Reasoning
LLM API rate limits vary by provider and pricing tier. A static concurrency setting either wastes quota (too low) or triggers cascading 429s (too high). The adaptive approach:
- Immediate backoff (0.5x): Halving on rate limit is aggressive enough to relieve pressure quickly.
- Gradual recovery (1.5x after 5 successes): Prevents oscillation between full speed and rate-limited state.
- Per-evaluation tracking: Each evaluation has its own rate limit state, preventing cross-evaluation interference.
- Header parsing: Supports OpenAI, Anthropic, and de facto standard `x-ratelimit-*`/`ratelimit-*` headers for proactive reduction. (RFC 6585 defines the 429 status code itself, not these headers.)
From `src/scheduler/adaptiveConcurrency.ts:17-28`:
/**
* Recovery path with constants (initial=10, min=1):
* 1 → ceil(1.5) = 2 (5 successes)
* 2 → ceil(3.0) = 3 (5 successes)
* 3 → ceil(4.5) = 5 (5 successes)
* 5 → ceil(7.5) = 8 (5 successes)
* 8 → ceil(12) = 10 (5 successes, capped at initial)
*
* Total: 25 requests to fully recover from min=1 to initial=10
*/
Rate limit header parsing from `src/scheduler/headerParser.ts:51-57`:
// Remaining counts (ordered: OpenAI, Anthropic, Standard)
result.remainingRequests = parseFirstMatch(h, [
OPENAI_HEADERS.remainingRequests, // x-ratelimit-remaining-requests
ANTHROPIC_HEADERS.remainingRequests, // anthropic-ratelimit-requests-remaining
STANDARD_HEADERS.remainingAlt, // x-ratelimit-remaining
STANDARD_HEADERS.remaining, // ratelimit-remaining
]);
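The `parseFirstMatch` helper referenced in the excerpt is not shown. A plausible sketch — the name comes from the snippet, but the signature and body are assumptions — scans an ordered list of header names and returns the first value that parses as a number:

```typescript
// Hypothetical sketch of a first-match header lookup; the real
// implementation in headerParser.ts may differ.
// Header names are matched lowercase, as Node's http module
// normalizes incoming header names to lowercase.
function parseFirstMatch(
  headers: Record<string, string | undefined>,
  names: string[],
): number | undefined {
  for (const name of names) {
    const value = headers[name.toLowerCase()];
    if (value !== undefined) {
      const parsed = parseInt(value, 10);
      if (!Number.isNaN(parsed)) return parsed;
    }
  }
  return undefined;
}
```

The ordering in the excerpt matters: an OpenAI-style `x-ratelimit-remaining-requests` header wins over a generic `ratelimit-remaining` when both are present.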