Heuristic: Promptfoo Adaptive Concurrency Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Rate_Limiting |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
Adaptive concurrency management that halves concurrency on rate limits (0.5x backoff) and recovers at 1.5x after 5 consecutive successes; fully recovering from minimum (1) to initial (10) concurrency takes 25 successful requests.
Description
Promptfoo's scheduler uses an adaptive concurrency algorithm to dynamically adjust the number of parallel API requests based on rate limit feedback from LLM providers. When a 429 (rate limit) response is received, concurrency is immediately halved. Recovery is conservative: concurrency only increases after 5 consecutive successful requests, and it increases by 50% each time (capped at the initial value). This prevents thundering herd problems where all clients simultaneously resume full speed after a rate limit clears.
The system also supports proactive reduction at 10% remaining capacity (detected via rate limit headers) to avoid hitting hard limits.
Usage
Apply this heuristic when running large-scale evaluations against rate-limited LLM APIs. The adaptive scheduler is enabled by default. Disable it with `PROMPTFOO_DISABLE_ADAPTIVE_SCHEDULER=true` if you manage rate limiting externally. Tune minimum concurrency with `PROMPTFOO_MIN_CONCURRENCY`.
The Insight (Rule of Thumb)
- Action: Let the adaptive scheduler manage concurrency automatically. Do not set `maxConcurrency` higher than your API tier allows.
- Value: Default initial concurrency of 4 (`DEFAULT_MAX_CONCURRENCY`). Backoff factor 0.5x, recovery factor 1.5x, recovery threshold 5.
- Trade-off: Conservative recovery means slower ramp-up after rate limits, but prevents repeated 429 cascades.
- Recovery Path: With min=1 and initial=10 (the values used in the source comment; the default initial is 4), full recovery takes 25 successful requests: 1 -> 2 -> 3 -> 5 -> 8 -> 10.
- Proactive Reduction: At 10% remaining capacity, concurrency is reduced before a hard 429 occurs.
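The 25-request recovery arithmetic above can be reproduced with a short simulation. The function is illustrative (not Promptfoo code); the constants it encodes — 1.5x growth with ceiling rounding, 5 successes per step, cap at the initial value — are taken from this document.

```typescript
// Simulate recovery from minimum concurrency back to the initial value.
// Illustrative only; constants come from the heuristic described above.
function recoveryPath(min: number, initial: number): { path: number[]; requests: number } {
  const path = [min];
  let current = min;
  let requests = 0;
  while (current < initial) {
    requests += 5; // 5 consecutive successes are required per increase
    current = Math.min(initial, Math.ceil(current * 1.5));
    path.push(current);
  }
  return { path, requests };
}
```

`recoveryPath(1, 10)` yields the path `[1, 2, 3, 5, 8, 10]` and 25 requests, matching the recovery path stated above.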
Reasoning
LLM API rate limits vary by provider and pricing tier. A static concurrency setting either wastes quota (too low) or triggers cascading 429s (too high). The adaptive approach:
- Immediate backoff (0.5x): Halving on rate limit is aggressive enough to relieve pressure quickly.
- Gradual recovery (1.5x after 5 successes): Prevents oscillation between full speed and rate-limited state.
- Per-evaluation tracking: Each evaluation has its own rate limit state, preventing cross-evaluation interference.
- Header parsing: Supports OpenAI, Anthropic, and de facto standard `x-ratelimit-*`/`ratelimit-*` headers for proactive reduction. (RFC 6585 defines the 429 status code itself, not these headers.)
From `src/scheduler/adaptiveConcurrency.ts:17-28`:
/**
* Recovery path with constants (initial=10, min=1):
* 1 → ceil(1.5) = 2 (5 successes)
* 2 → ceil(3.0) = 3 (5 successes)
* 3 → ceil(4.5) = 5 (5 successes)
* 5 → ceil(7.5) = 8 (5 successes)
* 8 → ceil(12) = 10 (5 successes, capped at initial)
*
* Total: 25 requests to fully recover from min=1 to initial=10
*/
Rate limit header parsing from `src/scheduler/headerParser.ts:51-57`:
// Remaining counts (ordered: OpenAI, Anthropic, Standard)
result.remainingRequests = parseFirstMatch(h, [
OPENAI_HEADERS.remainingRequests, // x-ratelimit-remaining-requests
ANTHROPIC_HEADERS.remainingRequests, // anthropic-ratelimit-requests-remaining
STANDARD_HEADERS.remainingAlt, // x-ratelimit-remaining
STANDARD_HEADERS.remaining, // ratelimit-remaining
]);
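The `parseFirstMatch` helper referenced in the excerpt is not shown. A plausible sketch — the name comes from the snippet, but the signature and body are assumptions — scans an ordered list of header names and returns the first value that parses as a number:

```typescript
// Hypothetical sketch of a first-match header lookup; the real
// implementation in headerParser.ts may differ.
// Header names are matched lowercase, as Node's http module
// normalizes incoming header names to lowercase.
function parseFirstMatch(
  headers: Record<string, string | undefined>,
  names: string[],
): number | undefined {
  for (const name of names) {
    const value = headers[name.toLowerCase()];
    if (value !== undefined) {
      const parsed = parseInt(value, 10);
      if (!Number.isNaN(parsed)) return parsed;
    }
  }
  return undefined;
}
```

The ordering in the excerpt matters: an OpenAI-style `x-ratelimit-remaining-requests` header wins over a generic `ratelimit-remaining` when both are present.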