Heuristic: Arize AI Phoenix Adaptive Rate Limiting
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs |
| Last Updated | 2026-02-14 06:00 GMT |
Overview
Adaptive token bucket rate limiter that auto-adjusts request rates based on 429 errors, eliminating the need to know exact provider rate limits.
Description
Phoenix uses an AdaptiveTokenBucket rate limiter for LLM API calls during batch evaluations and experiments. Instead of requiring users to configure exact rate limits for each provider, the system starts with a configurable initial rate and dynamically adjusts it: halving the rate on each 429 error (multiplicative decrease) and slowly increasing it over time when no errors occur (exponential recovery). This approach works across all LLM providers without per-provider configuration.
The rate limiter uses a cooldown mechanism to prevent concurrent requests from triggering multiple rate reductions for the same rate limit event. After a rate limit error, the system blocks for the cooldown duration (default: 5 seconds) before allowing new requests.
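The multiplicative-decrease, exponential-recovery, and cooldown behavior described above can be sketched as follows. This is an illustrative model only, not Phoenix's actual `AdaptiveTokenBucket` implementation; the class name, method names, and the per-call update cadence are assumptions for the sketch.

```python
import time

class AdaptiveRateLimiterSketch:
    """Illustrative adaptive rate limiter (not Phoenix's actual class).

    Halves the request rate on a rate-limit error (multiplicative
    decrease) and grows it exponentially while requests succeed.
    """

    def __init__(self, initial_rate=5.0, reduction_factor=0.5,
                 increase_factor=0.01, cooldown_seconds=5.0,
                 clock=time.monotonic):
        self.rate = initial_rate
        # Ceiling: three consecutive doublings above the initial rate (8x).
        self.max_rate = initial_rate * (1 / reduction_factor) ** 3
        self.reduction_factor = reduction_factor
        self.increase_factor = increase_factor
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.last_error_time = float("-inf")

    def on_rate_limit_error(self):
        now = self.clock()
        # Cooldown: concurrent 429s within the window count as one event,
        # so the rate is only halved once per rate-limit episode.
        if now - self.last_error_time >= self.cooldown_seconds:
            self.rate *= self.reduction_factor
            self.last_error_time = now

    def on_success(self, elapsed_seconds=1.0):
        # Slow exponential recovery while no errors occur, capped at max_rate.
        self.rate = min(
            self.max_rate,
            self.rate * (1 + self.increase_factor) ** elapsed_seconds,
        )
```

The injectable `clock` makes the cooldown behavior testable without sleeping; the real limiter sits inside the executor infrastructure, so users normally never construct it directly.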
Usage
This heuristic applies whenever you run batch LLM evaluations (via `llm_classify`, `run_evals`) or experiments with LLM tasks. The rate limiter is transparently applied by the executor infrastructure. Use this knowledge when:
- Tuning `initial_per_second_request_rate` for your provider tier
- Debugging slow evaluation throughput caused by rate limit throttling
- Understanding why evaluation speed varies across runs
The Insight (Rule of Thumb)
- Action: Let the AdaptiveTokenBucket auto-discover the effective rate limit. Set `initial_per_second_request_rate` to your expected sustainable rate.
- Value: Default initial rate is 5.0 RPS for evals, 1.0 RPS for client operations. Rate reduction factor is 0.5 (halve on error). Maximum rate caps at 8x the initial rate (3 consecutive doublings).
- Trade-off: The system starts conservatively after a rate limit error, gradually recovering. This means brief throughput drops after hitting limits, but prevents cascading failures.
- Key parameters:
- `rate_reduction_factor` = 0.5 (halve rate on error)
- `rate_increase_factor` = 0.01 (slow exponential recovery)
- `cooldown_seconds` = 5 (block duration after rate limit hit)
- `enforcement_window_minutes` = 1 (rate resets to initial after 1 minute of no errors)
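A worked example of how these defaults interact: one 429 halves a 5.0 RPS eval rate to 2.5 RPS, and with `rate_increase_factor` = 0.01 the rate must double to get back. Assuming the recovery factor compounds once per second (the compounding cadence is an assumption for illustration), solving (1.01)^t = 2 gives the recovery time.

```python
import math

initial_rate = 5.0            # default eval rate, RPS
rate_reduction_factor = 0.5
rate_increase_factor = 0.01

# One 429 halves the rate.
reduced = initial_rate * rate_reduction_factor  # 2.5 RPS

# Time to double back to the initial rate at 1% compounding per second
# (compounding cadence assumed for illustration): solve (1.01)^t = 2.
seconds_to_recover = math.log(2) / math.log(1 + rate_increase_factor)
print(round(seconds_to_recover, 1))  # ≈ 69.7 seconds
```

Note that the `enforcement_window_minutes` reset (rate returns to initial after one error-free minute) would kick in before this multiplicative recovery fully completes.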
Reasoning
LLM API providers each have different rate limits that vary by tier, model, and time of day. Hard-coding rates would require per-provider configuration and constant maintenance. The adaptive approach:
- Discovers the effective limit by probing: if the initial rate exceeds the provider's actual limit, repeated halving converges downward within a few 429s.
- Handles burst capacity since the recovery mechanism gradually increases the rate when the provider allows more traffic.
- Prevents thundering herd via the cooldown mechanism: concurrent requests that hit the limit during the same window do not each trigger a rate reduction.
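The thundering-herd protection can be seen on a toy timeline (timestamps are made up for illustration): five concurrent requests all receive a 429, but only the first one outside the cooldown window reduces the rate.

```python
cooldown_seconds = 5.0
rate = 5.0

# Hypothetical timeline: five concurrent requests all hit a 429
# within the same 5-second cooldown window.
error_times = [10.0, 10.1, 10.2, 10.3, 14.9]

last_reduction = float("-inf")
reductions = 0
for t in error_times:
    if t - last_reduction >= cooldown_seconds:
        rate *= 0.5
        last_reduction = t
        reductions += 1

print(reductions, rate)  # 1 reduction, rate 2.5
```

Without the cooldown, the same burst would trigger five halvings (5.0 → ~0.16 RPS), collapsing throughput for a single rate-limit episode.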
The maximum rate is bounded to prevent runaway scaling. From `rate_limiters.py:58`:
```python
maximum_rate_multiple = (1 / rate_reduction_factor) ** 3
maximum_per_second_request_rate = initial_per_second_request_rate * maximum_rate_multiple
```
With the default `rate_reduction_factor` of 0.5, this computes to 8x the initial rate as the ceiling.
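Plugging the defaults into the formula above (5.0 RPS initial rate for evals) makes the ceiling concrete:

```python
rate_reduction_factor = 0.5
initial_per_second_request_rate = 5.0  # default for evals

maximum_rate_multiple = (1 / rate_reduction_factor) ** 3
maximum_per_second_request_rate = initial_per_second_request_rate * maximum_rate_multiple

print(maximum_rate_multiple, maximum_per_second_request_rate)  # 8.0, 40.0 RPS
```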