Heuristic: BerriAI LiteLLM Retry Backoff Strategy
| Knowledge Sources | |
|---|---|
| Domains | LLM_Gateway, Optimization |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
Exponential backoff retry strategy with jitter (0.5s initial, 8s max, 0.75s jitter) and status-code-aware retry decisions to prevent thundering herd effects.
Description
LiteLLM implements an exponential backoff retry mechanism for transient LLM API failures. The strategy starts with a 0.5-second delay, doubles on each retry up to an 8-second cap, and adds random jitter of up to 0.75 seconds. Crucially, the retry logic is status-code-aware: it retries on 429 (Rate Limit), 408 (Timeout), and 5XX errors, but does not retry on permanent failures like 400 (Bad Request) or most 401 (Auth) errors. The system also prevents infinite retry loops by clearing the retry policy after the first retry attempt.
Usage
Apply this heuristic when configuring retry behavior for LLM API calls, either through the SDK or the Router. Understanding these defaults helps set appropriate `num_retries` and `timeout` values for your use case.
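As a sketch of how these defaults can be overridden at the proxy level (this assumes the standard LiteLLM `config.yaml` schema; the keys shown are router settings, and the values here are illustrative, not recommendations):

```yaml
# Hypothetical proxy config fragment: raise retries above the default of 2
# and bound each request so retries cannot stack up indefinitely.
router_settings:
  num_retries: 3
  timeout: 30   # seconds, per request
```

The same `num_retries` and `timeout` options can also be passed per-call through the SDK.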
The Insight (Rule of Thumb)
- Backoff Parameters:
- `INITIAL_RETRY_DELAY=0.5` seconds
- `MAX_RETRY_DELAY=8.0` seconds
- `JITTER=0.75` seconds (random addition to prevent thundering herd)
- `DEFAULT_MAX_RETRIES=2` attempts
- Retry Decision by Status Code:
- 429 (Rate Limit): Retry with exponential backoff
- 5XX (Server Error): Retry
- 408 (Timeout): Retry
- 401 (Auth Error): Generally do NOT retry
- Other 4XX: Do NOT retry
- Anti-Loop Protection: After the first retry, the retry policy is set to `None` to prevent recursive retry loops.
- Trade-off: More retries improve success rate for transient failures but increase end-to-end latency. The 8-second cap prevents individual requests from blocking too long.
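The parameters above combine into a simple delay schedule. The following is a minimal sketch that mirrors the documented constants; `backoff_delay` is a hypothetical helper, and LiteLLM's actual implementation may structure this differently:

```python
import random

# Documented defaults from the table above
INITIAL_RETRY_DELAY = 0.5  # seconds
MAX_RETRY_DELAY = 8.0      # seconds
JITTER = 0.75              # max random addition, seconds

def backoff_delay(attempt: int) -> float:
    """Delay before retry number `attempt` (0-based): exponential growth,
    capped at MAX_RETRY_DELAY, plus uniform jitter up to JITTER seconds."""
    base = min(INITIAL_RETRY_DELAY * (2 ** attempt), MAX_RETRY_DELAY)
    return base + random.uniform(0, JITTER)

# Base delays (before jitter) double 0.5 -> 1 -> 2 -> 4 -> 8, then stay capped:
bases = [min(INITIAL_RETRY_DELAY * 2 ** n, MAX_RETRY_DELAY) for n in range(6)]
print(bases)  # [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
```

Note that with only 2 default retries the cap is never reached; it matters when callers raise `num_retries`.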
Reasoning
Exponential backoff with jitter is the industry standard for distributed systems because:
- Exponential growth gives failing services time to recover (doubling from 0.5s to 1s to 2s to 4s to 8s).
- Random jitter (up to 0.75s) breaks synchronization between clients that would otherwise all retry at the same moment after a shared failure, preventing thundering herd.
- The 8-second cap ensures that even with many retries, no single request waits excessively. For LLM calls that may have 10-60 second latency, an 8-second retry delay is proportional.
- Status-code awareness avoids wasting time retrying permanent failures (bad request format, authentication issues).
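The status-code rules above can be sketched as a predicate. This is a hypothetical helper illustrating the decision table, not LiteLLM's actual function:

```python
def should_retry(status_code: int) -> bool:
    """Status-code-aware retry decision following the rules above."""
    if status_code == 429:            # rate limit: back off and retry
        return True
    if status_code == 408:            # request timeout: retry
        return True
    if 500 <= status_code < 600:      # transient server errors: retry
        return True
    return False  # 400, 401, and other 4XX are permanent: do not retry
```

Keeping the decision in one place like this makes the retry policy easy to audit against provider-specific error semantics.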
Code Evidence
Backoff constants from `litellm/constants.py:291-294`:
```python
INITIAL_RETRY_DELAY = 0.5  # seconds
MAX_RETRY_DELAY = 8.0  # seconds
JITTER = 0.75  # Add random jitter up to 0.75s to prevent thundering herd
```
Anti-loop protection from `litellm/utils.py:1690-1732`:
```python
# prevent infinite loops
# set retries to None to prevent infinite loops
```
Router max fallbacks from `litellm/constants.py:13`:
```python
ROUTER_MAX_FALLBACKS = int(os.getenv("ROUTER_MAX_FALLBACKS", 5))
```