Heuristic: BerriAI LiteLLM Retry Backoff Strategy
| Knowledge Sources | |
|---|---|
| Domains | LLM_Gateway, Optimization |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
Exponential backoff retry strategy with jitter (0.5s initial, 8s max, 0.75s jitter) and status-code-aware retry decisions to prevent thundering herd effects.
Description
LiteLLM implements an exponential backoff retry mechanism for transient LLM API failures. The strategy starts with a 0.5-second delay, doubles on each retry up to an 8-second cap, and adds random jitter of up to 0.75 seconds. Crucially, the retry logic is status-code-aware: it retries on 429 (Rate Limit), 408 (Timeout), and 5XX errors, but does not retry on permanent failures like 400 (Bad Request) or most 401 (Auth) errors. The system also prevents infinite retry loops by clearing the retry policy after the first retry attempt.
Usage
Apply this heuristic when configuring retry behavior for LLM API calls, either through the SDK or the Router. Understanding these defaults helps set appropriate `num_retries` and `timeout` values for your use case.
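As a sketch of how these defaults can be overridden at the proxy level (this assumes the standard LiteLLM `config.yaml` schema; the keys shown are router settings, and the values here are illustrative, not recommendations):

```yaml
# Hypothetical proxy config fragment: raise retries above the default of 2
# and bound each request so retries cannot stack up indefinitely.
router_settings:
  num_retries: 3
  timeout: 30   # seconds, per request
```

The same `num_retries` and `timeout` options can also be passed per-call through the SDK.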
The Insight (Rule of Thumb)
- Backoff Parameters:
- `INITIAL_RETRY_DELAY=0.5` seconds
- `MAX_RETRY_DELAY=8.0` seconds
- `JITTER=0.75` seconds (random addition to prevent thundering herd)
- `DEFAULT_MAX_RETRIES=2` attempts
- Retry Decision by Status Code:
- 429 (Rate Limit): Retry with exponential backoff
- 5XX (Server Error): Retry
- 408 (Timeout): Retry
- 401 (Auth Error): Generally do NOT retry
- Other 4XX: Do NOT retry
- Anti-Loop Protection: After the first retry, the retry policy is set to `None` to prevent recursive retry loops.
- Trade-off: More retries improve success rate for transient failures but increase end-to-end latency. The 8-second cap prevents individual requests from blocking too long.
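The parameters above combine into a simple delay schedule. The following is a minimal sketch that mirrors the documented constants; `backoff_delay` is a hypothetical helper, and LiteLLM's actual implementation may structure this differently:

```python
import random

# Documented defaults from the table above
INITIAL_RETRY_DELAY = 0.5  # seconds
MAX_RETRY_DELAY = 8.0      # seconds
JITTER = 0.75              # max random addition, seconds

def backoff_delay(attempt: int) -> float:
    """Delay before retry number `attempt` (0-based): exponential growth,
    capped at MAX_RETRY_DELAY, plus uniform jitter up to JITTER seconds."""
    base = min(INITIAL_RETRY_DELAY * (2 ** attempt), MAX_RETRY_DELAY)
    return base + random.uniform(0, JITTER)

# Base delays (before jitter) double 0.5 -> 1 -> 2 -> 4 -> 8, then stay capped:
bases = [min(INITIAL_RETRY_DELAY * 2 ** n, MAX_RETRY_DELAY) for n in range(6)]
print(bases)  # [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
```

Note that with only 2 default retries the cap is never reached; it matters when callers raise `num_retries`.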
Reasoning
Exponential backoff with jitter is the industry standard for distributed systems because:
- Exponential growth gives failing services time to recover (doubling from 0.5s to 1s to 2s to 4s to 8s).
- Random jitter (up to 0.75s) breaks synchronization between clients that would otherwise all retry at the same moment after a shared failure, preventing thundering herd.
- The 8-second cap ensures that even with many retries, no single request waits excessively. For LLM calls that may have 10-60 second latency, an 8-second retry delay is proportional.
- Status-code awareness avoids wasting time retrying permanent failures (bad request format, authentication issues).
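The status-code rules above can be sketched as a predicate. This is a hypothetical helper illustrating the decision table, not LiteLLM's actual function:

```python
def should_retry(status_code: int) -> bool:
    """Status-code-aware retry decision following the rules above."""
    if status_code == 429:            # rate limit: back off and retry
        return True
    if status_code == 408:            # request timeout: retry
        return True
    if 500 <= status_code < 600:      # transient server errors: retry
        return True
    return False  # 400, 401, and other 4XX are permanent: do not retry
```

Keeping the decision in one place like this makes the retry policy easy to audit against provider-specific error semantics.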
Code Evidence
Backoff constants from `litellm/constants.py:291-294`:
```python
INITIAL_RETRY_DELAY = 0.5  # seconds
MAX_RETRY_DELAY = 8.0  # seconds
JITTER = 0.75  # Add random jitter up to 0.75s to prevent thundering herd
```
Anti-loop protection from `litellm/utils.py:1690-1732`:
```python
# prevent infinite loops
# set retries to None to prevent infinite loops
```
Router max fallbacks from `litellm/constants.py:13`:
```python
ROUTER_MAX_FALLBACKS = int(os.getenv("ROUTER_MAX_FALLBACKS", 5))
```