Heuristic:BerriAI Litellm Cooldown Threshold Tuning

Knowledge Sources	BerriAI/litellm Production tuning
Domains	LLM_Gateway, Optimization
Last Updated	2026-02-15 16:00 GMT

Overview

A strategy for tuning deployment cooldowns in the LiteLLM Router. It relies on multi-tier error-rate thresholds (defaults: 50% failure rate, a minimum of 5 requests, and a 5-second cooldown) to prevent cascading failures while avoiding false positives.

Description

LiteLLM's router uses a deployment cooldown mechanism to temporarily remove failing deployments from the load balancing pool. The system uses a multi-tier approach: it tracks per-minute error rates, applies different thresholds based on whether there are single or multiple deployments in a model group, and selectively cools down based on HTTP status code categories. This prevents sending traffic to failing endpoints while avoiding overly aggressive cooldowns on transient errors.
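To make the mechanism concrete, here is a minimal sketch of a load-balancing pool that temporarily excludes a deployment and readmits it once the cooldown expires. `CooldownPool`, its methods, and the deployment IDs are hypothetical names chosen for illustration; this is not LiteLLM's implementation or API, only a toy model of the behavior described above.

```python
import time
from typing import Callable, Dict, List


class CooldownPool:
    """Toy pool with temporary cooldowns (hypothetical, not LiteLLM's API)."""

    def __init__(
        self,
        deployments: List[str],
        cooldown_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.deployments = list(deployments)
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the example runs without sleeping
        self._cooldown_until: Dict[str, float] = {}

    def cool_down(self, deployment_id: str) -> None:
        """Exclude a deployment until now + cooldown_seconds."""
        self._cooldown_until[deployment_id] = self.clock() + self.cooldown_seconds

    def healthy(self) -> List[str]:
        """Return deployments that were never cooled down or whose cooldown expired."""
        now = self.clock()
        return [
            d
            for d in self.deployments
            if d not in self._cooldown_until or self._cooldown_until[d] <= now
        ]


# Usage with a fake clock so time can be advanced by hand.
fake_now = [100.0]
pool = CooldownPool(
    ["azure-gpt-4", "openai-gpt-4"], cooldown_seconds=5.0, clock=lambda: fake_now[0]
)
pool.cool_down("azure-gpt-4")
print(pool.healthy())  # ['openai-gpt-4']
fake_now[0] += 5.0
print(pool.healthy())  # ['azure-gpt-4', 'openai-gpt-4']
```

The injectable clock is a design choice for testability; a real router would also need shared state across workers, which this sketch ignores.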

Usage

Apply this heuristic when configuring the LiteLLM Router for production use with multiple deployments. Tune the cooldown parameters based on your deployment reliability profile and traffic volume. This is critical for high-availability setups, where false cooldowns could route all traffic to a single endpoint.

The Insight (Rule of Thumb)

Action: Configure cooldown thresholds via environment variables.
Values:
- `DEFAULT_FAILURE_THRESHOLD_PERCENT=0.5` (cool down a deployment when 50% of its requests fail in a given minute)
- `DEFAULT_FAILURE_THRESHOLD_MINIMUM_REQUESTS=5` (at least 5 requests are needed before the error-rate rule applies)
- `DEFAULT_COOLDOWN_TIME_SECONDS=5` (a cooled-down deployment is excluded for 5 seconds)
- `SINGLE_DEPLOYMENT_TRAFFIC_FAILURE_THRESHOLD=1000` (the only deployment in a model group is cooled down only after 1000 requests with a 100% failure rate)
HTTP Status Code Rules:
- 429 (Rate Limit): always cool down
- 401 (Auth Error): always cool down
- 408 (Timeout): always cool down
- 404 (Not Found): always cool down
- Other 4XX (Client Error): do NOT cool down
- All other status codes (5XX and anything else outside the 4XX range): always cool down
Trade-off: Lower thresholds catch failures faster but risk false positives; higher thresholds tolerate more transient errors but react more slowly to real outages.
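Since the defaults are read with `os.getenv` at module level (see Code Evidence), an override has to be present in the environment before the constants module is first imported. The sketch below sets overrides from Python and then repeats the same parsing pattern to confirm what would be picked up. The override values (0.7, 20, 30) are illustrative assumptions, not recommendations.

```python
import os

# Hypothetical production overrides; environment values must be strings.
overrides = {
    "DEFAULT_FAILURE_THRESHOLD_PERCENT": "0.7",
    "DEFAULT_FAILURE_THRESHOLD_MINIMUM_REQUESTS": "20",
    "DEFAULT_COOLDOWN_TIME_SECONDS": "30",
}
os.environ.update(overrides)

# Mirror the parsing pattern from the constants excerpt to verify the overrides.
failure_threshold = float(os.getenv("DEFAULT_FAILURE_THRESHOLD_PERCENT", 0.5))
minimum_requests = int(os.getenv("DEFAULT_FAILURE_THRESHOLD_MINIMUM_REQUESTS", 5))
cooldown_seconds = int(os.getenv("DEFAULT_COOLDOWN_TIME_SECONDS", 5))

print(failure_threshold, minimum_requests, cooldown_seconds)  # 0.7 20 30
```

In most deployments these variables would be set in the process environment (container spec, service unit, shell) rather than from Python; the point here is only the ordering: set first, import afterwards.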

Reasoning

The multi-tier approach exists because:

- Minimum request threshold (5 requests) prevents a single failed request from triggering cooldown, which would cause unnecessary failovers on transient errors.
- Single deployment protection (1000 requests) avoids cooling down the only available deployment unless it is truly broken (100% failure rate at scale).
- Status code selectivity avoids cooling down on client-side errors (400, 422), which indicate bad requests, not server issues. Rate limits (429) and auth errors (401) do trigger cooldowns because they indicate the deployment cannot serve traffic.
- 5-second cooldown is short enough to quickly retry recovered deployments but long enough to avoid hammering a failing endpoint.
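The interaction between the tiers can be sketched as a single decision function. This is an illustrative reading of the rules listed above, not LiteLLM's actual code: `CooldownConfig` and `should_cool_down` are hypothetical names, and the strict `>` comparison against the threshold is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CooldownConfig:
    """Defaults taken from the values listed in this page."""

    failure_threshold_percent: float = 0.5
    minimum_requests: int = 5
    single_deployment_traffic_threshold: int = 1000


def should_cool_down(
    failures: int,
    successes: int,
    deployments_in_group: int,
    config: CooldownConfig = CooldownConfig(),
) -> bool:
    """Decide whether this minute's error rate warrants a cooldown (illustrative only)."""
    total = failures + successes
    if total == 0:
        return False
    failure_rate = failures / total

    if deployments_in_group <= 1:
        # Single deployment: cool down only on total failure at meaningful volume.
        return (
            failure_rate == 1.0
            and total >= config.single_deployment_traffic_threshold
        )

    # Multiple deployments: require enough samples before trusting the rate.
    if total < config.minimum_requests:
        return False
    # Assumption: strictly greater than the threshold triggers the cooldown.
    return failure_rate > config.failure_threshold_percent


print(should_cool_down(failures=1, successes=0, deployments_in_group=3))     # False
print(should_cool_down(failures=4, successes=2, deployments_in_group=3))     # True
print(should_cool_down(failures=50, successes=0, deployments_in_group=1))    # False
print(should_cool_down(failures=1000, successes=0, deployments_in_group=1))  # True
```

The first call stays below the 5-request minimum, the second has 6 requests at roughly 67% failures, and the last two show the single-deployment rule tolerating total failure until traffic reaches 1000 requests.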

Code Evidence

Default thresholds from `litellm/constants.py:37-74`:

import os  # required by os.getenv; repeated here so the excerpt runs standalone

DEFAULT_FAILURE_THRESHOLD_PERCENT = float(
    os.getenv("DEFAULT_FAILURE_THRESHOLD_PERCENT", 0.5)
)  # default cooldown a deployment if 50% of requests fail in a given minute

DEFAULT_COOLDOWN_TIME_SECONDS = int(os.getenv("DEFAULT_COOLDOWN_TIME_SECONDS", 5))  # cooldown duration in seconds

SINGLE_DEPLOYMENT_TRAFFIC_FAILURE_THRESHOLD = int(
    os.getenv("SINGLE_DEPLOYMENT_TRAFFIC_FAILURE_THRESHOLD", 1000)
)  # Minimum number of requests to consider "reasonable traffic".

DEFAULT_FAILURE_THRESHOLD_MINIMUM_REQUESTS = int(
    os.getenv("DEFAULT_FAILURE_THRESHOLD_MINIMUM_REQUESTS", 5)
)  # Minimum number of requests before applying error rate cooldown.

Status code decision tree from `litellm/router_utils/cooldown_handlers.py:70-91`:

if exception_status >= 400 and exception_status < 500:
    if exception_status == 429:
        return True   # Cool down 429 Rate Limit Errors
    elif exception_status == 401:
        return True   # Cool down 401 Auth Errors
    elif exception_status == 408:
        return True   # Cool down 408 Timeout Errors
    elif exception_status == 404:
        return True   # Cool down 404 Not Found Errors
    else:
        return False  # Do NOT cool down all other 4XX Errors
else:
    return True       # should cool down for all other errors
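The excerpt above is the body of a larger function and will not run on its own. A self-contained equivalent, wrapped in a hypothetical helper named `is_cooldown_eligible_status` (the name and signature are assumptions made for this example), makes the rules easy to check against a list of status codes:

```python
def is_cooldown_eligible_status(exception_status: int) -> bool:
    """Return True if an error with this HTTP status should trigger cooldown logic."""
    if 400 <= exception_status < 500:
        # Only these client errors suggest the deployment itself cannot serve traffic.
        return exception_status in {401, 404, 408, 429}
    # Everything outside the 4XX range (5XX and any other code) is eligible.
    return True


for status in (400, 401, 404, 408, 422, 429, 500, 503):
    print(status, is_cooldown_eligible_status(status))
# 400 False
# 401 True
# 404 True
# 408 True
# 422 False
# 429 True
# 500 True
# 503 True
```

Collapsing the `elif` chain into a set membership test keeps the behavior identical while making the list of exceptions easier to audit.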
