Implementation: BerriAI LiteLLM Cooldown Cache
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| litellm repository | LLM Reliability, Circuit Breaker | 2026-02-15 |
Overview
A concrete tool for monitoring deployment health through cooldown mechanisms, provided by LiteLLM and implemented as the `CooldownCache` class and its companion handler functions.
Description
LiteLLM's health monitoring is built on three components:

- `CooldownCache` -- A wrapper around `DualCache` (in-memory + Redis) that manages cooldown state for deployments. It stores `CooldownCacheValue` entries (exception info, status code, timestamp, cooldown duration) with TTL-based automatic expiry. It provides both sync and async interfaces for adding deployments to cooldown and retrieving active cooldowns.
- `_set_cooldown_deployments` -- A handler function that decides whether a deployment should enter cooldown based on: (1) whether cooldown logic is enabled, (2) whether the deployment has exceeded its allowed failure threshold, or (3) whether the error type warrants immediate cooldown. When cooldown is triggered, it writes to the `CooldownCache` and fires an async alerting callback.
- `_async_get_cooldown_deployments` -- An async function that retrieves all currently cooled-down deployment IDs by batch-reading cooldown cache keys for all known model IDs. It is used by the router to exclude unhealthy deployments before routing.

`CooldownCacheValue` is a `TypedDict` with the fields `exception_received` (masked), `status_code`, `timestamp`, and `cooldown_time`.
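A minimal sketch of that shape follows; the field names come from the description above, but the field types are assumptions for illustration, not copied from the LiteLLM source.

```python
from typing import TypedDict

# Assumed field types for illustration; check the litellm source for the
# authoritative definition of CooldownCacheValue.
class CooldownCacheValue(TypedDict):
    exception_received: str  # masked exception message
    status_code: str         # HTTP status code of the triggering error
    timestamp: float         # epoch seconds when the cooldown began
    cooldown_time: float     # cooldown duration in seconds

entry: CooldownCacheValue = {
    "exception_received": "RateLimitError: quota exceeded",
    "status_code": "429",
    "timestamp": 1700000000.0,
    "cooldown_time": 60.0,
}
print(entry["cooldown_time"])  # 60.0
```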
Usage
The `CooldownCache` is instantiated by the `Router` during initialization and used internally. The handler functions are called from `Router` failure callbacks.
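The overall pattern (count failures per deployment; once the threshold is exceeded, exclude the deployment until its cooldown expires) can be sketched in plain Python. All names below are illustrative stand-ins, not LiteLLM APIs:

```python
# Minimal stand-in for the cooldown pattern: per-deployment failure counts
# plus an expiry-timestamped exclusion list. Illustrative only.
class SimpleCooldownTracker:
    def __init__(self, default_cooldown_time: float, allowed_fails: int):
        self.default_cooldown_time = default_cooldown_time
        self.allowed_fails = allowed_fails
        self._fails: dict[str, int] = {}
        self._cooldowns: dict[str, float] = {}  # model_id -> expiry timestamp

    def record_failure(self, model_id: str, now: float) -> bool:
        """Count a failure; start a cooldown once the threshold is exceeded."""
        self._fails[model_id] = self._fails.get(model_id, 0) + 1
        if self._fails[model_id] > self.allowed_fails:
            self._cooldowns[model_id] = now + self.default_cooldown_time
            return True
        return False

    def active_cooldowns(self, now: float) -> list[str]:
        """Deployment IDs whose cooldown has not yet expired."""
        return [m for m, expiry in self._cooldowns.items() if expiry > now]

tracker = SimpleCooldownTracker(default_cooldown_time=60.0, allowed_fails=3)
for _ in range(4):  # the 4th failure exceeds allowed_fails=3
    tracker.record_failure("deploy-1", now=0.0)

print(tracker.active_cooldowns(now=30.0))   # ['deploy-1']
print(tracker.active_cooldowns(now=120.0))  # []
```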
Code Reference
Source Locations:
- `litellm/router_utils/cooldown_cache.py` (lines 31-192)
- `litellm/router_utils/cooldown_handlers.py` (lines 260-395)
CooldownCache.__init__ Signature:
```python
class CooldownCache:
    def __init__(self, cache: DualCache, default_cooldown_time: float):
        ...
```
CooldownCache.add_deployment_to_cooldown Signature:
```python
def add_deployment_to_cooldown(
    self,
    model_id: str,
    original_exception: Exception,
    exception_status: int,
    cooldown_time: Optional[float],
):
    ...
```
CooldownCache.async_get_active_cooldowns Signature:
```python
async def async_get_active_cooldowns(
    self, model_ids: List[str], parent_otel_span: Optional[Span]
) -> List[Tuple[str, CooldownCacheValue]]:
    ...
```
_set_cooldown_deployments Signature:
```python
def _set_cooldown_deployments(
    litellm_router_instance: LitellmRouter,
    original_exception: Any,
    exception_status: Union[str, int],
    deployment: Optional[str] = None,
    time_to_cooldown: Optional[float] = None,
) -> bool:
    ...
```
_async_get_cooldown_deployments Signature:
```python
async def _async_get_cooldown_deployments(
    litellm_router_instance: LitellmRouter,
    parent_otel_span: Optional[Span],
) -> List[str]:
    ...
```
Import:
```python
from litellm.router_utils.cooldown_cache import CooldownCache, CooldownCacheValue
from litellm.router_utils.cooldown_handlers import (
    _set_cooldown_deployments,
    _async_get_cooldown_deployments,
)
```
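The TTL-based automatic expiry that `CooldownCache` relies on can be illustrated with a minimal in-memory stand-in for `DualCache`. This is a sketch only; the real `DualCache` also layers an optional Redis backend, and the method names below are illustrative.

```python
# Illustrative TTL cache: expired entries read back as missing, which is how
# cooldowns "end" without any explicit cleanup step.
class TTLCache:
    def __init__(self) -> None:
        self._store: dict[str, tuple[dict, float]] = {}  # key -> (value, expiry)

    def set_cache(self, key: str, value: dict, ttl: float, now: float) -> None:
        self._store[key] = (value, now + ttl)

    def get_cache(self, key: str, now: float):
        item = self._store.get(key)
        if item is None or item[1] <= now:
            return None  # expired or never written
        return item[0]

cache = TTLCache()
cache.set_cache("deployment:abc-123:cooldown", {"cooldown_time": 60.0}, ttl=60.0, now=0.0)
print(cache.get_cache("deployment:abc-123:cooldown", now=30.0))  # {'cooldown_time': 60.0}
print(cache.get_cache("deployment:abc-123:cooldown", now=61.0))  # None
```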
I/O Contract
CooldownCache.__init__
| Input Parameter | Type | Required | Description |
|---|---|---|---|
| cache | DualCache | Yes | Dual cache instance (in-memory + optional Redis) for storing cooldown entries |
| default_cooldown_time | float | Yes | Default cooldown duration in seconds when no per-deployment value is specified |
add_deployment_to_cooldown
| Input Parameter | Type | Required | Description |
|---|---|---|---|
| model_id | str | Yes | Unique deployment identifier |
| original_exception | Exception | Yes | The exception that triggered the cooldown |
| exception_status | int | Yes | HTTP status code of the exception |
| cooldown_time | Optional[float] | No | Override cooldown duration; uses default if None |
async_get_active_cooldowns
| Input Parameter | Type | Required | Description |
|---|---|---|---|
| model_ids | List[str] | Yes | List of all deployment model IDs to check |
| parent_otel_span | Optional[Span] | No | OpenTelemetry span for distributed tracing |

| Output | Type | Description |
|---|---|---|
| active_cooldowns | List[Tuple[str, CooldownCacheValue]] | List of (model_id, cooldown_data) tuples for deployments currently in cooldown |
_set_cooldown_deployments
| Output | Type | Description |
|---|---|---|
| result | bool | True if the deployment was placed in cooldown, False otherwise |
_async_get_cooldown_deployments
| Output | Type | Description |
|---|---|---|
| deployment_ids | List[str] | List of deployment IDs currently in cooldown |
Usage Examples
Router configured with cooldown settings:
```python
from litellm import Router
from litellm.types.router import AllowedFailsPolicy

router = Router(
    model_list=model_list,
    cooldown_time=60.0,  # 60-second cooldown
    allowed_fails=3,     # 3 failures before cooldown
)
```
Fine-grained allowed fails policy:
```python
from litellm.types.router import AllowedFailsPolicy

router = Router(
    model_list=model_list,
    cooldown_time=30.0,
    allowed_fails_policy=AllowedFailsPolicy(
        RateLimitErrorAllowedFails=5,
        AuthenticationErrorAllowedFails=0,  # immediate cooldown
        TimeoutErrorAllowedFails=3,
        InternalServerErrorAllowedFails=2,
    ),
)
```
Cooldown cache key format:
```python
from litellm.router_utils.cooldown_cache import CooldownCache

# Keys follow the pattern: "deployment:{model_id}:cooldown"
key = CooldownCache.get_cooldown_cache_key("abc-123-def")
# Returns: "deployment:abc-123-def:cooldown"
```
Retrieving minimum cooldown time for a model group:
```python
# Useful for determining how long to wait before retrying a model group
min_cooldown = router.cooldown_cache.get_min_cooldown(
    model_ids=["deploy-1", "deploy-2", "deploy-3"],
    parent_otel_span=None,
)
print(f"Minimum cooldown remaining: {min_cooldown}s")
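The pre-routing exclusion step described earlier (fetch cooled-down deployment IDs, then filter them out of the candidate list) can be sketched as follows. The function names here are hypothetical stand-ins for LiteLLM internals, not real APIs:

```python
import asyncio

async def get_cooldown_deployment_ids() -> list[str]:
    # Hypothetical stand-in for `_async_get_cooldown_deployments(...)`
    return ["deploy-2"]

async def pick_healthy_deployments(candidates: list[str]) -> list[str]:
    # Exclude any deployment currently in cooldown before routing.
    cooling_down = set(await get_cooldown_deployment_ids())
    return [d for d in candidates if d not in cooling_down]

healthy = asyncio.run(pick_healthy_deployments(["deploy-1", "deploy-2", "deploy-3"]))
print(healthy)  # ['deploy-1', 'deploy-3']
```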