Implementation: BerriAI LiteLLM Cooldown Cache
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| litellm repository | LLM Reliability, Circuit Breaker | 2026-02-15 |
Overview
A concrete tool for monitoring deployment health through cooldown mechanisms, provided by LiteLLM and implemented as the `CooldownCache` class and its companion handler functions.
Description
LiteLLM's health monitoring is built on three components:

- `CooldownCache` -- A wrapper around `DualCache` (in-memory + Redis) that manages cooldown state for deployments. It stores `CooldownCacheValue` entries (exception info, status code, timestamp, cooldown duration) with TTL-based automatic expiry. It provides both sync and async interfaces for adding deployments to cooldown and retrieving active cooldowns.
- `_set_cooldown_deployments` -- A handler function that decides whether a deployment should enter cooldown based on: (1) whether cooldown logic is enabled, (2) whether the deployment has exceeded its allowed failure threshold, or (3) whether the error type warrants immediate cooldown. When cooldown is triggered, it writes to the `CooldownCache` and fires an async alerting callback.
- `_async_get_cooldown_deployments` -- An async function that retrieves all currently cooled-down deployment IDs by batch-reading cooldown cache keys for all known model IDs. It is used by the router to exclude unhealthy deployments before routing.

`CooldownCacheValue` is a `TypedDict` with the fields `exception_received` (masked), `status_code`, `timestamp`, and `cooldown_time`.
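A minimal sketch of that shape follows; the field names come from the description above, but the field types are assumptions for illustration, not copied from the LiteLLM source.

```python
from typing import TypedDict

# Assumed field types for illustration; check the litellm source for the
# authoritative definition of CooldownCacheValue.
class CooldownCacheValue(TypedDict):
    exception_received: str  # masked exception message
    status_code: str         # HTTP status code of the triggering error
    timestamp: float         # epoch seconds when the cooldown began
    cooldown_time: float     # cooldown duration in seconds

entry: CooldownCacheValue = {
    "exception_received": "RateLimitError: quota exceeded",
    "status_code": "429",
    "timestamp": 1700000000.0,
    "cooldown_time": 60.0,
}
print(entry["cooldown_time"])  # 60.0
```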
Usage
The `CooldownCache` is instantiated by the `Router` during initialization and used internally. The handler functions are called from `Router` failure callbacks.
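The overall pattern (count failures per deployment; once the threshold is exceeded, exclude the deployment until its cooldown expires) can be sketched in plain Python. All names below are illustrative stand-ins, not LiteLLM APIs:

```python
# Minimal stand-in for the cooldown pattern: per-deployment failure counts
# plus an expiry-timestamped exclusion list. Illustrative only.
class SimpleCooldownTracker:
    def __init__(self, default_cooldown_time: float, allowed_fails: int):
        self.default_cooldown_time = default_cooldown_time
        self.allowed_fails = allowed_fails
        self._fails: dict[str, int] = {}
        self._cooldowns: dict[str, float] = {}  # model_id -> expiry timestamp

    def record_failure(self, model_id: str, now: float) -> bool:
        """Count a failure; start a cooldown once the threshold is exceeded."""
        self._fails[model_id] = self._fails.get(model_id, 0) + 1
        if self._fails[model_id] > self.allowed_fails:
            self._cooldowns[model_id] = now + self.default_cooldown_time
            return True
        return False

    def active_cooldowns(self, now: float) -> list[str]:
        """Deployment IDs whose cooldown has not yet expired."""
        return [m for m, expiry in self._cooldowns.items() if expiry > now]

tracker = SimpleCooldownTracker(default_cooldown_time=60.0, allowed_fails=3)
for _ in range(4):  # the 4th failure exceeds allowed_fails=3
    tracker.record_failure("deploy-1", now=0.0)

print(tracker.active_cooldowns(now=30.0))   # ['deploy-1']
print(tracker.active_cooldowns(now=120.0))  # []
```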
Code Reference
Source Locations:
- `litellm/router_utils/cooldown_cache.py` (lines 31-192)
- `litellm/router_utils/cooldown_handlers.py` (lines 260-395)
CooldownCache.__init__ Signature:
```python
class CooldownCache:
    def __init__(self, cache: DualCache, default_cooldown_time: float):
        ...
```
CooldownCache.add_deployment_to_cooldown Signature:
```python
def add_deployment_to_cooldown(
    self,
    model_id: str,
    original_exception: Exception,
    exception_status: int,
    cooldown_time: Optional[float],
):
    ...
```
CooldownCache.async_get_active_cooldowns Signature:
```python
async def async_get_active_cooldowns(
    self, model_ids: List[str], parent_otel_span: Optional[Span]
) -> List[Tuple[str, CooldownCacheValue]]:
    ...
```
_set_cooldown_deployments Signature:
```python
def _set_cooldown_deployments(
    litellm_router_instance: LitellmRouter,
    original_exception: Any,
    exception_status: Union[str, int],
    deployment: Optional[str] = None,
    time_to_cooldown: Optional[float] = None,
) -> bool:
    ...
```
_async_get_cooldown_deployments Signature:
```python
async def _async_get_cooldown_deployments(
    litellm_router_instance: LitellmRouter,
    parent_otel_span: Optional[Span],
) -> List[str]:
    ...
```
Import:
```python
from litellm.router_utils.cooldown_cache import CooldownCache, CooldownCacheValue
from litellm.router_utils.cooldown_handlers import (
    _set_cooldown_deployments,
    _async_get_cooldown_deployments,
)
```
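The TTL-based automatic expiry that `CooldownCache` relies on can be illustrated with a minimal in-memory stand-in for `DualCache`. This is a sketch only; the real `DualCache` also layers an optional Redis backend, and the method names below are illustrative.

```python
# Illustrative TTL cache: expired entries read back as missing, which is how
# cooldowns "end" without any explicit cleanup step.
class TTLCache:
    def __init__(self) -> None:
        self._store: dict[str, tuple[dict, float]] = {}  # key -> (value, expiry)

    def set_cache(self, key: str, value: dict, ttl: float, now: float) -> None:
        self._store[key] = (value, now + ttl)

    def get_cache(self, key: str, now: float):
        item = self._store.get(key)
        if item is None or item[1] <= now:
            return None  # expired or never written
        return item[0]

cache = TTLCache()
cache.set_cache("deployment:abc-123:cooldown", {"cooldown_time": 60.0}, ttl=60.0, now=0.0)
print(cache.get_cache("deployment:abc-123:cooldown", now=30.0))  # {'cooldown_time': 60.0}
print(cache.get_cache("deployment:abc-123:cooldown", now=61.0))  # None
```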
I/O Contract
CooldownCache.__init__
| Input Parameter | Type | Required | Description |
|---|---|---|---|
| cache | DualCache | Yes | Dual cache instance (in-memory + optional Redis) for storing cooldown entries |
| default_cooldown_time | float | Yes | Default cooldown duration in seconds when no per-deployment value is specified |
add_deployment_to_cooldown
| Input Parameter | Type | Required | Description |
|---|---|---|---|
| model_id | str | Yes | Unique deployment identifier |
| original_exception | Exception | Yes | The exception that triggered the cooldown |
| exception_status | int | Yes | HTTP status code of the exception |
| cooldown_time | Optional[float] | No | Override cooldown duration; uses default if None |
async_get_active_cooldowns
| Input Parameter | Type | Required | Description |
|---|---|---|---|
| model_ids | List[str] | Yes | List of all deployment model IDs to check |
| parent_otel_span | Optional[Span] | No | OpenTelemetry span for distributed tracing |

| Output | Type | Description |
|---|---|---|
| active_cooldowns | List[Tuple[str, CooldownCacheValue]] | List of (model_id, cooldown_data) tuples for deployments currently in cooldown |
_set_cooldown_deployments
| Output | Type | Description |
|---|---|---|
| result | bool | True if the deployment was placed in cooldown, False otherwise |
_async_get_cooldown_deployments
| Output | Type | Description |
|---|---|---|
| deployment_ids | List[str] | List of deployment IDs currently in cooldown |
Usage Examples
Router configured with cooldown settings:
```python
from litellm import Router
from litellm.types.router import AllowedFailsPolicy

router = Router(
    model_list=model_list,
    cooldown_time=60.0,  # 60-second cooldown
    allowed_fails=3,     # 3 failures before cooldown
)
```
Fine-grained allowed fails policy:
```python
from litellm.types.router import AllowedFailsPolicy

router = Router(
    model_list=model_list,
    cooldown_time=30.0,
    allowed_fails_policy=AllowedFailsPolicy(
        RateLimitErrorAllowedFails=5,
        AuthenticationErrorAllowedFails=0,  # immediate cooldown
        TimeoutErrorAllowedFails=3,
        InternalServerErrorAllowedFails=2,
    ),
)
```
Cooldown cache key format:
```python
from litellm.router_utils.cooldown_cache import CooldownCache

# Keys follow the pattern: "deployment:{model_id}:cooldown"
key = CooldownCache.get_cooldown_cache_key("abc-123-def")
# Returns: "deployment:abc-123-def:cooldown"
```
Retrieving minimum cooldown time for a model group:
```python
# Useful for determining how long to wait before retrying a model group
min_cooldown = router.cooldown_cache.get_min_cooldown(
    model_ids=["deploy-1", "deploy-2", "deploy-3"],
    parent_otel_span=None,
)
print(f"Minimum cooldown remaining: {min_cooldown}s")
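The pre-routing exclusion step described earlier (fetch cooled-down deployment IDs, then filter them out of the candidate list) can be sketched as follows. The function names here are hypothetical stand-ins for LiteLLM internals, not real APIs:

```python
import asyncio

async def get_cooldown_deployment_ids() -> list[str]:
    # Hypothetical stand-in for `_async_get_cooldown_deployments(...)`
    return ["deploy-2"]

async def pick_healthy_deployments(candidates: list[str]) -> list[str]:
    # Exclude any deployment currently in cooldown before routing.
    cooling_down = set(await get_cooldown_deployment_ids())
    return [d for d in candidates if d not in cooling_down]

healthy = asyncio.run(pick_healthy_deployments(["deploy-1", "deploy-2", "deploy-3"]))
print(healthy)  # ['deploy-1', 'deploy-3']
```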