Implementation:BerriAI Litellm Cooldown Cache

From Leeroopedia
Knowledge Sources: litellm repository
Domains: LLM Reliability, Circuit Breaker
Last Updated: 2026-02-15

Overview

A concrete tool for monitoring deployment health through a cooldown mechanism in LiteLLM, implemented as the CooldownCache class and its companion handler functions.

Description

LiteLLM's health monitoring is built on three components:

  • CooldownCache -- A wrapper around DualCache (in-memory + Redis) that manages cooldown state for deployments. It stores CooldownCacheValue entries (exception info, status code, timestamp, cooldown duration) with TTL-based automatic expiry. It provides both sync and async interfaces for adding deployments to cooldown and retrieving active cooldowns.
  • _set_cooldown_deployments -- A handler function that decides whether a deployment should enter cooldown based on: (1) whether cooldown logic is enabled, (2) whether the deployment has exceeded its allowed failure threshold, or (3) whether the error type warrants immediate cooldown. When cooldown is triggered, it writes to the CooldownCache and fires an async alerting callback.
  • _async_get_cooldown_deployments -- An async function that retrieves all currently cooled-down deployment IDs by batch-reading cooldown cache keys for all known model IDs. Used by the router to exclude unhealthy deployments before routing.
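
The three-way decision in _set_cooldown_deployments can be sketched roughly as follows. This is a hedged illustration: the function name, parameter names, and the set of "immediate cooldown" status codes below are assumptions for exposition, not LiteLLM's actual internals.

```python
# Hedged sketch of the decision flow described above (illustrative only).
def should_cooldown(
    cooldown_enabled: bool,
    fail_count: int,
    allowed_fails: int,
    status_code: int,
) -> bool:
    if not cooldown_enabled:
        # (1) cooldown logic disabled -> never place the deployment in cooldown
        return False
    if status_code in (401, 404):
        # (3) error types that warrant immediate cooldown (assumed example set)
        return True
    # (2) allowed failure threshold exceeded
    return fail_count >= allowed_fails
```

When cooldown is triggered, the real handler additionally writes a CooldownCacheValue entry and fires the async alerting callback.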

The CooldownCacheValue is a TypedDict with fields: exception_received (masked), status_code, timestamp, and cooldown_time.
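
As a sketch, the shape of CooldownCacheValue is consistent with the following; the field names come from the description above, while the exact value types are assumptions:

```python
import time
from typing import TypedDict

class CooldownCacheValue(TypedDict):
    # Field names per the description above; types are assumed for illustration
    exception_received: str  # masked exception text
    status_code: str         # status code of the triggering error
    timestamp: float         # epoch seconds when the cooldown began
    cooldown_time: float     # cooldown duration in seconds

entry: CooldownCacheValue = {
    "exception_received": "RateLimitError (masked)",
    "status_code": "429",
    "timestamp": time.time(),
    "cooldown_time": 60.0,
}
```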

Usage

The CooldownCache is instantiated by the Router during initialization and used internally. The handler functions are called from Router failure callbacks.
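
For intuition, the add / TTL-expiry / read cycle can be modeled with a toy in-memory stand-in. This is not litellm code; it only emulates how a TTL-backed cache makes cooldown entries disappear automatically:

```python
import time
from typing import Dict, List, Optional, Tuple

class MiniCooldownCache:
    """Toy stand-in for CooldownCache (illustration only, not litellm code).
    Emulates DualCache's TTL expiry with explicit expiry timestamps."""

    def __init__(self, default_cooldown_time: float):
        self.default_cooldown_time = default_cooldown_time
        self._store: Dict[str, Tuple[dict, float]] = {}  # key -> (value, expires_at)

    @staticmethod
    def get_cooldown_cache_key(model_id: str) -> str:
        return f"deployment:{model_id}:cooldown"

    def add_deployment_to_cooldown(
        self,
        model_id: str,
        exception_status: int,
        cooldown_time: Optional[float] = None,
    ) -> None:
        # Fall back to the default duration when no override is given
        ttl = cooldown_time if cooldown_time is not None else self.default_cooldown_time
        value = {
            "status_code": exception_status,
            "timestamp": time.time(),
            "cooldown_time": ttl,
        }
        self._store[self.get_cooldown_cache_key(model_id)] = (value, time.time() + ttl)

    def get_active_cooldowns(self, model_ids: List[str]) -> List[Tuple[str, dict]]:
        now = time.time()
        active = []
        for mid in model_ids:
            hit = self._store.get(self.get_cooldown_cache_key(mid))
            if hit is not None and hit[1] > now:  # entry exists and TTL not expired
                active.append((mid, hit[0]))
        return active

cache = MiniCooldownCache(default_cooldown_time=60.0)
cache.add_deployment_to_cooldown("deploy-1", exception_status=429)
active = cache.get_active_cooldowns(["deploy-1", "deploy-2"])
```

Here "deploy-1" appears in the active list while "deploy-2", which never failed, does not; once the TTL elapses, the entry expires and the deployment is routable again.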

Code Reference

Source Locations:

  • litellm/router_utils/cooldown_cache.py (lines 31-192)
  • litellm/router_utils/cooldown_handlers.py (lines 260-395)

CooldownCache.__init__ Signature:

class CooldownCache:
    def __init__(self, cache: DualCache, default_cooldown_time: float):

CooldownCache.add_deployment_to_cooldown Signature:

def add_deployment_to_cooldown(
    self,
    model_id: str,
    original_exception: Exception,
    exception_status: int,
    cooldown_time: Optional[float],
):

CooldownCache.async_get_active_cooldowns Signature:

async def async_get_active_cooldowns(
    self, model_ids: List[str], parent_otel_span: Optional[Span]
) -> List[Tuple[str, CooldownCacheValue]]:

_set_cooldown_deployments Signature:

def _set_cooldown_deployments(
    litellm_router_instance: LitellmRouter,
    original_exception: Any,
    exception_status: Union[str, int],
    deployment: Optional[str] = None,
    time_to_cooldown: Optional[float] = None,
) -> bool:

_async_get_cooldown_deployments Signature:

async def _async_get_cooldown_deployments(
    litellm_router_instance: LitellmRouter,
    parent_otel_span: Optional[Span],
) -> List[str]:

Import:

from litellm.router_utils.cooldown_cache import CooldownCache, CooldownCacheValue
from litellm.router_utils.cooldown_handlers import (
    _set_cooldown_deployments,
    _async_get_cooldown_deployments,
)

I/O Contract

CooldownCache.__init__

  • cache (DualCache, required) -- Dual cache instance (in-memory + optional Redis) for storing cooldown entries
  • default_cooldown_time (float, required) -- Default cooldown duration in seconds when no per-deployment value is specified

add_deployment_to_cooldown

  • model_id (str, required) -- Unique deployment identifier
  • original_exception (Exception, required) -- The exception that triggered the cooldown
  • exception_status (int, required) -- HTTP status code of the exception
  • cooldown_time (Optional[float], optional) -- Override cooldown duration; uses the default if None

async_get_active_cooldowns

  • model_ids (List[str], required) -- List of all deployment model IDs to check
  • parent_otel_span (Optional[Span], optional) -- OpenTelemetry span for distributed tracing
  • Returns (List[Tuple[str, CooldownCacheValue]]) -- List of (model_id, cooldown_data) tuples for deployments currently in cooldown

_set_cooldown_deployments

  • Returns (bool) -- True if the deployment was placed in cooldown, False otherwise

_async_get_cooldown_deployments

  • Returns (List[str]) -- List of deployment IDs currently in cooldown
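
The router uses this ID list to exclude unhealthy deployments before routing. Schematically, with simplified deployment dicts (the shape shown is an assumption for illustration):

```python
# Filter step: drop deployments whose IDs appear in the cooldown list
# that _async_get_cooldown_deployments would return.
deployments = [
    {"model_info": {"id": "deploy-1"}},
    {"model_info": {"id": "deploy-2"}},
]
cooldown_ids = ["deploy-1"]  # would come from _async_get_cooldown_deployments
healthy = [d for d in deployments if d["model_info"]["id"] not in cooldown_ids]
```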

Usage Examples

Router configured with cooldown settings:

from litellm import Router
from litellm.types.router import AllowedFailsPolicy

router = Router(
    model_list=model_list,
    cooldown_time=60.0,     # 60 seconds cooldown
    allowed_fails=3,        # 3 failures before cooldown
)

Fine-grained allowed fails policy:

router = Router(
    model_list=model_list,
    cooldown_time=30.0,
    allowed_fails_policy=AllowedFailsPolicy(
        RateLimitErrorAllowedFails=5,
        AuthenticationErrorAllowedFails=0,   # immediate cooldown
        TimeoutErrorAllowedFails=3,
        InternalServerErrorAllowedFails=2,
    ),
)

Cooldown cache key format:

# Keys follow the pattern: "deployment:{model_id}:cooldown"
# Example:
from litellm.router_utils.cooldown_cache import CooldownCache

key = CooldownCache.get_cooldown_cache_key("abc-123-def")
# Returns: "deployment:abc-123-def:cooldown"

Retrieving minimum cooldown time for a model group:

# Useful for determining how long to wait before retrying a model group
min_cooldown = router.cooldown_cache.get_min_cooldown(
    model_ids=["deploy-1", "deploy-2", "deploy-3"],
    parent_otel_span=None,
)
print(f"Minimum cooldown remaining: {min_cooldown}s")
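
The minimum remaining cooldown can be derived from each entry's timestamp and cooldown_time fields. A toy computation mirroring the idea (illustration only, not litellm's implementation):

```python
import time

# Minimum remaining cooldown across active entries: each entry ends at
# timestamp + cooldown_time, so the remainder is that end time minus now.
now = time.time()
entries = [
    {"timestamp": now - 10.0, "cooldown_time": 60.0},  # ~50s remaining
    {"timestamp": now - 25.0, "cooldown_time": 30.0},  # ~5s remaining
]
remaining = [e["timestamp"] + e["cooldown_time"] - now for e in entries]
min_cooldown = min(remaining)
```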
