Principle: BerriAI LiteLLM Health Monitoring
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| litellm/router_utils/cooldown_cache.py, litellm/router_utils/cooldown_handlers.py | LLM Reliability, Circuit Breaker | 2026-02-15 |
Overview
Health monitoring is the practice of tracking deployment failure states and temporarily removing unhealthy endpoints from the routing pool through time-bounded cooldown periods.
Description
In a multi-deployment routing system, individual endpoints can experience transient failures (rate limits, timeouts) or persistent failures (authentication errors, misconfigurations). Health monitoring addresses this by implementing a cooldown mechanism:
- Failure counting -- Each deployment tracks its recent failure count. When a deployment exceeds a configurable allowed-fails threshold within a time window, it is placed into cooldown.
- Immediate cooldown -- Certain error types (e.g., authentication errors returning HTTP 401) trigger immediate cooldown without waiting for the threshold, since retrying is futile.
- Cooldown state -- A cooled-down deployment is stored in a cache (in-memory + optional Redis) with a TTL equal to the cooldown duration. The cache entry records the exception received, HTTP status code, timestamp, and cooldown duration.
- Exclusion from routing -- Before selecting a deployment, the router queries the cooldown cache to get the list of currently cooled-down deployment IDs. These are excluded from the candidate pool.
- Automatic recovery -- When the TTL expires, the cache entry is evicted, and the deployment is automatically returned to the healthy pool. No explicit "recovery" action is needed.
- Allowed fails policy -- Fine-grained control over how many failures of each exception type are tolerated before cooldown. For example, you might allow 1000 authentication errors (for a shared deployment) but only 3 rate limit errors.
This pattern is analogous to a circuit breaker in microservice architecture, but adapted for the LLM gateway context where the "breaker" operates per-deployment and resets automatically via cache TTL.
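The mechanism described above can be sketched as a minimal in-memory cooldown store. This is an illustrative sketch, not LiteLLM's actual classes: the class and method names are invented here, and a real deployment would back this with Redis for cross-instance state.

```python
import time


class CooldownCache:
    """Minimal in-memory sketch of a TTL-based cooldown store (illustrative only)."""

    def __init__(self, default_cooldown_time: float = 60.0):
        self.default_cooldown_time = default_cooldown_time
        self._store = {}  # deployment_id -> (cooldown_data, expiry_timestamp)

    def add_deployment_to_cooldown(self, deployment_id, exception, status_code,
                                   cooldown_time=None):
        """Record why a deployment was cooled down, with a TTL-bounded entry."""
        ttl = cooldown_time if cooldown_time is not None else self.default_cooldown_time
        data = {
            "exception_received": str(exception),
            "status_code": str(status_code),
            "timestamp": time.time(),
            "cooldown_time": ttl,
        }
        self._store[deployment_id] = (data, time.time() + ttl)

    def get_active_cooldowns(self, model_ids):
        """Return IDs currently cooled down; expired entries are evicted lazily."""
        now = time.time()
        active = []
        for model_id in model_ids:
            entry = self._store.get(model_id)
            if entry is None:
                continue
            _data, expiry = entry
            if now >= expiry:
                del self._store[model_id]  # TTL elapsed: automatic recovery
            else:
                active.append(model_id)
        return active
```

The router would call `get_active_cooldowns` before deployment selection and subtract the returned IDs from the candidate pool; recovery requires no explicit action because expiry alone restores eligibility.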
Usage
Use health monitoring when:
- You have multiple deployments and want to stop sending traffic to endpoints that are currently failing.
- You need to prevent cascading failures where repeated calls to a broken endpoint consume retry budget.
- You want different cooldown behavior for different error types (e.g., longer cooldown for auth errors, shorter for rate limits).
- You are running a distributed proxy and need cross-instance cooldown state via Redis.
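The per-error-type tolerance described in the usage points above could be represented as a mapping from exception type name to allowed failure count. The policy values and function name below are hypothetical, chosen to mirror the example in the Description (many auth failures tolerated on a shared deployment, few rate-limit failures):

```python
# Hypothetical allowed-fails policy: tolerate many auth failures on a shared
# deployment, but cool down quickly on repeated rate limits.
ALLOWED_FAILS_POLICY = {
    "AuthenticationError": 1000,
    "RateLimitError": 3,
    "TimeoutError": 5,
}
DEFAULT_ALLOWED_FAILS = 3


def get_allowed_fails(exception: Exception) -> int:
    """Look up the tolerated failure count for an exception's type name."""
    return ALLOWED_FAILS_POLICY.get(type(exception).__name__, DEFAULT_ALLOWED_FAILS)
```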
Theoretical Basis
The cooldown mechanism implements a TTL-based circuit breaker pattern.
Pseudocode for setting cooldown:
FUNCTION set_cooldown_deployment(router, exception, status_code, deployment_id, cooldown_time):
    // Guard: skip if cooldowns are disabled or deployment is not eligible
    IF NOT should_run_cooldown_logic(router, deployment_id, status_code):
        RETURN False
    // Determine if this deployment should enter cooldown
    IF should_cooldown_deployment(router, deployment_id, status_code, exception):
        // Compute cache key and cooldown data
        key = "deployment:" + deployment_id + ":cooldown"
        data = {
            exception_received: mask_sensitive_data(str(exception)),
            status_code: str(status_code),
            timestamp: current_time(),
            cooldown_time: cooldown_time OR default_cooldown_time,
        }
        // Store in dual cache with TTL equal to the effective cooldown duration
        // (use data.cooldown_time so the default applies when cooldown_time is unset)
        cache.set(key=key, value=data, ttl=data.cooldown_time)
        // Fire cooldown callback for alerting
        ASYNC fire_cooldown_event(router, deployment_id, status_code, cooldown_time)
        RETURN True
    RETURN False

FUNCTION should_cooldown_deployment(router, deployment_id, status_code, exception):
    // Immediate cooldown for non-retryable errors (e.g., 401 Auth)
    IF status_code indicates non-retryable error:
        RETURN True
    // Threshold-based cooldown
    fail_count = router.failed_calls_cache.get(deployment_id, 0)
    allowed = get_allowed_fails(router, exception)
    IF fail_count >= allowed:
        RETURN True
    RETURN False
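The decision logic above can be rendered directly in Python. This is a self-contained sketch: the non-retryable status set, the plain-dict cache, and the explicit `fail_count`/`allowed_fails` parameters are simplifications of the router state the pseudocode references.

```python
import time

# Assumption for this sketch: auth (401) and not-found (404) are futile to retry.
NON_RETRYABLE_STATUS_CODES = {401, 404}


def should_cooldown_deployment(fail_count: int, status_code: int,
                               allowed_fails: int) -> bool:
    """Cool down immediately on non-retryable errors, else by failure threshold."""
    if status_code in NON_RETRYABLE_STATUS_CODES:
        return True
    return fail_count >= allowed_fails


def set_cooldown_deployment(cache: dict, deployment_id: str, exception: Exception,
                            status_code: int, fail_count: int, allowed_fails: int,
                            cooldown_time: float) -> bool:
    """Store cooldown state keyed by deployment; returns True if cooldown was set."""
    if not should_cooldown_deployment(fail_count, status_code, allowed_fails):
        return False
    cache[f"deployment:{deployment_id}:cooldown"] = {
        "exception_received": str(exception),
        "status_code": str(status_code),
        "timestamp": time.time(),
        "cooldown_time": cooldown_time,
    }
    return True
```

A real implementation would also attach a TTL to the cache write and mask sensitive data in the stored exception string, as the pseudocode notes.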
Pseudocode for retrieving cooldown state:
ASYNC FUNCTION get_cooldown_deployments(router):
    model_ids = router.get_all_model_ids()
    keys = ["deployment:" + id + ":cooldown" FOR id IN model_ids]
    // Batch read from cache (checks in-memory first, then Redis)
    results = AWAIT cache.batch_get(keys)
    cooled_down_ids = []
    FOR EACH (model_id, result) IN zip(model_ids, results):
        IF result IS NOT None:
            cooled_down_ids.append(model_id)
    RETURN cooled_down_ids
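The retrieval step above can be sketched with asyncio. Here a plain dict stands in for the dual in-memory/Redis cache, and `batch_get` is a hypothetical helper; a production version would await a real batched cache client call.

```python
import asyncio


async def batch_get(cache: dict, keys: list) -> list:
    """Stand-in for a dual-cache batch read; real code would also check Redis."""
    return [cache.get(k) for k in keys]


async def get_cooldown_deployments(cache: dict, model_ids: list) -> list:
    """Return IDs of deployments that currently have a live cooldown entry."""
    keys = [f"deployment:{mid}:cooldown" for mid in model_ids]
    results = await batch_get(cache, keys)
    return [mid for mid, result in zip(model_ids, results) if result is not None]
```

A `None` result means no live entry, either because the deployment never failed or because its TTL already expired and the entry was evicted.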
The cache TTL serves as the automatic recovery mechanism: once the cooldown period elapses, the cache entry is evicted and the deployment becomes eligible for routing again. This avoids the need for explicit health check probes.