Principle: BerriAI LiteLLM Health Monitoring
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| litellm/router_utils/cooldown_cache.py, litellm/router_utils/cooldown_handlers.py | LLM Reliability, Circuit Breaker | 2026-02-15 |
Overview
Health monitoring is the practice of tracking deployment failure states and temporarily removing unhealthy endpoints from the routing pool through time-bounded cooldown periods.
Description
In a multi-deployment routing system, individual endpoints can experience transient failures (rate limits, timeouts) or persistent failures (authentication errors, misconfigurations). Health monitoring addresses this by implementing a cooldown mechanism:
- Failure counting -- Each deployment tracks its recent failure count. When a deployment exceeds a configurable allowed-fails threshold within a time window, it is placed into cooldown.
- Immediate cooldown -- Certain error types (e.g., authentication errors returning HTTP 401) trigger immediate cooldown without waiting for the threshold, since retrying is futile.
- Cooldown state -- A cooled-down deployment is stored in a cache (in-memory + optional Redis) with a TTL equal to the cooldown duration. The cache entry records the exception received, HTTP status code, timestamp, and cooldown duration.
- Exclusion from routing -- Before selecting a deployment, the router queries the cooldown cache to get the list of currently cooled-down deployment IDs. These are excluded from the candidate pool.
- Automatic recovery -- When the TTL expires, the cache entry is evicted, and the deployment is automatically returned to the healthy pool. No explicit "recovery" action is needed.
- Allowed fails policy -- Fine-grained control over how many failures of each exception type are tolerated before cooldown. For example, you might allow 1000 authentication errors (for a shared deployment) but only 3 rate limit errors.
This pattern is analogous to a circuit breaker in microservice architecture, but adapted for the LLM gateway context where the "breaker" operates per-deployment and resets automatically via cache TTL.
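The mechanism described above can be sketched as a minimal in-memory cooldown store. This is an illustrative sketch, not LiteLLM's actual classes: the class and method names are invented here, and a real deployment would back this with Redis for cross-instance state.

```python
import time


class CooldownCache:
    """Minimal in-memory sketch of a TTL-based cooldown store (illustrative only)."""

    def __init__(self, default_cooldown_time: float = 60.0):
        self.default_cooldown_time = default_cooldown_time
        self._store = {}  # deployment_id -> (cooldown_data, expiry_timestamp)

    def add_deployment_to_cooldown(self, deployment_id, exception, status_code,
                                   cooldown_time=None):
        """Record why a deployment was cooled down, with a TTL-bounded entry."""
        ttl = cooldown_time if cooldown_time is not None else self.default_cooldown_time
        data = {
            "exception_received": str(exception),
            "status_code": str(status_code),
            "timestamp": time.time(),
            "cooldown_time": ttl,
        }
        self._store[deployment_id] = (data, time.time() + ttl)

    def get_active_cooldowns(self, model_ids):
        """Return IDs currently cooled down; expired entries are evicted lazily."""
        now = time.time()
        active = []
        for model_id in model_ids:
            entry = self._store.get(model_id)
            if entry is None:
                continue
            _data, expiry = entry
            if now >= expiry:
                del self._store[model_id]  # TTL elapsed: automatic recovery
            else:
                active.append(model_id)
        return active
```

The router would call `get_active_cooldowns` before deployment selection and subtract the returned IDs from the candidate pool; recovery requires no explicit action because expiry alone restores eligibility.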
Usage
Use health monitoring when:
- You have multiple deployments and want to stop sending traffic to endpoints that are currently failing.
- You need to prevent cascading failures where repeated calls to a broken endpoint consume retry budget.
- You want different cooldown behavior for different error types (e.g., longer cooldown for auth errors, shorter for rate limits).
- You are running a distributed proxy and need cross-instance cooldown state via Redis.
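The per-error-type tolerance described in the usage points above could be represented as a mapping from exception type name to allowed failure count. The policy values and function name below are hypothetical, chosen to mirror the example in the Description (many auth failures tolerated on a shared deployment, few rate-limit failures):

```python
# Hypothetical allowed-fails policy: tolerate many auth failures on a shared
# deployment, but cool down quickly on repeated rate limits.
ALLOWED_FAILS_POLICY = {
    "AuthenticationError": 1000,
    "RateLimitError": 3,
    "TimeoutError": 5,
}
DEFAULT_ALLOWED_FAILS = 3


def get_allowed_fails(exception: Exception) -> int:
    """Look up the tolerated failure count for an exception's type name."""
    return ALLOWED_FAILS_POLICY.get(type(exception).__name__, DEFAULT_ALLOWED_FAILS)
```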
Theoretical Basis
The cooldown mechanism implements a TTL-based circuit breaker pattern.
Pseudocode for setting cooldown:
FUNCTION set_cooldown_deployment(router, exception, status_code, deployment_id, cooldown_time):
    // Guard: skip if cooldowns are disabled or deployment is not eligible
    IF NOT should_run_cooldown_logic(router, deployment_id, status_code):
        RETURN False
    // Determine if this deployment should enter cooldown
    IF should_cooldown_deployment(router, deployment_id, status_code, exception):
        // Compute cache key and cooldown data
        key = "deployment:" + deployment_id + ":cooldown"
        data = {
            exception_received: mask_sensitive_data(str(exception)),
            status_code: str(status_code),
            timestamp: current_time(),
            cooldown_time: cooldown_time OR default_cooldown_time,
        }
        // Store in dual cache with TTL equal to the effective cooldown duration
        // (use data.cooldown_time so the default applies when cooldown_time is unset)
        cache.set(key=key, value=data, ttl=data.cooldown_time)
        // Fire cooldown callback for alerting
        ASYNC fire_cooldown_event(router, deployment_id, status_code, cooldown_time)
        RETURN True
    RETURN False

FUNCTION should_cooldown_deployment(router, deployment_id, status_code, exception):
    // Immediate cooldown for non-retryable errors (e.g., 401 Auth)
    IF status_code indicates non-retryable error:
        RETURN True
    // Threshold-based cooldown
    fail_count = router.failed_calls_cache.get(deployment_id, 0)
    allowed = get_allowed_fails(router, exception)
    IF fail_count >= allowed:
        RETURN True
    RETURN False
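The decision logic above can be rendered directly in Python. This is a self-contained sketch: the non-retryable status set, the plain-dict cache, and the explicit `fail_count`/`allowed_fails` parameters are simplifications of the router state the pseudocode references.

```python
import time

# Assumption for this sketch: auth (401) and not-found (404) are futile to retry.
NON_RETRYABLE_STATUS_CODES = {401, 404}


def should_cooldown_deployment(fail_count: int, status_code: int,
                               allowed_fails: int) -> bool:
    """Cool down immediately on non-retryable errors, else by failure threshold."""
    if status_code in NON_RETRYABLE_STATUS_CODES:
        return True
    return fail_count >= allowed_fails


def set_cooldown_deployment(cache: dict, deployment_id: str, exception: Exception,
                            status_code: int, fail_count: int, allowed_fails: int,
                            cooldown_time: float) -> bool:
    """Store cooldown state keyed by deployment; returns True if cooldown was set."""
    if not should_cooldown_deployment(fail_count, status_code, allowed_fails):
        return False
    cache[f"deployment:{deployment_id}:cooldown"] = {
        "exception_received": str(exception),
        "status_code": str(status_code),
        "timestamp": time.time(),
        "cooldown_time": cooldown_time,
    }
    return True
```

A real implementation would also attach a TTL to the cache write and mask sensitive data in the stored exception string, as the pseudocode notes.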
Pseudocode for retrieving cooldown state:
ASYNC FUNCTION get_cooldown_deployments(router):
    model_ids = router.get_all_model_ids()
    keys = ["deployment:" + id + ":cooldown" FOR id IN model_ids]
    // Batch read from cache (checks in-memory first, then Redis)
    results = AWAIT cache.batch_get(keys)
    cooled_down_ids = []
    FOR EACH (model_id, result) IN zip(model_ids, results):
        IF result IS NOT None:
            cooled_down_ids.append(model_id)
    RETURN cooled_down_ids
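The retrieval step above can be sketched with asyncio. Here a plain dict stands in for the dual in-memory/Redis cache, and `batch_get` is a hypothetical helper; a production version would await a real batched cache client call.

```python
import asyncio


async def batch_get(cache: dict, keys: list) -> list:
    """Stand-in for a dual-cache batch read; real code would also check Redis."""
    return [cache.get(k) for k in keys]


async def get_cooldown_deployments(cache: dict, model_ids: list) -> list:
    """Return IDs of deployments that currently have a live cooldown entry."""
    keys = [f"deployment:{mid}:cooldown" for mid in model_ids]
    results = await batch_get(cache, keys)
    return [mid for mid, result in zip(model_ids, results) if result is not None]
```

A `None` result means no live entry, either because the deployment never failed or because its TTL already expired and the entry was evicted.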
The cache TTL serves as the automatic recovery mechanism: once the cooldown period elapses, the cache entry is evicted and the deployment becomes eligible for routing again. This avoids the need for explicit health check probes.