# Principle: BerriAI LiteLLM Monitoring Operations
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| BerriAI/litellm repository | Observability, Health Checking, Metrics, Production Operations | 2026-02-15 |
## Overview
Monitoring proxy health, model availability, and operational metrics for production LLM gateway operations.
## Description
Monitoring and operations encompasses the practices and mechanisms for ensuring that an LLM proxy gateway is healthy, performant, and observable in production. An LLM proxy sits at the critical path of all AI-powered features in an organization, making its operational visibility essential. This domain covers three interconnected concerns:
- Health checking -- Actively probing the availability and responsiveness of configured LLM model deployments. Health checks send lightweight test requests to each model endpoint and classify deployments as healthy or unhealthy based on response success or failure within a timeout window.
- Metrics collection -- Gathering quantitative measurements about proxy behavior: request counts, latency distributions, token throughput, error rates, spend accumulation, and rate limit consumption. These metrics are typically exposed in Prometheus format for scraping by monitoring infrastructure.
- Operational alerting -- Surfacing anomalies such as budget thresholds being approached, models going unhealthy, elevated error rates, or spend exceeding soft limits.
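As a sketch of the alerting concern, a budget check might compare accumulated spend against a configured limit and report when a soft threshold is crossed. All names here are illustrative, not LiteLLM's actual internals:

```python
# Hypothetical budget-threshold alert check; not LiteLLM's actual API.

def check_budget_alert(spend: float, max_budget: float, threshold: float = 0.85):
    """Return an alert message when spend crosses `threshold` of the budget."""
    if max_budget <= 0:
        return None  # no budget configured, nothing to alert on
    ratio = spend / max_budget
    if ratio >= 1.0:
        return f"budget exceeded: ${spend:.2f} of ${max_budget:.2f}"
    if ratio >= threshold:
        return f"budget at {ratio:.0%}: ${spend:.2f} of ${max_budget:.2f}"
    return None
```

In practice the returned message would be routed to a notification channel (Slack, email, webhook) rather than returned to the caller.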
Together, these capabilities enable operators to:
- Detect and respond to LLM provider outages before they impact end users.
- Understand cost trends and optimize model selection for budget efficiency.
- Enforce SLAs by monitoring latency percentiles and availability metrics.
- Plan capacity by observing token throughput and request volumes over time.
## Usage
Use monitoring and operations when:
- Running the proxy in production where availability and performance matter.
- Managing multiple LLM provider deployments where individual provider outages must be detected automatically.
- Operating under cost constraints where spend tracking and budget alerting are required.
- Integrating the proxy into existing observability stacks (Prometheus, Grafana, Datadog, etc.).
- Implementing load balancing strategies that require real-time model health status.
- Meeting SLA requirements that demand quantified latency and availability metrics.
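In the LiteLLM proxy, these capabilities are typically enabled through the proxy config file. A minimal sketch is shown below; option names should be verified against the current LiteLLM documentation for your version:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      mode: chat                     # tells the health checker which probe to use

litellm_settings:
  success_callback: ["prometheus"]   # expose metrics for scraping

general_settings:
  background_health_checks: true     # probe deployments periodically
  health_check_interval: 300         # seconds between background probes
```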
## Theoretical Basis
Monitoring an LLM proxy combines active health probing with passive metric collection to build a comprehensive picture of system health.
Health Check Model:
```
FUNCTION perform_health_check(model_list, target_model, timeout):
    -- Filter deployments to the requested model, if one was specified
    IF target_model IS SET THEN
        model_list = FILTER(model_list, WHERE model == target_model)

    -- Deduplicate by deployment ID
    model_list = DEDUPLICATE_BY_ID(model_list)

    healthy = []
    unhealthy = []

    -- Probe each deployment concurrently
    tasks = []
    FOR EACH deployment IN model_list:
        params = deployment.litellm_params
        mode = deployment.model_info.mode
        task = WITH_TIMEOUT(
            health_check_call(params, mode),
            timeout = deployment.health_check_timeout OR DEFAULT_TIMEOUT
        )
        tasks.APPEND(task)

    results = AWAIT_ALL(tasks)

    FOR EACH (result, deployment) IN ZIP(results, model_list):
        IF result IS success AND "error" NOT IN result THEN
            healthy.APPEND(clean_display_data(deployment, result))
        ELSE
            unhealthy.APPEND(clean_display_data(deployment, result))

    RETURN (healthy, unhealthy)
```
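The probe loop above maps naturally onto `asyncio`. A minimal self-contained sketch follows; the `probe` callable stands in for the real per-provider health-check request, and the dict field names are illustrative:

```python
import asyncio

DEFAULT_TIMEOUT = 5.0  # seconds

async def perform_health_check(deployments, probe, target_model=None):
    """Probe each deployment concurrently; classify as healthy/unhealthy.

    `deployments` is a list of dicts; `probe(deployment)` is an async callable
    standing in for the real per-provider health-check call.
    """
    if target_model is not None:
        deployments = [d for d in deployments if d["model_name"] == target_model]

    # Deduplicate by deployment ID, keeping the first occurrence
    seen, unique = set(), []
    for d in deployments:
        if d["id"] not in seen:
            seen.add(d["id"])
            unique.append(d)

    async def checked(d):
        timeout = d.get("health_check_timeout", DEFAULT_TIMEOUT)
        try:
            await asyncio.wait_for(probe(d), timeout=timeout)
            return True
        except Exception:  # timeout or provider error -> unhealthy
            return False

    # gather() runs all probes concurrently, so one slow endpoint
    # does not serialize the whole health check
    results = await asyncio.gather(*(checked(d) for d in unique))
    healthy = [d for d, ok in zip(unique, results) if ok]
    unhealthy = [d for d, ok in zip(unique, results) if not ok]
    return healthy, unhealthy
```

Per-deployment timeouts are applied inside each task, so a single unresponsive provider is classified unhealthy without delaying the rest.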
Metrics Collection Model (Prometheus pattern):
```
METRICS:
    -- Request counters
    proxy_total_requests:  Counter   (labels: model, key, team, status)
    proxy_failed_requests: Counter   (labels: model, key, team, error_type)

    -- Latency distributions
    request_total_latency: Histogram (labels: model, key, team)
    llm_api_latency:       Histogram (labels: model, key, team)
    time_to_first_token:   Histogram (labels: model, key, team)

    -- Cost tracking
    spend_metric: Counter (labels: model, key, team)
    total_tokens: Counter (labels: model, key, team)

    -- Budget gauges
    remaining_budget: Gauge (labels: key, team)
    remaining_rpm:    Gauge (labels: key, model)
    remaining_tpm:    Gauge (labels: key, model)

FUNCTION on_request_complete(request, response, cost, latency):
    proxy_total_requests.INCREMENT(labels_from(request))
    request_total_latency.OBSERVE(latency, labels_from(request))
    spend_metric.INCREMENT(cost, labels_from(request))
    total_tokens.INCREMENT(response.usage.total_tokens, labels_from(request))

FUNCTION on_request_failure(request, error):
    proxy_failed_requests.INCREMENT(labels_from(request, error))
```
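The same pattern can be sketched without the Prometheus client library, using plain dicts keyed by sorted label tuples; a production deployment would use `prometheus_client`'s `Counter`/`Histogram`/`Gauge` instead:

```python
from collections import defaultdict

class Counter:
    """Monotonically increasing metric, one value per label set."""
    def __init__(self):
        self.values = defaultdict(float)

    def increment(self, amount=1.0, **labels):
        self.values[tuple(sorted(labels.items()))] += amount

class Histogram:
    """Stores raw observations per label set (a real histogram would bucket)."""
    def __init__(self):
        self.observations = defaultdict(list)

    def observe(self, value, **labels):
        self.observations[tuple(sorted(labels.items()))].append(value)

proxy_total_requests = Counter()
request_total_latency = Histogram()
spend_metric = Counter()
total_tokens = Counter()

def on_request_complete(model, key, team, total_token_count, cost, latency):
    """Record the per-request metrics described above for one completed call."""
    labels = {"model": model, "key": key, "team": team}
    proxy_total_requests.increment(**labels)
    request_total_latency.observe(latency, **labels)
    spend_metric.increment(cost, **labels)
    total_tokens.increment(total_token_count, **labels)
```

Sorting the label items before keying makes the metric independent of label argument order, one small aspect of the cardinality discipline discussed below.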
Key design principles:
- Non-intrusive probing -- Health checks use minimal test prompts and short timeouts to avoid impacting production traffic or accumulating significant cost.
- Concurrent checking -- All model deployments are probed concurrently using async tasks, ensuring that slow or unresponsive endpoints do not delay the overall health check.
- Sensitive data filtering -- Health check results strip API keys, credentials, and full message content before display, exposing only safe metadata.
- Label cardinality management -- Prometheus metrics use bounded label sets to avoid cardinality explosion, with configurable label filters for enterprise deployments.
- Background vs. on-demand -- Health checks can run either as periodic background tasks (configurable interval) or on demand via the `/health` endpoint.
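The sensitive-data-filtering principle can be sketched as a redaction pass over deployment params before they are returned from a health endpoint. The key names here are illustrative, not LiteLLM's actual filter list:

```python
# Illustrative set of credential-bearing fields to redact.
SENSITIVE_KEYS = {"api_key", "aws_secret_access_key", "azure_ad_token"}

def clean_display_data(params: dict) -> dict:
    """Return a copy of deployment params that is safe to expose in health output."""
    cleaned = {}
    for key, value in params.items():
        if key in SENSITIVE_KEYS or "secret" in key.lower():
            cleaned[key] = "***REDACTED***"
        else:
            cleaned[key] = value
    return cleaned
```

Redacting rather than deleting keeps the response shape stable for dashboards while ensuring credentials never leave the proxy.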