Workflow: BerriAI LiteLLM Router Load Balancing
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Infrastructure, Reliability |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
End-to-end process for distributing LLM API calls across multiple model deployments with intelligent routing, automatic failover, and rate limiting.
Description
This workflow covers the setup and use of LiteLLM's Router system for production-grade LLM deployment management. The Router maintains a pool of model deployments (potentially across different providers or API keys), applies routing strategies (lowest latency, least busy, lowest cost, lowest TPM/RPM, or simple shuffle), handles deployment failures with cooldown periods, and provides automatic fallback to alternative model groups. It enables high availability and cost optimization for LLM-powered applications.
Key outputs:
- Load-balanced LLM calls across multiple deployments
- Automatic failover when deployments fail or hit rate limits
- Per-deployment health tracking and cooldown management
- Budget-aware and latency-aware routing decisions
Usage
Execute this workflow when you have multiple LLM deployments (same model across different API keys, regions, or providers) and need to distribute load, maximize availability, or optimize for cost or latency. This is essential for production environments serving multiple users or high request volumes.
Execution Steps
Step 1: Deployment Definition
Define the model deployment list, where each entry specifies a logical model name and the underlying provider configuration. Multiple deployments can share the same logical name, allowing the Router to treat them as a pool. Each deployment includes the provider model string, API key, and optional parameters like rate limits (TPM/RPM).
Key considerations:
- Each deployment has a `model_name` (logical) and `litellm_params` (physical provider config)
- Deployments with the same `model_name` form a load-balancing group
- Rate limits (`tpm`, `rpm`) can be set per deployment to prevent overload
- Deployments can span different providers (e.g., OpenAI + Azure for the same logical model)
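A deployment list for this step might look like the following sketch. All API keys, endpoints, and model strings are placeholders; placing `tpm`/`rpm` inside `litellm_params` follows common LiteLLM usage, but verify against your installed version:

```python
# Hypothetical deployment pool: two deployments share the logical name
# "gpt-4", forming one load-balancing group; a third group serves as a
# potential fallback target.
model_list = [
    {
        "model_name": "gpt-4",                   # logical name
        "litellm_params": {
            "model": "openai/gpt-4",             # physical provider model
            "api_key": "sk-openai-key-1",        # placeholder
            "tpm": 100_000,                      # per-deployment rate limits
            "rpm": 1_000,
        },
    },
    {
        "model_name": "gpt-4",                   # same group, different provider
        "litellm_params": {
            "model": "azure/gpt-4-deployment",
            "api_key": "azure-key",              # placeholder
            "api_base": "https://example.openai.azure.com",
        },
    },
    {
        "model_name": "gpt-3.5-turbo",           # separate group
        "litellm_params": {
            "model": "openai/gpt-3.5-turbo",
            "api_key": "sk-openai-key-2",
        },
    },
]

# Deployments sharing a model_name form one load-balancing group.
groups = {}
for d in model_list:
    groups.setdefault(d["model_name"], []).append(d)
```

Here the Router would see two interchangeable deployments behind `gpt-4` and one behind `gpt-3.5-turbo`.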
Step 2: Router Initialization
Create a Router instance with the model list and configure the routing strategy. Available strategies include: simple-shuffle (random weighted), least-busy (fewest in-flight requests), latency-based-routing (lowest recent latency), cost-based-routing (lowest cost per token), and usage-based-routing (lowest TPM/RPM usage).
Key considerations:
- The routing strategy determines how deployments are selected for each call
- `allowed_fails` configures how many failures before a deployment enters cooldown
- `cooldown_time` sets how long a failed deployment is excluded from routing
- Redis can be used for cross-instance state sharing in distributed setups
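These settings can be collected as keyword arguments for the Router constructor. The parameter names below match LiteLLM's documented Router options, but the values are illustrative, not recommendations:

```python
# Illustrative Router configuration for this step.
router_kwargs = {
    "routing_strategy": "latency-based-routing",  # or "simple-shuffle",
                                                  # "least-busy",
                                                  # "cost-based-routing",
                                                  # "usage-based-routing"
    "allowed_fails": 3,    # failures before a deployment enters cooldown
    "cooldown_time": 60,   # seconds a cooled-down deployment is excluded
    # Optional shared state for multi-instance deployments:
    # "redis_host": "localhost",
    # "redis_port": 6379,
}

# With litellm installed and model_list defined as in Step 1:
# from litellm import Router
# router = Router(model_list=model_list, **router_kwargs)
```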
Step 3: Retry and Fallback Configuration
Configure retry behavior and fallback model groups. Retries control how many times a failed call is attempted on different deployments within the same model group. Fallbacks define alternative model groups to try when the primary group is exhausted (e.g., fall back from GPT-4 to GPT-3.5-turbo).
Key considerations:
- `num_retries` controls retry count within the same model group
- `fallbacks` defines an ordered list of alternative model groups
- `retry_policy` can customize retry counts per exception type (e.g., more retries for rate limits)
- Context window exceeded errors can trigger automatic fallback to models with larger context
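A minimal fallback configuration might look like this sketch. The mapping shape (`{primary_group: [alternative_groups]}`) follows LiteLLM's documented `fallbacks` format; the group names are placeholders:

```python
# When the "gpt-4" group is exhausted, fall back to "gpt-3.5-turbo";
# on context-window errors, fall back to a larger-context model instead.
fallbacks = [{"gpt-4": ["gpt-3.5-turbo"]}]
context_window_fallbacks = [{"gpt-4": ["gpt-4-32k"]}]

num_retries = 2  # retries within the same model group before falling back

# With litellm installed these would be passed to the Router, e.g.:
# router = Router(model_list=model_list, num_retries=num_retries,
#                 fallbacks=fallbacks,
#                 context_window_fallbacks=context_window_fallbacks)
```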
Step 4: Request Routing
Make completion calls through the Router using `router.completion()` or `router.acompletion()`. The Router selects a deployment based on the configured strategy, applies pre-call checks (rate limits, budget limits, cooldowns), and dispatches the call. If the selected deployment fails, the Router automatically retries on another deployment.
What happens:
- Router filters out cooled-down and rate-limited deployments
- Remaining deployments are ranked by the routing strategy
- The top-ranked deployment handles the request
- On failure, the deployment is penalized and the next deployment is tried
- Request metadata includes which deployment was used
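The selection flow above can be sketched as a simplified, self-contained loop. This illustrates the logic described in the list, not LiteLLM's actual implementation:

```python
def route_request(deployments, cooled_down, score):
    """Try deployments in rank order until one succeeds.

    deployments: list of deployment ids
    cooled_down: set of ids currently excluded by cooldown
    score: dict id -> recent latency (lower is better, per the strategy)
    Returns (winning deployment id, list of ids tried).
    """
    # Filter out cooled-down deployments, then rank the remainder.
    eligible = [d for d in deployments if d not in cooled_down]
    ranked = sorted(eligible, key=lambda d: score[d])
    tried = []
    for d in ranked:
        tried.append(d)
        if call_succeeds(d):   # stand-in for the actual provider call
            return d, tried
        score[d] += 1.0        # penalize the failed deployment
    raise RuntimeError("all deployments in the group failed")

# Stand-in provider call: deployment "b" always fails in this demo.
def call_succeeds(deployment_id):
    return deployment_id != "b"

winner, tried = route_request(
    deployments=["a", "b", "c"],
    cooled_down={"c"},                    # "c" is in cooldown, never considered
    score={"a": 0.9, "b": 0.2, "c": 0.1},
)
# "b" has the lowest latency but fails, so the loop falls through to "a".
```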
Step 5: Health Monitoring
The Router continuously tracks deployment health through success/failure counters, latency measurements, and rate limit consumption. Failed deployments enter a cooldown period during which they receive no traffic. After cooldown expires, they are gradually reintroduced to the pool.
Key considerations:
- Deployment health is tracked in an in-memory cache (or Redis for distributed setups)
- Cooldown state is shared across routing decisions
- Prometheus metrics can be emitted for external monitoring
- Health check endpoints can actively probe deployment availability
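The failure-count and cooldown bookkeeping described above can be modeled with a small tracker. This is a simplified illustration of the `allowed_fails`/`cooldown_time` behavior, not LiteLLM's internal code:

```python
class CooldownTracker:
    """Simplified per-deployment failure and cooldown bookkeeping."""

    def __init__(self, allowed_fails=3, cooldown_time=60):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts = {}     # deployment id -> consecutive failures
        self.cooldown_until = {}  # deployment id -> timestamp

    def record_failure(self, dep, now):
        self.fail_counts[dep] = self.fail_counts.get(dep, 0) + 1
        if self.fail_counts[dep] >= self.allowed_fails:
            # Too many failures: exclude the deployment for cooldown_time.
            self.cooldown_until[dep] = now + self.cooldown_time
            self.fail_counts[dep] = 0

    def record_success(self, dep):
        self.fail_counts[dep] = 0  # successes reset the failure streak

    def is_available(self, dep, now):
        return now >= self.cooldown_until.get(dep, 0)

tracker = CooldownTracker(allowed_fails=2, cooldown_time=60)
tracker.record_failure("azure-east", now=100)
tracker.record_failure("azure-east", now=101)  # second failure -> cooldown
unavailable = not tracker.is_available("azure-east", now=120)
available_again = tracker.is_available("azure-east", now=200)
```

In a distributed setup, this state would live in Redis rather than in-process, so all Router instances see the same cooldowns.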
Step 6: Budget and Rate Limiting
The Router enforces per-deployment and per-provider budget limits and rate limits. It tracks token and request consumption against configured limits and excludes deployments that would exceed their budget or rate limit allocation.
Key considerations:
- TPM (tokens per minute) and RPM (requests per minute) limits are enforced per deployment
- Provider-level budget limits cap total spend across all deployments for a provider
- Budget tracking uses the internal cost calculator for accurate spend estimation
- Limits reset on configurable time windows
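The pre-call limit check amounts to filtering out deployments whose current-window consumption would exceed their allocation. A minimal sketch, with hypothetical deployment ids and usage counters:

```python
def within_limits(deployment, usage):
    """Return True if the deployment still has TPM/RPM headroom.

    Simplified illustration: limits and usage are per-minute counters;
    a missing limit means unlimited.
    """
    tpm_ok = usage["tokens"] < deployment.get("tpm", float("inf"))
    rpm_ok = usage["requests"] < deployment.get("rpm", float("inf"))
    return tpm_ok and rpm_ok

deployments = [
    {"id": "primary",  "tpm": 100_000, "rpm": 1_000},
    {"id": "overflow", "tpm": 50_000,  "rpm": 500},
]
usage = {
    "primary":  {"tokens": 100_000, "requests": 400},  # TPM exhausted
    "overflow": {"tokens": 10_000,  "requests": 20},
}

# Only deployments with remaining headroom are eligible for routing.
eligible = [d["id"] for d in deployments if within_limits(d, usage[d["id"]])]
```

Here `primary` has consumed its full token allocation for the window, so only `overflow` remains eligible until the counters reset.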