
Workflow:BerriAI Litellm Router Load Balancing

From Leeroopedia
Domains LLM_Ops, Infrastructure, Reliability
Last Updated 2026-02-15 16:00 GMT

Overview

End-to-end process for distributing LLM API calls across multiple model deployments with intelligent routing, automatic failover, and rate limiting.

Description

This workflow covers the setup and use of LiteLLM's Router system for production-grade LLM deployment management. The Router maintains a pool of model deployments (potentially across different providers or API keys), applies routing strategies (lowest latency, least busy, lowest cost, lowest TPM/RPM, or simple shuffle), handles deployment failures with cooldown periods, and provides automatic fallback to alternative model groups. It enables high availability and cost optimization for LLM-powered applications.

Key outputs:

  • Load-balanced LLM calls across multiple deployments
  • Automatic failover when deployments fail or hit rate limits
  • Per-deployment health tracking and cooldown management
  • Budget-aware and latency-aware routing decisions

Usage

Execute this workflow when you have multiple LLM deployments (same model across different API keys, regions, or providers) and need to distribute load, maximize availability, or optimize for cost or latency. This is essential for production environments serving multiple users or high request volumes.

Execution Steps

Step 1: Deployment Definition

Define the model deployment list, where each entry specifies a logical model name and the underlying provider configuration. Multiple deployments can share the same logical name, allowing the Router to treat them as a pool. Each deployment includes the provider model string, API key, and optional parameters like rate limits (TPM/RPM).

Key considerations:

  • Each deployment has a model_name (logical) and litellm_params (physical provider config)
  • Deployments with the same model_name form a load-balancing group
  • Rate limits (tpm, rpm) can be set per deployment to prevent overload
  • Deployments can span different providers (e.g., OpenAI + Azure for the same logical model)
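
The deployment list is a plain Python structure passed to `litellm.Router`. A minimal sketch, assuming a hypothetical Azure deployment name and placeholder keys and endpoints (none of the credentials below are real):

```python
# Two deployments sharing the logical name "gpt-4" form one
# load-balancing group; keys and endpoints are placeholders.
model_list = [
    {
        "model_name": "gpt-4",  # logical name seen by callers
        "litellm_params": {
            "model": "azure/my-gpt4-deployment",  # hypothetical Azure deployment
            "api_key": "AZURE_KEY_PLACEHOLDER",
            "api_base": "https://example-resource.openai.azure.com",
            "tpm": 240_000,  # per-deployment tokens-per-minute cap
            "rpm": 1_800,    # per-deployment requests-per-minute cap
        },
    },
    {
        "model_name": "gpt-4",
        "litellm_params": {
            "model": "gpt-4",  # OpenAI-hosted deployment of the same model
            "api_key": "OPENAI_KEY_PLACEHOLDER",
            "tpm": 100_000,
            "rpm": 500,
        },
    },
]

# Deployments with the same model_name are pooled together:
group = [d for d in model_list if d["model_name"] == "gpt-4"]
```

Note that the group spans two providers (Azure and OpenAI) behind one logical name, which is exactly what lets the Router fail over between them transparently.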

Step 2: Router Initialization

Create a Router instance with the model list and a routing strategy. Available strategies: simple-shuffle (weighted random selection), least-busy (fewest in-flight requests), latency-based-routing (lowest recent latency), cost-based-routing (lowest cost per token), and usage-based-routing (lowest current TPM/RPM usage).

Key considerations:

  • The routing strategy determines how deployments are selected for each call
  • allowed_fails configures how many failures before a deployment enters cooldown
  • cooldown_time sets how long a failed deployment is excluded from routing
  • Redis can be used for cross-instance state sharing in distributed setups
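
The considerations above map onto Router constructor arguments. A configuration sketch with illustrative values (the Redis host and port are assumptions for a local setup); it would be unpacked into `litellm.Router(model_list=..., **router_config)`:

```python
# Keyword arguments for litellm.Router(...) — a configuration sketch;
# values are illustrative, not recommendations.
router_config = {
    "routing_strategy": "latency-based-routing",  # or "simple-shuffle",
                                                  # "least-busy",
                                                  # "cost-based-routing",
                                                  # "usage-based-routing"
    "allowed_fails": 3,     # failures before a deployment enters cooldown
    "cooldown_time": 60,    # seconds the deployment is excluded from routing
    # Optional: Redis for cross-instance state in distributed setups
    # (host/port below assume a local Redis instance):
    "redis_host": "localhost",
    "redis_port": 6379,
}

# Usage (assumes a model_list as defined in Step 1):
#   from litellm import Router
#   router = Router(model_list=model_list, **router_config)
```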

Step 3: Retry and Fallback Configuration

Configure retry behavior and fallback model groups. Retries control how many times a failed call is attempted on different deployments within the same model group. Fallbacks define alternative model groups to try when the primary group is exhausted (e.g., fall back from GPT-4 to GPT-3.5-turbo).

Key considerations:

  • num_retries controls retry count within the same model group
  • fallbacks defines ordered list of alternative model groups
  • retry_policy can customize retry counts per exception type (e.g., more retries for rate limits)
  • Context window exceeded errors can trigger automatic fallback to models with larger context
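
These settings can be sketched as additional Router keyword arguments. The model names are illustrative, and the commented `RetryPolicy` import path should be verified against your installed LiteLLM version:

```python
# Retry and fallback settings sketch for litellm.Router;
# model names are examples only.
retry_and_fallback_config = {
    "num_retries": 3,  # retries within the same model group
    # Ordered fallback groups: exhaust "gpt-4", then try "gpt-3.5-turbo":
    "fallbacks": [{"gpt-4": ["gpt-3.5-turbo"]}],
    # Separate fallback path for context-window-exceeded errors,
    # routing to a larger-context model:
    "context_window_fallbacks": [{"gpt-3.5-turbo": ["gpt-4-32k"]}],
}

# Per-exception retry counts can be customized with LiteLLM's RetryPolicy,
# e.g. retrying rate-limit errors more aggressively than timeouts:
#   from litellm.router import RetryPolicy
#   policy = RetryPolicy(RateLimitErrorRetries=4, TimeoutErrorRetries=1)
```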

Step 4: Request Routing

Make completion calls through the Router using router.completion() or router.acompletion(). The Router selects a deployment based on the configured strategy, applies pre-call checks (rate limits, budget limits, cooldowns), and dispatches the call. If the selected deployment fails, the Router automatically retries on another deployment.

What happens:

  • Router filters out cooled-down and rate-limited deployments
  • Remaining deployments are ranked by the routing strategy
  • The top-ranked deployment handles the request
  • On failure, the deployment is penalized and the next deployment is tried
  • Request metadata includes which deployment was used
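
The filter-and-rank flow above can be sketched with a toy selection function. This illustrates the described behavior only, not LiteLLM's actual internals; the deployment ids and latencies are invented:

```python
import time

# Illustrative routing step: drop cooled-down deployments, then rank
# the survivors by recent latency (latency-based strategy).
deployments = [
    {"id": "azure-eastus", "cooldown_until": 0.0, "avg_latency_s": 0.9},
    {"id": "openai-main", "cooldown_until": time.time() + 60, "avg_latency_s": 0.4},
    {"id": "azure-westus", "cooldown_until": 0.0, "avg_latency_s": 0.6},
]

def pick_deployment(pool, now=None):
    now = time.time() if now is None else now
    # Pre-call check: exclude deployments still in cooldown
    healthy = [d for d in pool if d["cooldown_until"] <= now]
    if not healthy:
        # Group exhausted — in the real Router this triggers fallbacks
        raise RuntimeError("no healthy deployments; trigger fallback group")
    # Rank by the strategy metric and pick the best
    return min(healthy, key=lambda d: d["avg_latency_s"])

chosen = pick_deployment(deployments)
print(chosen["id"])  # "azure-westus": openai-main is cooling down
```

In actual use the call is simply `router.completion(model="gpt-4", messages=[...])` with the logical model name; the selection loop above happens inside the Router.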

Step 5: Health Monitoring

The Router continuously tracks deployment health through success/failure counters, latency measurements, and rate limit consumption. Failed deployments enter a cooldown period during which they receive no traffic. Once the cooldown expires, they become eligible for routing again.

Key considerations:

  • Deployment health is tracked in an in-memory cache (or Redis for distributed setups)
  • Cooldown state is shared across routing decisions
  • Prometheus metrics can be emitted for external monitoring
  • Health check endpoints can actively probe deployment availability
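
The failure-counting and cooldown behavior can be modeled with a small illustrative class. This is a simplified sketch, not LiteLLM's implementation; `allowed_fails` and `cooldown_time` mirror the Router parameters named in Step 2:

```python
import time

# Toy per-deployment health tracker illustrating the cooldown mechanism.
class DeploymentHealth:
    def __init__(self, allowed_fails=3, cooldown_time=60.0):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_count = 0
        self.cooldown_until = 0.0

    def record_failure(self, now=None):
        now = time.time() if now is None else now
        self.fail_count += 1
        if self.fail_count >= self.allowed_fails:
            self.cooldown_until = now + self.cooldown_time  # start cooldown
            self.fail_count = 0  # reset counter for the next cycle

    def record_success(self):
        self.fail_count = 0  # a success clears the failure streak

    def is_available(self, now=None):
        now = time.time() if now is None else now
        return now >= self.cooldown_until

# Three quick failures push the deployment into a 60s cooldown:
h = DeploymentHealth(allowed_fails=3, cooldown_time=60.0)
for _ in range(3):
    h.record_failure(now=100.0)
print(h.is_available(now=120.0))  # False: still cooling down
print(h.is_available(now=161.0))  # True: cooldown expired
```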

Step 6: Budget and Rate Limiting

The Router enforces per-deployment and per-provider budget limits and rate limits. It tracks token and request consumption against configured limits and excludes deployments that would exceed their budget or rate limit allocation.

Key considerations:

  • TPM (tokens per minute) and RPM (requests per minute) limits are enforced per deployment
  • Provider-level budget limits cap total spend across all deployments for a provider
  • Budget tracking uses the internal cost calculator for accurate spend estimation
  • Limits reset on configurable time windows
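
A TPM check over a one-minute sliding window can be sketched as follows. This toy limiter illustrates the pre-call exclusion logic described above, not LiteLLM's internal accounting:

```python
import time
from collections import deque

# Toy per-deployment TPM limiter: usage older than 60s ages out,
# and a call that would exceed the cap excludes the deployment.
class TpmLimiter:
    def __init__(self, tpm_limit):
        self.tpm_limit = tpm_limit
        self.events = deque()  # (timestamp, tokens) pairs

    def _prune(self, now):
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()  # drop usage outside the 1-minute window

    def used(self, now=None):
        now = time.time() if now is None else now
        self._prune(now)
        return sum(tokens for _, tokens in self.events)

    def would_exceed(self, tokens, now=None):
        now = time.time() if now is None else now
        return self.used(now) + tokens > self.tpm_limit

    def record(self, tokens, now=None):
        now = time.time() if now is None else now
        self.events.append((now, tokens))

limiter = TpmLimiter(tpm_limit=1000)
limiter.record(800, now=0.0)
print(limiter.would_exceed(300, now=10.0))  # True: 800 + 300 > 1000
print(limiter.would_exceed(300, now=65.0))  # False: the 800 tokens aged out
```

The same window-and-prune pattern extends naturally to RPM (count requests instead of tokens) and to budget limits (accumulate estimated cost instead of tokens).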

Execution Diagram

GitHub URL

Workflow Repository