Principle: BerriAI LiteLLM Request Routing
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| litellm/router.py | LLM Load Balancing, Request Distribution | 2026-02-15 |
Overview
Request routing is the process of directing an LLM completion request to the optimal deployment from a pool of available endpoints using a configurable selection strategy.
Description
When a caller issues a completion request with a logical model name, the routing system must:
- Resolve the model group -- Map the logical model name to the set of registered deployments that serve it.
- Filter healthy deployments -- Exclude deployments that are in cooldown (recently failed), over budget, or otherwise unavailable.
- Apply pre-call checks -- Run optional filters such as budget limiting, tag-based routing, and context window validation.
- Select a deployment -- Use the configured routing strategy to pick one deployment from the healthy candidates:
  - Simple shuffle -- Random selection with optional weighting.
  - Least busy -- Select the deployment with the fewest in-flight requests.
  - Latency-based -- Prefer deployments with the lowest observed latency.
  - Cost-based -- Prefer the cheapest deployment.
  - Usage-based -- Distribute based on TPM/RPM utilization.
- Execute the call -- Forward the request to the selected deployment with the appropriate provider-specific parameters and client instance.
- Handle failures -- If the call fails, trigger retry logic within the model group, then fall back across model groups if retries are exhausted.
The routing system supports both synchronous (completion) and asynchronous (acompletion) call paths, with identical routing logic.
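The filtering and selection steps above can be sketched in plain Python. This is a stdlib-only illustration, not LiteLLM's actual implementation; the `Deployment` class and `pick_simple_shuffle` helper are invented for the example.

```python
import random
import time
from dataclasses import dataclass

@dataclass
class Deployment:
    """Illustrative stand-in for a router deployment entry (not LiteLLM's class)."""
    id: str
    model: str                   # provider-specific model name
    weight: int = 1              # optional weight for simple shuffle
    cooldown_until: float = 0.0  # unix time until which this deployment is excluded

def healthy(deployments):
    """Filter step: exclude deployments that are in cooldown."""
    now = time.time()
    return [d for d in deployments if d.cooldown_until <= now]

def pick_simple_shuffle(deployments):
    """Selection step: weighted random choice among healthy candidates."""
    candidates = healthy(deployments)
    if not candidates:
        raise RuntimeError("no healthy deployments for this model group")
    weights = [d.weight for d in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

pool = [
    Deployment(id="azure-1", model="azure/gpt-4", weight=3),
    Deployment(id="openai-1", model="gpt-4", weight=1),
    Deployment(id="cooling", model="gpt-4", cooldown_until=time.time() + 60),
]
chosen = pick_simple_shuffle(pool)  # never "cooling" while its cooldown is active
```

The weighted shuffle favors `azure-1` three-to-one here, while the deployment in cooldown is never eligible regardless of its weight.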
Usage
Use request routing when:
- You have multiple deployments behind a single model name and need to distribute load.
- You want automatic selection based on latency, cost, or utilization rather than static configuration.
- You need the routing decision to account for deployment health (cooldowns) and budget constraints.
- You require both sync and async call patterns with the same routing behavior.
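The last point, identical routing behavior across sync and async paths, is typically achieved by sharing one selection function between both entry points. A minimal stdlib sketch, with all names invented for illustration:

```python
import asyncio
import random

# Hypothetical model-group registry: logical name -> deployments
DEPLOYMENTS = {"gpt-4": ["azure/gpt-4-eu", "azure/gpt-4-us"]}

def select_deployment(model_group):
    """Single routing decision shared by both call paths."""
    return random.choice(DEPLOYMENTS[model_group])

def completion(model, messages):
    deployment = select_deployment(model)
    return f"sync response from {deployment}"

async def acompletion(model, messages):
    deployment = select_deployment(model)  # same routing logic as the sync path
    await asyncio.sleep(0)                 # placeholder for the real async API call
    return f"async response from {deployment}"

sync_result = completion("gpt-4", [{"role": "user", "content": "hi"}])
async_result = asyncio.run(acompletion("gpt-4", [{"role": "user", "content": "hi"}]))
```

Because both paths call the same `select_deployment`, any change to strategy, health filtering, or budget checks applies to sync and async callers alike.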
Theoretical Basis
Request routing implements the Strategy Pattern for deployment selection combined with the Proxy Pattern for transparent call forwarding.
Pseudocode for the completion flow:
```
FUNCTION completion(model, messages, **kwargs):
    // Phase 1: Prepare
    kwargs["original_function"] = _completion
    update_kwargs_before_fallbacks(model, kwargs)

    // Phase 2: Execute with fallback wrapper
    response = function_with_fallbacks(**kwargs)
    RETURN response

FUNCTION _completion(model, messages, **kwargs):
    // Phase 3: Select deployment
    deployment = get_available_deployment(
        model=model,
        messages=messages,
        request_kwargs=kwargs,
    )

    // Phase 4: Extract provider parameters
    litellm_params = deployment.litellm_params
    model_name = litellm_params.model
    client = get_client(deployment, kwargs)

    // Phase 5: Run pre-call checks (RPM validation, etc.)
    IF model IS a model group (not a specific deployment ID):
        routing_strategy_pre_call_checks(deployment)

    // Phase 6: Execute LLM call
    response = litellm.completion(
        model=model_name,
        messages=messages,
        client=client,
        **litellm_params,
        **kwargs,
    )

    // Phase 7: Post-call validation
    IF response triggers content policy violation:
        RAISE ContentPolicyViolationError

    RETURN response
```
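The Strategy Pattern behind Phase 3 can be sketched as a mapping from strategy name to selection function. This is an illustrative stdlib sketch; the function names and stats dictionary are assumptions, though the strategy name strings mirror the ones listed in the Description.

```python
import random

def simple_shuffle(deployments, stats):
    """Random selection (unweighted variant)."""
    return random.choice(deployments)

def least_busy(deployments, stats):
    """Fewest in-flight requests wins."""
    return min(deployments, key=lambda d: stats[d]["in_flight"])

def lowest_latency(deployments, stats):
    """Lowest observed average latency wins."""
    return min(deployments, key=lambda d: stats[d]["avg_latency_s"])

# Strategy Pattern: selection behavior is swappable via configuration
STRATEGIES = {
    "simple-shuffle": simple_shuffle,
    "least-busy": least_busy,
    "latency-based": lowest_latency,
}

def get_available_deployment(deployments, stats, strategy="simple-shuffle"):
    return STRATEGIES[strategy](deployments, stats)

stats = {
    "azure-1": {"in_flight": 5, "avg_latency_s": 0.9},
    "openai-1": {"in_flight": 2, "avg_latency_s": 1.4},
}
least_loaded = get_available_deployment(["azure-1", "openai-1"], stats, "least-busy")
fastest = get_available_deployment(["azure-1", "openai-1"], stats, "latency-based")
```

Registering each strategy behind a common signature is what lets the router swap selection logic through configuration alone, without touching the call-execution code.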
Pseudocode for async completion:
```
ASYNC FUNCTION acompletion(model, messages, stream=False, **kwargs):
    kwargs["original_function"] = _acompletion
    update_kwargs_before_fallbacks(model, kwargs)

    // Optional priority scheduling
    IF request has priority:
        response = AWAIT schedule_acompletion(**kwargs)
    ELSE:
        response = AWAIT async_function_with_fallbacks(**kwargs)

    log_service_success(duration)
    RETURN response
```
The separation between the public completion() method (which wraps the call in fallback logic) and the private _completion() method (which performs deployment selection and the actual API call) is a key architectural decision. It allows the fallback system to re-invoke _completion() with a different model without duplicating routing logic.
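That separation can be sketched as a wrapper that retries the inner function within a model group, then walks fallback groups in order. All names here are illustrative, not LiteLLM's actual signatures.

```python
def function_with_fallbacks(inner, model, fallbacks, num_retries=2, **kwargs):
    """Retry `inner` on the requested group, then try each fallback group."""
    groups = [model] + fallbacks.get(model, [])
    last_error = None
    for group in groups:
        for attempt in range(num_retries + 1):
            try:
                # `inner` performs deployment selection and the actual call,
                # so fallbacks reuse the routing logic unchanged
                return inner(model=group, **kwargs)
            except RuntimeError as err:
                last_error = err
    raise last_error

calls = []

def flaky_completion(model, **kwargs):
    """Stand-in for _completion: the gpt-4 group always fails."""
    calls.append(model)
    if model == "gpt-4":
        raise RuntimeError("gpt-4 group exhausted")
    return f"response from {model}"

result = function_with_fallbacks(
    flaky_completion, "gpt-4", fallbacks={"gpt-4": ["gpt-3.5-turbo"]}
)
```

Because the wrapper only changes the `model` argument between attempts, deployment selection, pre-call checks, and call execution all run identically for the fallback group.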