
Principle:BerriAI Litellm Request Routing

From Leeroopedia
Knowledge Sources: litellm/router.py
Domains: LLM Load Balancing, Request Distribution
Last Updated: 2026-02-15

Overview

Request routing is the process of directing an LLM completion request to the optimal deployment from a pool of available endpoints using a configurable selection strategy.

Description

When a caller issues a completion request with a logical model name, the routing system must:

  1. Resolve the model group -- Map the logical model name to the set of registered deployments that serve it.
  2. Filter healthy deployments -- Exclude deployments that are in cooldown (recently failed), over budget, or otherwise unavailable.
  3. Apply pre-call checks -- Run optional filters such as budget limiting, tag-based routing, and context window validation.
  4. Select a deployment -- Use the configured routing strategy to pick one deployment from the healthy candidates:
    • Simple shuffle -- Random selection with optional weighting.
    • Least busy -- Select the deployment with the fewest in-flight requests.
    • Latency-based -- Prefer deployments with the lowest observed latency.
    • Cost-based -- Prefer the cheapest deployment.
    • Usage-based -- Distribute based on TPM/RPM utilization.
  5. Execute the call -- Forward the request to the selected deployment with the appropriate provider-specific parameters and client instance.
  6. Handle failures -- If the call fails, trigger retry logic within the model group, then fall back across model groups if retries are exhausted.
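Steps 2 and 4 of the flow above (health filtering followed by strategy selection) can be sketched in a few lines. This is an illustrative, stdlib-only sketch, not litellm's actual implementation; the Deployment record and its field names are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Deployment:
    """Hypothetical minimal deployment record for illustration."""
    id: str
    weight: int = 1        # relative weight for simple-shuffle
    in_flight: int = 0     # current in-flight request count
    in_cooldown: bool = False

def get_available_deployment(deployments, strategy="simple-shuffle"):
    # Step 2: exclude deployments that are cooling down after failures.
    healthy = [d for d in deployments if not d.in_cooldown]
    if not healthy:
        raise RuntimeError("no healthy deployments in model group")
    # Step 4: apply the configured selection strategy.
    if strategy == "simple-shuffle":
        # Weighted random selection across healthy candidates.
        return random.choices(healthy, weights=[d.weight for d in healthy])[0]
    if strategy == "least-busy":
        # Deployment with the fewest in-flight requests wins.
        return min(healthy, key=lambda d: d.in_flight)
    raise ValueError(f"unknown strategy: {strategy}")
```

The latency-, cost-, and usage-based strategies follow the same shape: replace the key function with observed latency, per-token cost, or TPM/RPM utilization respectively.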

The routing system supports both synchronous (completion) and asynchronous (acompletion) call paths, with identical routing logic.

Usage

Use request routing when:

  • You have multiple deployments behind a single model name and need to distribute load.
  • You want automatic selection based on latency, cost, or utilization rather than static configuration.
  • You need the routing decision to account for deployment health (cooldowns) and budget constraints.
  • You require both sync and async call patterns with the same routing behavior.
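The "multiple deployments behind a single model name" setup is expressed as a list of deployment entries. The dictionaries below follow the shape litellm's Router accepts via its model_list parameter; the credential values are placeholders, and resolve_group is a hypothetical helper showing that group resolution (step 1 above) is just a filter on the logical name.

```python
# Two deployments registered under one logical model group, "gpt-4".
model_list = [
    {
        "model_name": "gpt-4",                    # logical name callers use
        "litellm_params": {
            "model": "azure/gpt-4-deployment-1",  # provider-specific model id
            "api_key": "AZURE_KEY_1",             # placeholder credential
        },
    },
    {
        "model_name": "gpt-4",
        "litellm_params": {
            "model": "openai/gpt-4",
            "api_key": "OPENAI_KEY",              # placeholder credential
        },
    },
]

def resolve_group(model_list, name):
    """Hypothetical helper: map a logical model name to its deployments."""
    return [d for d in model_list if d["model_name"] == name]
```

In real use this list is passed to litellm's Router, and callers simply request "gpt-4"; the routing strategy decides which of the two entries serves each call.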

Theoretical Basis

Request routing implements the Strategy Pattern for deployment selection combined with the Proxy Pattern for transparent call forwarding.

Pseudocode for the completion flow:

FUNCTION completion(model, messages, **kwargs):
    // Phase 1: Prepare
    kwargs["original_function"] = _completion
    update_kwargs_before_fallbacks(model, kwargs)

    // Phase 2: Execute with fallback wrapper
    response = function_with_fallbacks(**kwargs)
    RETURN response

FUNCTION _completion(model, messages, **kwargs):
    // Phase 3: Select deployment
    deployment = get_available_deployment(
        model=model,
        messages=messages,
        request_kwargs=kwargs,
    )

    // Phase 4: Extract provider parameters
    litellm_params = deployment.litellm_params
    model_name = litellm_params.model
    client = get_client(deployment, kwargs)

    // Phase 5: Run pre-call checks (RPM validation, etc.)
    IF model IS a model group (not a specific deployment ID):
        routing_strategy_pre_call_checks(deployment)

    // Phase 6: Execute LLM call
    response = litellm.completion(
        model=model_name,
        messages=messages,
        client=client,
        **litellm_params,
        **kwargs,
    )

    // Phase 7: Post-call validation
    IF response triggers content policy violation:
        RAISE ContentPolicyViolationError

    RETURN response

Pseudocode for async completion:

ASYNC FUNCTION acompletion(model, messages, stream=False, **kwargs):
    kwargs["original_function"] = _acompletion
    update_kwargs_before_fallbacks(model, kwargs)

    // Optional priority scheduling
    IF request has priority:
        response = AWAIT schedule_acompletion(**kwargs)
    ELSE:
        response = AWAIT async_function_with_fallbacks(**kwargs)

    log_service_success(duration)
    RETURN response
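The claim that the sync and async paths share identical routing logic can be demonstrated concretely: deployment selection is CPU-only work, so one selection routine can back both entry points. The sketch below is illustrative; select_deployment and the pool structure are hypothetical, not litellm internals.

```python
import asyncio

def select_deployment(model, pool):
    """Shared selection routine: least-busy pick, used by both call paths."""
    return min(pool[model], key=lambda d: d["in_flight"])

def completion(model, pool):
    # Synchronous path delegates to the shared selection logic.
    return select_deployment(model, pool)["id"]

async def acompletion(model, pool):
    # Async path reuses the same logic, so routing decisions are identical.
    return select_deployment(model, pool)["id"]

pool = {"gpt-4": [{"id": "a", "in_flight": 3}, {"id": "b", "in_flight": 1}]}
```

Only the I/O around the selected deployment (the actual provider call) differs between the two paths.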

The separation between the public completion() method (which wraps the call in fallback logic) and the private _completion() method (which performs deployment selection and the actual API call) is a key architectural decision. It allows the fallback system to re-invoke _completion() with a different model without duplicating routing logic.
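This public/private split can be sketched minimally. The FALLBACKS mapping mirrors the model-group-to-fallback-group shape described above; everything else (the stubbed _completion, the deployments dict) is hypothetical and stands in for real deployment selection and API calls.

```python
# Fallback map: if "gpt-4" is exhausted, retry the request on "gpt-3.5-turbo".
FALLBACKS = {"gpt-4": ["gpt-3.5-turbo"]}

def _completion(model, messages, deployments):
    """Private path: select a deployment and execute (stubbed here)."""
    if model not in deployments:
        raise RuntimeError(f"no deployment available for {model}")
    return f"response from {deployments[model]}"

def completion(model, messages, deployments):
    """Public path: wrap the private call in cross-group fallback logic."""
    try:
        return _completion(model, messages, deployments)
    except RuntimeError:
        # Re-invoke the same private routine with each fallback group;
        # no routing logic is duplicated in the fallback layer.
        for fallback_model in FALLBACKS.get(model, []):
            try:
                return _completion(fallback_model, messages, deployments)
            except RuntimeError:
                continue
        raise
```

Because the fallback layer only swaps the model argument, any fix to selection or pre-call checks in the private routine automatically applies to fallback attempts as well.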
