

Principle:Mlc ai Mlc llm Custom Request Routing

From Leeroopedia
Revision as of 17:47, 16 February 2026 by Admin (auto-imported from principles/Mlc_ai_Mlc_llm_Custom_Request_Routing.md)


Knowledge Sources
Domains Deep_Learning, Distributed_Serving
Last Updated 2026-02-09 00:00 GMT

Overview

Custom request routing is the technique of translating high-level API requests into orchestrated sequences of low-level microserving operations, enabling programmable dispatch policies that determine how each request flows through a multi-engine inference system.

Description

In disaggregated LLM serving, a single user-facing API request (e.g., an OpenAI-compatible completion request) does not map directly to a single backend call. Instead, it must be decomposed into a coordinated sequence of microserving operations that may span multiple engines. Custom request routing provides a programmable translation layer that sits between the user-facing API and the internal microserving protocol.

The key responsibilities of the routing layer are:

  • Request translation: Converting a standard CompletionRequest into the appropriate sequence of microserving calls (prep_recv, remote_send, start_generate) based on the selected routing policy.
  • Policy selection: Choosing between routing strategies (e.g., disaggregated prefill-decode or round-robin) based on the router's configuration, and potentially switching strategies dynamically.
  • Endpoint selection: Deciding which specific engine endpoints handle the prefill and decode phases of each request, using load-aware heuristics.
  • Extensibility: Providing a clean override point (translate_request) where custom subclasses can implement entirely new routing strategies without modifying the rest of the infrastructure.

This design follows the Strategy pattern: the routing policy is encapsulated in a method that can be swapped or overridden, while the surrounding infrastructure (request parsing, retry logic, response streaming) remains stable.

Usage

Use custom request routing when:

  • You are implementing a disaggregated serving system that needs to decompose requests into multi-step microserving protocols.
  • You want to support multiple routing strategies (disaggregated, round-robin, or custom) behind a single API interface.
  • You need load-aware request dispatch to balance work across multiple engine endpoints.
  • You are building a custom router subclass that implements domain-specific scheduling policies (e.g., priority-based routing, cost-aware routing, or latency-optimized routing).

Theoretical Basis

Request Translation as a Strategy Pattern

The routing layer implements a clean separation between the external API contract and the internal dispatch logic:

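async def translate_request(self, request, request_id):  # -> AsyncGenerator[response]
    # `yield from` is not allowed inside an async generator, so each flow is re-yielded.
    match self.router_mode:
        case "disagg":
            async for response in disaggregated_flow(request, request_id):
                yield response
        case "round-robin":
            async for response in round_robin_flow(request):
                yield response
        case _:  # any other mode is handled by a custom flow
            async for response in custom_flow(request, request_id):
                yield response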

Each flow is an async generator that yields CompletionResponse objects (for streaming) or a single response (for non-streaming). The generator can also yield None to signal preemption, which the outer retry loop in handle_completion interprets as a request to restart the translation.
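The preemption contract described above can be sketched as follows. This is a minimal illustration, not the framework's code: ToyRouter, preempt_first_n, and collect are hypothetical names, and requests are plain strings rather than CompletionRequest objects. Only the shape of the loop (restart the translation whenever None is yielded) is taken from the text.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class ToyRouter:
    """Hypothetical router used only to illustrate the retry contract."""

    def __init__(self, preempt_first_n: int) -> None:
        self.attempts = 0
        self.preempt_first_n = preempt_first_n  # assumption: simulated preemptions

    async def translate_request(
        self, request: str, request_id: str
    ) -> AsyncGenerator[Optional[str], None]:
        self.attempts += 1
        if self.attempts <= self.preempt_first_n:
            yield None  # preemption signal: ask the caller to start over
            return
        for i in range(2):
            yield f"{request_id}:chunk-{i}"

    async def handle_completion(
        self, request: str, request_id: str
    ) -> AsyncGenerator[str, None]:
        completed = False
        while not completed:
            completed = True
            async for response in self.translate_request(request, request_id):
                if response is None:
                    completed = False  # restart the translation from scratch
                    break
                yield response


async def collect(router: ToyRouter) -> List[str]:
    # Drain the outer generator the way a streaming HTTP handler would.
    return [r async for r in router.handle_completion("hello", "req-1")]


chunks = asyncio.run(collect(ToyRouter(preempt_first_n=1)))
print(chunks)
```

In this sketch the preempted attempt yields nothing before None, so the client never sees duplicated chunks. A flow that can be preempted after streaming partial output would need to decide how to reconcile what was already sent.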

Disaggregated Flow

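In disaggregated mode, the translation decomposes a single request into three sequential microserving calls (prep_recv, remote_send, start_generate), skipping remote_send when the prompt is already fully cached: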

disaggregated_flow(request, request_id):
    1. prefill_server = server[0]  (fixed)
    2. decode_server  = pick_least_loaded(server[1:])
    3. metadata       = prep_recv(decode_server, request)
    4. if not fully_cached:
           remote_send(prefill_server, request, metadata)
    5. yield from start_generate(decode_server, request)
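
The five steps above could look roughly like this in Python. It is a sketch under stated assumptions: ToyEngine is a hypothetical in-process stand-in for an engine endpoint, its method signatures and the prefix_matched_length field are invented for illustration, load is approximated by an in_flight counter, and "fully cached" is taken to mean that the decode engine already holds the whole prompt. Real microserving calls would be network requests with richer arguments.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncGenerator, List


@dataclass
class ToyEngine:
    """Hypothetical stand-in for one engine endpoint."""

    name: str
    cached_tokens: int = 0  # prompt-prefix tokens already in this engine's KV cache
    in_flight: int = 0      # requests currently being served (assumed load metric)
    log: List[str] = field(default_factory=list)

    async def prep_recv(self, prompt: List[int]) -> dict:
        # Report how much of the prompt this engine already has cached.
        self.log.append("prep_recv")
        return {"prefix_matched_length": min(self.cached_tokens, len(prompt))}

    async def remote_send(self, prompt: List[int], metadata: dict, target: "ToyEngine") -> None:
        # Stands in for prefilling the uncached part and shipping KV data to the target.
        self.log.append(f"remote_send->{target.name}")

    async def start_generate(self, prompt: List[int]) -> AsyncGenerator[str, None]:
        self.log.append("start_generate")
        for i in range(2):
            yield f"{self.name}:token-{i}"


async def disaggregated_flow(servers: List[ToyEngine], prompt: List[int]):
    prefill_server = servers[0]                                   # step 1: fixed
    decode_server = min(servers[1:], key=lambda s: s.in_flight)   # step 2
    metadata = await decode_server.prep_recv(prompt)              # step 3
    fully_cached = metadata["prefix_matched_length"] >= len(prompt)
    if not fully_cached:                                          # step 4
        await prefill_server.remote_send(prompt, metadata, decode_server)
    async for response in decode_server.start_generate(prompt):   # step 5
        yield response


async def collect(servers: List[ToyEngine], prompt: List[int]) -> List[str]:
    return [r async for r in disaggregated_flow(servers, prompt)]


servers = [
    ToyEngine("prefill"),
    ToyEngine("decode-a", in_flight=3),
    ToyEngine("decode-b", in_flight=1),
]
tokens = asyncio.run(collect(servers, [1, 2, 3]))
```

The per-engine log makes the call order visible: the decode engine records prep_recv before start_generate, and the prefill engine records remote_send only when something is left to prefill.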

Round-Robin Flow

In round-robin mode, the translation is simpler: it forwards the full request to the least-loaded endpoint.

round_robin_flow(request):
    1. endpoint = pick_least_loaded(all_servers)
    2. yield from forward_request(endpoint, request)
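
One way to realize pick_least_loaded is to count in-flight requests per endpoint, as in the sketch below. The LoadTracker class and the in-flight metric are assumptions made for illustration; the source does not specify which load signal the heuristic uses, and forward_request here is a placeholder for a real HTTP call.

```python
import asyncio
from typing import AsyncGenerator, List


class LoadTracker:
    """Counts in-flight requests per endpoint (a hypothetical load metric)."""

    def __init__(self, num_endpoints: int) -> None:
        self.in_flight = [0] * num_endpoints

    def pick_least_loaded(self) -> int:
        # min() returns the first minimum, so ties go to the lowest index.
        return min(range(len(self.in_flight)), key=lambda i: self.in_flight[i])


async def forward_request(endpoint: int, request: str) -> AsyncGenerator[str, None]:
    # Placeholder for sending the whole request to one engine.
    await asyncio.sleep(0)
    yield f"endpoint-{endpoint} handled {request!r}"


async def round_robin_flow(tracker: LoadTracker, request: str) -> AsyncGenerator[str, None]:
    endpoint = tracker.pick_least_loaded()
    tracker.in_flight[endpoint] += 1
    try:
        async for response in forward_request(endpoint, request):
            yield response
    finally:
        # Release the slot even if the stream fails or is cancelled.
        tracker.in_flight[endpoint] -= 1


async def collect(tracker: LoadTracker, request: str) -> List[str]:
    return [r async for r in round_robin_flow(tracker, request)]


tracker = LoadTracker(3)
tracker.in_flight = [2, 0, 1]
result = asyncio.run(collect(tracker, "hi"))
```

The try/finally keeps the counter accurate when a stream ends early; without it, a failed request would leave its endpoint looking permanently busier than it is.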

Extensibility via Subclassing

Because translate_request is a regular instance method, custom routing policies can be implemented by subclassing Router and overriding just this method. The serve() function accepts a router_type parameter for this purpose, allowing the deployment to inject custom router implementations without modifying the framework code.
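
The override point can be illustrated with a self-contained sketch. The Router class below is a minimal stand-in, not the framework's class, and build_router only plays the role that the router_type parameter has in serve(); PriorityRouter and its "urgent:" convention are hypothetical. Requests are plain strings to keep the example runnable without the framework installed.

```python
import asyncio
from typing import AsyncGenerator, List, Type


class Router:
    """Minimal stand-in for the framework's Router; not the real class."""

    def __init__(self, endpoints: List[str]) -> None:
        self.endpoints = endpoints

    async def translate_request(
        self, request: str, request_id: str
    ) -> AsyncGenerator[str, None]:
        # Default policy in this sketch: always use the first endpoint.
        yield f"{self.endpoints[0]} <- {request_id}"

    async def handle_completion(
        self, request: str, request_id: str
    ) -> AsyncGenerator[str, None]:
        # Stable infrastructure: the policy decision is delegated to translate_request.
        async for response in self.translate_request(request, request_id):
            yield response


class PriorityRouter(Router):
    """Hypothetical policy: requests tagged 'urgent:' go to a reserved endpoint."""

    async def translate_request(
        self, request: str, request_id: str
    ) -> AsyncGenerator[str, None]:
        urgent = request.startswith("urgent:")
        target = self.endpoints[-1] if urgent else self.endpoints[0]
        yield f"{target} <- {request_id}"


def build_router(endpoints: List[str], router_type: Type[Router] = Router) -> Router:
    # The caller injects the class, so the surrounding code never names a concrete policy.
    return router_type(endpoints)


async def collect(router: Router, request: str) -> List[str]:
    return [r async for r in router.handle_completion(request, "req-7")]


custom = build_router(["e0", "e1"], router_type=PriorityRouter)
routed = asyncio.run(collect(custom, "urgent: summarize this"))
```

Only translate_request changes between the two classes; handle_completion is inherited unchanged, which is the separation the Strategy pattern is meant to provide.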
