Principle:Mlc ai Mlc llm Custom Request Routing
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Distributed_Serving |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Custom request routing is the technique of translating high-level API requests into orchestrated sequences of low-level microserving operations, enabling programmable dispatch policies that determine how each request flows through a multi-engine inference system.
Description
In disaggregated LLM serving, a single user-facing API request (e.g., an OpenAI-compatible completion request) does not map directly to a single backend call. Instead, it must be decomposed into a coordinated sequence of microserving operations that may span multiple engines. Custom request routing provides a programmable translation layer that sits between the user-facing API and the internal microserving protocol.
The key responsibilities of the routing layer are:
- Request translation: Converting a standard CompletionRequest into the appropriate sequence of microserving calls (prep_recv, remote_send, start_generate) based on the selected routing policy.
- Policy selection: Choosing between routing strategies (e.g., disaggregated prefill-decode or round-robin) based on the router's configuration, and potentially switching strategies dynamically.
- Endpoint selection: Deciding which specific engine endpoints handle the prefill and decode phases of each request, using load-aware heuristics.
- Extensibility: Providing a clean override point (translate_request) where custom subclasses can implement entirely new routing strategies without modifying the rest of the infrastructure.
This design follows the Strategy pattern: the routing policy is encapsulated in a method that can be swapped or overridden, while the surrounding infrastructure (request parsing, retry logic, response streaming) remains stable.
Usage
Use custom request routing when:
- You are implementing a disaggregated serving system that needs to decompose requests into multi-step microserving protocols.
- You want to support multiple routing strategies (disaggregated, round-robin, or custom) behind a single API interface.
- You need load-aware request dispatch to balance work across multiple engine endpoints.
- You are building a custom router subclass that implements domain-specific scheduling policies (e.g., priority-based routing, cost-aware routing, or latency-optimized routing).
Theoretical Basis
Request Translation as a Strategy Pattern
The routing layer implements a clean separation between the external API contract and the internal dispatch logic:
    translate_request(request, request_id) -> AsyncGenerator[response]:
        match router_mode:
            case "disagg":
                yield from disaggregated_flow(request, request_id)
            case "round-robin":
                yield from round_robin_flow(request)
            case custom:
                yield from custom_flow(request, request_id)
Each flow is an async generator that yields CompletionResponse objects (for streaming) or a single response (for non-streaming). The generator can also yield None to signal preemption, which the outer retry loop in handle_completion interprets as a request to restart the translation.
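The dispatch and retry behavior described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MLC LLM implementation: SketchRouter, its preempt_times parameter, and the string-based requests and responses are hypothetical stand-ins. Python async generators do not support `yield from`, so the sketch re-yields each flow with `async for`.

```python
import asyncio
from typing import AsyncGenerator, Optional


class SketchRouter:
    """Hypothetical stand-in for a router; not the MLC LLM class."""

    def __init__(self, router_mode: str, preempt_times: int = 0) -> None:
        self.router_mode = router_mode
        # Assumption for the demo: the first `preempt_times` attempts are preempted.
        self._preempts_left = preempt_times

    async def translate_request(
        self, request: str, request_id: str
    ) -> AsyncGenerator[Optional[str], None]:
        # Select the flow for the configured routing mode.
        if self.router_mode == "disagg":
            flow = self._disagg_flow(request, request_id)
        elif self.router_mode == "round-robin":
            flow = self._round_robin_flow(request)
        else:
            raise ValueError(f"unknown router mode: {self.router_mode}")
        # Re-yield every item of the selected flow.
        async for response in flow:
            yield response

    async def _disagg_flow(self, request: str, request_id: str):
        if self._preempts_left > 0:
            self._preempts_left -= 1
            yield None  # signal preemption before any output is produced
            return
        for token in request.split():
            yield f"{request_id}:{token}"

    async def _round_robin_flow(self, request: str):
        for token in request.split():
            yield token

    async def handle_completion(
        self, request: str, request_id: str
    ) -> AsyncGenerator[str, None]:
        # Outer retry loop: a None from the translation restarts it from scratch.
        while True:
            attempt = self.translate_request(request, request_id)
            preempted = False
            async for response in attempt:
                if response is None:
                    preempted = True
                    break
                yield response
            await attempt.aclose()  # release the abandoned or finished attempt
            if not preempted:
                return
```

In this sketch preemption is only signalled before any response has been emitted, so a restart never duplicates output; a real router would need its own policy for partially streamed responses.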
Disaggregated Flow
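In disaggregated mode, the translation decomposes a single request into up to three sequential microserving calls. The remote_send step is skipped when the prep_recv result shows that the request is already fully cached: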
    disaggregated_flow(request, request_id):
        1. prefill_server = server[0]  (fixed)
        2. decode_server = pick_least_loaded(server[1:])
        3. metadata = prep_recv(decode_server, request)
        4. if not fully_cached:
               remote_send(prefill_server, request, metadata)
        5. yield from start_generate(decode_server, request)
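The five steps above can be exercised end to end with in-memory fakes. The sketch below is illustrative only: FakeEngine, its fields, the metadata keys (prefix_matched_length, kv_addr), and the rule that "fully cached" means the matched prefix covers the whole prompt are all assumptions made for the example, not the actual microserving payloads.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncGenerator, List, Sequence


@dataclass
class FakeEngine:
    """Hypothetical in-memory engine that records which calls it received."""

    name: str
    running: int = 0            # requests currently in flight
    cached_prefix_len: int = 0  # how many prompt tokens are already cached
    log: List[str] = field(default_factory=list)

    async def prep_recv(self, prompt: Sequence[int]) -> dict:
        self.log.append("prep_recv")
        matched = min(self.cached_prefix_len, len(prompt))
        return {"prefix_matched_length": matched, "kv_addr": f"{self.name}/kv"}

    async def remote_send(self, prompt: Sequence[int], metadata: dict) -> None:
        self.log.append("remote_send")

    async def start_generate(self, prompt: Sequence[int]) -> AsyncGenerator[str, None]:
        self.log.append("start_generate")
        for i in range(3):
            yield f"{self.name}-tok{i}"


async def disaggregated_flow(
    servers: Sequence[FakeEngine], prompt: Sequence[int]
) -> AsyncGenerator[str, None]:
    # Steps 1-2: fixed prefill engine, least-loaded decode engine.
    prefill_server = servers[0]
    decode_server = min(servers[1:], key=lambda s: s.running)
    decode_server.running += 1
    try:
        # Step 3: ask the decode engine to prepare to receive KV data.
        metadata = await decode_server.prep_recv(prompt)
        # Step 4: skip the transfer when the whole prompt is already cached.
        fully_cached = metadata["prefix_matched_length"] >= len(prompt)
        if not fully_cached:
            await prefill_server.remote_send(prompt, metadata)
        # Step 5: stream the generated chunks from the decode engine.
        async for chunk in decode_server.start_generate(prompt):
            yield chunk
    finally:
        decode_server.running -= 1
```

The load counter is decremented in a finally block so that a failed or cancelled request does not leave the decode engine looking permanently busy.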
Round-Robin Flow
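In round-robin mode, the translation is simpler: no decomposition is needed, and the full request is forwarded to a single endpoint. Despite the mode's name, the flow below selects the least-loaded endpoint rather than cycling through endpoints in a fixed order: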
    round_robin_flow(request):
        1. endpoint = pick_least_loaded(all_servers)
        2. yield from forward_request(endpoint, request)
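One simple way to realize pick_least_loaded is to keep a per-endpoint count of in-flight requests and choose the minimum. The sketch below assumes that heuristic; LoadTracker, route, and the use of integer endpoint indices are hypothetical names introduced for illustration.

```python
from typing import Iterable, List


class LoadTracker:
    """Hypothetical per-endpoint counter of in-flight requests."""

    def __init__(self, num_servers: int) -> None:
        self.num_running: List[int] = [0] * num_servers

    def pick_least_loaded(self, candidates: Iterable[int]) -> int:
        # min() returns the first minimum, so ties go to the earliest candidate.
        return min(candidates, key=lambda i: self.num_running[i])

    def start(self, server_id: int) -> None:
        self.num_running[server_id] += 1

    def finish(self, server_id: int) -> None:
        self.num_running[server_id] -= 1


def route(tracker: LoadTracker, candidates: Iterable[int]) -> int:
    """Pick an endpoint for a new request and mark it as busy."""
    server_id = tracker.pick_least_loaded(candidates)
    tracker.start(server_id)
    return server_id
```

Passing the candidate set explicitly lets the same helper serve both flows: all endpoints for round-robin mode, or only the decode endpoints (indices 1 and up) for disaggregated mode.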
Extensibility via Subclassing
Because translate_request is a regular instance method, custom routing policies can be implemented by subclassing Router and overriding just this method. The serve() function accepts a router_type parameter for this purpose, allowing the deployment to inject custom router implementations without modifying the framework code.
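As a rough illustration of this override point, the sketch below defines a stand-in base class and a priority-aware subclass. BaseRouter, PriorityRouter, build_router, and the "priority" request field are hypothetical; only the idea of overriding translate_request and passing the class through a router_type parameter comes from the text above.

```python
import asyncio
from typing import AsyncGenerator, List, Type


class BaseRouter:
    """Hypothetical stand-in for the framework's router base class."""

    def __init__(self, endpoints: List[str]) -> None:
        self.endpoints = endpoints

    async def translate_request(
        self, request: dict, request_id: str
    ) -> AsyncGenerator[str, None]:
        # Default policy in this sketch: always use the first endpoint.
        yield f"{self.endpoints[0]} handled {request_id}"

    async def handle_completion(self, request: dict, request_id: str) -> List[str]:
        # Stable infrastructure: only calls the overridable method.
        return [r async for r in self.translate_request(request, request_id)]


class PriorityRouter(BaseRouter):
    """Custom policy: high-priority requests get the reserved first endpoint."""

    async def translate_request(
        self, request: dict, request_id: str
    ) -> AsyncGenerator[str, None]:
        if request.get("priority") == "high":
            target = self.endpoints[0]
        else:
            target = self.endpoints[-1]
        yield f"{target} handled {request_id}"


def build_router(
    endpoints: List[str], router_type: Type[BaseRouter] = BaseRouter
) -> BaseRouter:
    # Mirrors the idea of a router_type parameter: the caller injects the class.
    return router_type(endpoints)
```

Because handle_completion only depends on the translate_request signature, swapping in PriorityRouter changes the dispatch policy without touching the surrounding request handling.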