Implementation:Mlc ai Mlc llm Router Translate request
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Distributed_Serving |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for programmable request routing that translates incoming API requests into orchestrated multi-engine operations, provided by MLC-LLM.
Description
Router.translate_request is the central dispatch method that converts an OpenAI-compatible CompletionRequest into the appropriate sequence of microserving API calls based on the router's configured mode. It acts as the strategy selector in the routing pipeline:
- In "disagg" mode, it delegates to _handle_completion_disagg, which executes the three-step disaggregated serving protocol (prepare-receive, remote-send, start-generate) across separate prefill and decode engines.
- In "round-robin" mode, it delegates to _handle_completion_round_robin, which forwards the entire request to the least-loaded engine endpoint.
The method is an async generator that yields CompletionResponse objects for streaming consumption. It can also yield None to signal that the request was preempted and should be retried by the caller (handle_completion).
This method serves as the primary extension point for custom routing -- subclasses can override it to implement entirely different routing strategies.
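The dispatch described above can be sketched with a self-contained toy class. This is a minimal illustration, not the MLC-LLM source: ToyRouter, its calls list, the plain-string responses, and the ValueError for an unknown mode are all assumptions made for the example. Only the two mode names and the handler names come from the description.

```python
import asyncio
from typing import AsyncGenerator, List


class ToyRouter:
    """Hypothetical stand-in for Router; only the mode dispatch is modelled."""

    def __init__(self, router_mode: str) -> None:
        self.router_mode = router_mode
        self.calls: List[str] = []  # records which microserving steps "ran"

    async def _handle_completion_disagg(
        self, request: dict, request_id: str
    ) -> AsyncGenerator[str, None]:
        # Stand-ins for the three protocol steps; the real ones are network calls.
        self.calls += ["prepare-receive", "remote-send", "start-generate"]
        for chunk in ("decode-chunk-0", "decode-chunk-1"):
            yield chunk

    async def _handle_completion_round_robin(
        self, request: dict
    ) -> AsyncGenerator[str, None]:
        self.calls.append("forward-to-one-engine")
        yield "full-response"

    async def translate_request(
        self, request: dict, request_id: str
    ) -> AsyncGenerator[str, None]:
        # Strategy selection: re-yield whatever the chosen handler produces.
        if self.router_mode == "disagg":
            async for response in self._handle_completion_disagg(request, request_id):
                yield response
        elif self.router_mode == "round-robin":
            async for response in self._handle_completion_round_robin(request):
                yield response
        else:
            # Assumed behaviour for the sketch: reject unknown modes.
            raise ValueError(f"unknown router_mode: {self.router_mode!r}")


async def collect(router: ToyRouter) -> List[str]:
    request = {"prompt": [1, 2, 3]}  # already-tokenized prompt
    return [r async for r in router.translate_request(request, "req-0")]


disagg_router = ToyRouter("disagg")
disagg_out = asyncio.run(collect(disagg_router))
rr_router = ToyRouter("round-robin")
rr_out = asyncio.run(collect(rr_router))
```

Because translate_request only re-yields what the selected handler produces, a subclass can swap in a different strategy without changing the caller.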
Usage
Use translate_request when you need to process a completion request through the router's dispatch logic. It is called internally by handle_completion within a retry loop that handles preemption. Override this method in a Router subclass to implement custom routing policies.
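The retry loop mentioned above might look roughly like the following. This is a sketch under stated assumptions: FlakyRouter, its attempts counter, and the string chunks are hypothetical, and the real handle_completion may do additional work (such as preparing the request) that is not modelled here. The only behaviour taken from the description is that a None yield causes the caller to retry.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class FlakyRouter:
    """Hypothetical router whose first attempt is preempted."""

    def __init__(self) -> None:
        self.attempts = 0

    async def translate_request(
        self, request: dict, request_id: str
    ) -> AsyncGenerator[Optional[str], None]:
        self.attempts += 1
        if self.attempts == 1:
            yield None  # preemption signal on the first attempt
            return
        for chunk in ("Paris", "."):
            yield chunk

    async def handle_completion(
        self, request: dict, request_id: str
    ) -> AsyncGenerator[str, None]:
        completed = False
        while not completed:
            completed = True
            async for response in self.translate_request(request, request_id):
                if response is None:
                    # Preempted: drop this attempt and start a new one.
                    completed = False
                    break
                yield response


async def main() -> List[str]:
    return [c async for c in router.handle_completion({"prompt": [1, 2]}, "req-1")]


router = FlakyRouter()
chunks = asyncio.run(main())
attempts = router.attempts
```

The caller of handle_completion never sees the None; it only receives responses from the attempt that runs to completion.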
Code Reference
Source Location
- Repository: MLC-LLM
- File: python/mlc_llm/router/router.py (Lines 133-148)
Signature
async def translate_request(
    self,
    request: openai_api_protocol.CompletionRequest,
    request_id: str,
) -> AsyncGenerator[openai_api_protocol.CompletionResponse, Any]:
Import
from mlc_llm.router import Router
# translate_request is an instance method on Router
router = Router(model="your-model", ...)
# Called as: router.translate_request(request, request_id)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| request | openai_api_protocol.CompletionRequest | Yes | The OpenAI-compatible completion request to route. The prompt field should already be tokenized (list of ints) by the caller (handle_completion). |
| request_id | str | Yes | A unique identifier for this request, used for tracking across microserving calls. In disaggregated mode, this is set as the user field of the request. |
Outputs
| Name | Type | Description |
|---|---|---|
| (yields) | AsyncGenerator[openai_api_protocol.CompletionResponse, Any] | An async generator that yields CompletionResponse objects. In streaming mode, multiple response chunks are yielded. In non-streaming mode, a single response is yielded. A None yield signals preemption (the request was interrupted and should be retried). |
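A consumer of this output contract has to cover three cases: several chunks, a single response, and a None preemption signal. The sketch below shows one way to do that. ToyChoice, ToyResponse, fake_stream, and collect_text are hypothetical names that mimic only the choices[0].text shape used in the usage examples; they are not part of MLC-LLM.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List, Optional, Tuple


@dataclass
class ToyChoice:
    text: str


@dataclass
class ToyResponse:
    choices: List[ToyChoice]


async def fake_stream(
    chunks: List[Optional[str]],
) -> AsyncGenerator[Optional[ToyResponse], None]:
    # Stand-in for translate_request: a None entry models preemption.
    for chunk in chunks:
        yield None if chunk is None else ToyResponse([ToyChoice(chunk)])


async def collect_text(
    responses: AsyncGenerator[Optional[ToyResponse], None],
) -> Tuple[str, bool]:
    """Return (text received so far, whether the request was preempted)."""
    parts: List[str] = []
    async for response in responses:
        if response is None:
            return "".join(parts), True
        parts.append(response.choices[0].text)
    return "".join(parts), False


streamed = asyncio.run(collect_text(fake_stream([" Paris", "."])))
single = asyncio.run(collect_text(fake_stream([" Paris."])))
preempted = asyncio.run(collect_text(fake_stream([" Par", None])))
```

Streaming and non-streaming requests go through the same loop; the only difference is how many responses arrive before the generator finishes.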
Usage Examples
Basic Usage
import asyncio

from mlc_llm.router import Router
from mlc_llm.protocol.openai_api_protocol import CompletionRequest

router = Router(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    hosts=["127.0.0.1", "127.0.0.1"],
    ports=[8080, 8081],
    num_gpus=[1, 1],
    router_mode="disagg",
)

# Note: translate_request expects request.prompt to be tokenized already
# (a list of ints); handle_completion normally does this before calling it.
# The string prompt is kept here for readability only.
request = CompletionRequest(
    model="Llama-2-7b-chat",
    prompt="The capital of France is",
    max_tokens=50,
    stream=True,
)
request_id = "cmpl-abc123"


async def main():
    # Consume the async generator ("async for" is only valid inside a coroutine)
    async for response in router.translate_request(request, request_id):
        if response is None:
            # Preemption signal -- caller should retry
            break
        print(response.choices[0].text, end="")


asyncio.run(main())
Custom Router Subclass
from typing import Any, AsyncGenerator

from mlc_llm.router import Router
from mlc_llm.protocol import openai_api_protocol


class PriorityRouter(Router):
    """A custom router that routes high-priority requests to dedicated engines."""

    async def translate_request(
        self,
        request: openai_api_protocol.CompletionRequest,
        request_id: str,
    ) -> AsyncGenerator[openai_api_protocol.CompletionResponse, Any]:
        # Custom routing logic based on request properties
        if request.user and request.user.startswith("priority-"):
            # Route to a dedicated high-priority endpoint
            async for response in self._handle_completion_round_robin(request):
                yield response
        else:
            # Fall back to disaggregated routing
            async for response in self._handle_completion_disagg(
                request, request_id
            ):
                yield response