Implementation:Mlc ai Mlc llm Router Translate request

Knowledge Sources	MLC-LLM
Domains	Deep_Learning, Distributed_Serving
Last Updated	2026-02-09 00:00 GMT

Overview

A concrete MLC-LLM tool for programmable request routing: it translates incoming API requests into orchestrated multi-engine operations.

Description

Router.translate_request is the central dispatch method that converts an OpenAI-compatible CompletionRequest into the appropriate sequence of microserving API calls based on the router's configured mode. It acts as the strategy selector in the routing pipeline:

In "disagg" mode, it delegates to _handle_completion_disagg, which executes the three-step disaggregated serving protocol (prepare-receive, remote-send, start-generate) across separate prefill and decode engines.
In "round-robin" mode, it delegates to _handle_completion_round_robin, which forwards the entire request to the least-loaded engine endpoint.
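As a rough illustration of what "least-loaded" selection could look like, the sketch below picks the endpoint with the fewest running requests per GPU. The page does not say how load is measured, so the per-GPU metric, the function name `pick_least_loaded`, and its arguments are assumptions made for illustration and are not taken from the MLC-LLM source.

```python
from typing import Sequence


def pick_least_loaded(running: Sequence[int], num_gpus: Sequence[int]) -> int:
    """Return the index of the endpoint with the fewest running requests per GPU.

    Hypothetical helper: the load metric here is an assumption.
    """
    if not running or len(running) != len(num_gpus):
        raise ValueError("running and num_gpus must be non-empty and the same length")
    # min() keeps the first minimum it sees, so ties resolve to the lowest index.
    return min(range(len(running)), key=lambda i: running[i] / num_gpus[i])
```

For example, with `running=[4, 3]` and `num_gpus=[2, 1]` the first endpoint carries 2 requests per GPU and the second carries 3, so index 0 is chosen.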

The method is an async generator that yields CompletionResponse objects for streaming consumption. It can also yield None to signal that the request was preempted and should be retried by the caller (handle_completion).
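The mode-based dispatch described above can be sketched as an async generator that re-yields whatever the selected handler produces. This is a minimal, self-contained illustration and not the MLC-LLM source: `SketchRouter`, its two stub handlers, the string responses, and the error raised for an unknown mode are all hypothetical stand-ins.

```python
import asyncio
from typing import Any, AsyncGenerator, List


class SketchRouter:
    """Hypothetical stand-in for Router; only the dispatch shape is shown."""

    def __init__(self, router_mode: str) -> None:
        self.router_mode = router_mode

    async def _handle_disagg(
        self, request: str, request_id: str
    ) -> AsyncGenerator[str, Any]:
        # Stub for the three-step disaggregated path.
        yield f"disagg:{request_id}:{request}"

    async def _handle_round_robin(self, request: str) -> AsyncGenerator[str, Any]:
        # Stub for forwarding the whole request to one engine.
        yield f"round-robin:{request}"

    async def translate_request(
        self, request: str, request_id: str
    ) -> AsyncGenerator[str, Any]:
        # Select a strategy from the configured mode and re-yield its responses.
        if self.router_mode == "disagg":
            async for response in self._handle_disagg(request, request_id):
                yield response
        elif self.router_mode == "round-robin":
            async for response in self._handle_round_robin(request):
                yield response
        else:
            raise ValueError(f"Unsupported router mode: {self.router_mode}")


async def collect(router: SketchRouter, request: str, request_id: str) -> List[str]:
    # Drain the async generator into a list.
    return [r async for r in router.translate_request(request, request_id)]


def run_dispatch(mode: str) -> List[str]:
    return asyncio.run(collect(SketchRouter(mode), "hello", "req-1"))
```

Because the dispatcher is itself an async generator, callers consume both modes the same way and never need to know which handler ran.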

This method serves as the primary extension point for custom routing -- subclasses can override it to implement entirely different routing strategies.

Usage

Use translate_request when you need to process a completion request through the router's dispatch logic. It is called internally by handle_completion within a retry loop that handles preemption. Override this method in a Router subclass to implement custom routing policies.
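The retry loop around translate_request might look roughly like the sketch below, which restarts the request whenever a `None` is yielded. It is an assumption-laden illustration rather than the actual handle_completion body: `FlakyRouter` is a hypothetical class whose first attempt is always preempted, and plain strings stand in for `CompletionResponse` objects.

```python
import asyncio
from typing import Any, AsyncGenerator, List, Optional, Tuple


class FlakyRouter:
    """Hypothetical router whose first attempt is preempted."""

    def __init__(self) -> None:
        self.attempts = 0

    async def translate_request(
        self, request: str, request_id: str
    ) -> AsyncGenerator[Optional[str], Any]:
        self.attempts += 1
        if self.attempts == 1:
            yield None  # preemption signal on the first attempt
            return
        for chunk in ("Par", "is"):
            yield chunk

    async def handle_completion(
        self, request: str, request_id: str
    ) -> AsyncGenerator[str, Any]:
        completed = False
        while not completed:
            completed = True
            async for response in self.translate_request(request, request_id):
                if response is None:
                    # Preempted: abandon this attempt and start over.
                    completed = False
                    break
                yield response


def run_with_retry() -> Tuple[List[str], int]:
    router = FlakyRouter()

    async def _collect() -> List[str]:
        return [c async for c in router.handle_completion("q", "req-1")]

    chunks = asyncio.run(_collect())
    return chunks, router.attempts
```

Note that this sketch does not address chunks already delivered before a preemption; how the real router treats partial output is not described on this page.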

Code Reference

Source Location

Repository: MLC-LLM
File: python/mlc_llm/router/router.py (Lines 133-148)

Signature

async def translate_request(
    self,
    request: openai_api_protocol.CompletionRequest,
    request_id: str,
) -> AsyncGenerator[openai_api_protocol.CompletionResponse, Any]:

Import

from mlc_llm.router import Router

# translate_request is an instance method on Router
router = Router(model="your-model")  # hosts, ports, num_gpus, router_mode omitted here
# Called as: router.translate_request(request, request_id)

I/O Contract

Inputs

Name	Type	Required	Description
request	`openai_api_protocol.CompletionRequest`	Yes	The OpenAI-compatible completion request to route. The `prompt` field should already be tokenized (list of ints) by the caller (`handle_completion`).
request_id	`str`	Yes	A unique identifier for this request, used for tracking across microserving calls. In disaggregated mode, this is set as the `user` field of the request.
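Since the prompt is expected to arrive tokenized, a caller-side guard could be sketched as below. `ensure_tokenized` and `toy_encode` are hypothetical names introduced only for this example; a real caller would pass the encode function of the model's tokenizer instead of the toy encoder.

```python
from typing import Callable, List, Union


def ensure_tokenized(
    prompt: Union[str, List[int]],
    encode: Callable[[str], List[int]],
) -> List[int]:
    """Return token ids, encoding only when the prompt is still a string."""
    if isinstance(prompt, str):
        return encode(prompt)
    # Already tokenized: return a copy so the caller's list is not shared.
    return list(prompt)


def toy_encode(text: str) -> List[int]:
    # Toy encoder for illustration only: one id per character code point.
    return [ord(ch) for ch in text]
```

Applying the guard before dispatch keeps translate_request free of string handling, so a retried request does not need to be encoded again.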

Outputs

Name	Type	Description
(yields)	`AsyncGenerator[openai_api_protocol.CompletionResponse, Any]`	An async generator that yields `CompletionResponse` objects. In streaming mode, multiple response chunks are yielded. In non-streaming mode, a single response is yielded. A `None` yield signals preemption (the request was interrupted and should be retried).

Usage Examples

Basic Usage

import asyncio

from mlc_llm.router import Router
from mlc_llm.protocol.openai_api_protocol import CompletionRequest

router = Router(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    hosts=["127.0.0.1", "127.0.0.1"],
    ports=[8080, 8081],
    num_gpus=[1, 1],
    router_mode="disagg",
)

# Note: translate_request expects request.prompt to be a list of token ids;
# handle_completion performs that step. Replace the string below with token
# ids when calling translate_request directly.
request = CompletionRequest(
    model="Llama-2-7b-chat",
    prompt="The capital of France is",
    max_tokens=50,
    stream=True,
)
request_id = "cmpl-abc123"


async def main() -> None:
    # Consume the async generator; `async for` must run inside a coroutine
    async for response in router.translate_request(request, request_id):
        if response is None:
            # Preemption signal -- caller should retry
            break
        print(response.choices[0].text, end="")


asyncio.run(main())

Custom Router Subclass

from typing import Any, AsyncGenerator
from mlc_llm.router import Router
from mlc_llm.protocol import openai_api_protocol

class PriorityRouter(Router):
    """A custom router that forwards high-priority requests whole to a single
    engine and sends all other requests through disaggregated serving."""

    async def translate_request(
        self,
        request: openai_api_protocol.CompletionRequest,
        request_id: str,
    ) -> AsyncGenerator[openai_api_protocol.CompletionResponse, Any]:
        # Custom routing logic based on request properties
        if request.user and request.user.startswith("priority-"):
            # Forward the whole request to one least-loaded engine,
            # skipping the prefill/decode split
            async for response in self._handle_completion_round_robin(request):
                yield response
        else:
            # Fall back to disaggregated routing
            async for response in self._handle_completion_disagg(
                request, request_id
            ):
                yield response

Related Pages

Implements Principle

Principle:Mlc_ai_Mlc_llm_Custom_Request_Routing
