Implementation:Mlc ai Mlc llm Router Translate request

From Leeroopedia


Knowledge Sources
Domains: Deep_Learning, Distributed_Serving
Last Updated: 2026-02-09 00:00 GMT

Overview

Concrete tool for programmable request routing that translates incoming API requests into orchestrated multi-engine operations, provided by MLC-LLM.

Description

Router.translate_request is the central dispatch method that converts an OpenAI-compatible CompletionRequest into the appropriate sequence of microserving API calls based on the router's configured mode. It acts as the strategy selector in the routing pipeline:

  • In "disagg" mode, it delegates to _handle_completion_disagg, which executes the three-step disaggregated serving protocol (prepare-receive, remote-send, start-generate) across separate prefill and decode engines.
  • In "round-robin" mode, it delegates to _handle_completion_round_robin, which forwards the entire request to the least-loaded engine endpoint.
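
The mode-based dispatch described above can be sketched without MLC-LLM installed. This is a minimal illustration, not the library's source: ToyRouter, the handler bodies, and the string responses are hypothetical stand-ins, and only the mode names and handler method names are taken from the description. Rejecting an unknown mode with ValueError is an assumption of this sketch.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class ToyRouter:
    """Hypothetical stand-in for Router; yields strings, not CompletionResponse."""

    def __init__(self, router_mode: str) -> None:
        self.router_mode = router_mode

    async def _handle_completion_disagg(
        self, request: str, request_id: str
    ) -> AsyncGenerator[Optional[str], None]:
        # Placeholder for prepare-receive, remote-send, start-generate.
        yield f"disagg:{request_id}:{request}"

    async def _handle_completion_round_robin(
        self, request: str
    ) -> AsyncGenerator[Optional[str], None]:
        # Placeholder for forwarding the whole request to one endpoint.
        yield f"round-robin:{request}"

    async def translate_request(
        self, request: str, request_id: str
    ) -> AsyncGenerator[Optional[str], None]:
        # Select a handler from the configured mode and re-yield its output.
        if self.router_mode == "disagg":
            async for response in self._handle_completion_disagg(request, request_id):
                yield response
        elif self.router_mode == "round-robin":
            async for response in self._handle_completion_round_robin(request):
                yield response
        else:
            # Assumption of this sketch: unknown modes are rejected.
            raise ValueError(f"unsupported router_mode: {self.router_mode!r}")


async def collect(router: ToyRouter, request: str, request_id: str) -> List[Optional[str]]:
    """Drain translate_request into a list for inspection."""
    return [r async for r in router.translate_request(request, request_id)]
```

Because translate_request only re-yields what the selected handler produces, the caller sees the same async-generator interface regardless of mode.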

The method is an async generator that yields CompletionResponse objects for streaming consumption. It can also yield None to signal that the request was preempted and should be retried by the caller (handle_completion).

This method serves as the primary extension point for custom routing -- subclasses can override it to implement entirely different routing strategies.

Usage

Use translate_request when you need to process a completion request through the router's dispatch logic. It is called internally by handle_completion within a retry loop that handles preemption. Override this method in a Router subclass to implement custom routing policies.
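
The retry-on-preemption contract can be sketched as follows. This is a simplified illustration, not the body of handle_completion: consume_with_retry, max_retries, and make_flaky_source are hypothetical names, and the stream yields plain strings in place of CompletionResponse objects. The sketch discards chunks from a preempted attempt and restarts from scratch; how the real caller treats partial output is not specified here.

```python
import asyncio
from typing import AsyncGenerator, Callable, List, Optional

StreamFactory = Callable[[], AsyncGenerator[Optional[str], None]]


async def consume_with_retry(make_stream: StreamFactory, max_retries: int = 3) -> List[str]:
    """Collect all chunks, restarting the stream when None signals preemption."""
    for _ in range(max_retries + 1):
        chunks: List[str] = []
        preempted = False
        stream = make_stream()
        try:
            async for item in stream:
                if item is None:
                    preempted = True  # preemption signal: abandon this attempt
                    break
                chunks.append(item)
        finally:
            await stream.aclose()  # release the generator even after an early break
        if not preempted:
            return chunks
    raise RuntimeError("request was preempted on every attempt")


def make_flaky_source(preemptions: int) -> StreamFactory:
    """Hypothetical stream factory: the first `preemptions` attempts are preempted."""
    state = {"attempts": 0}

    async def stream() -> AsyncGenerator[Optional[str], None]:
        state["attempts"] += 1
        yield "Par"
        if state["attempts"] <= preemptions:
            yield None
            return
        yield "is"

    return stream
```

Bounding the number of retries is a design choice of this sketch; it keeps a request that is repeatedly preempted from looping forever.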

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/router/router.py (Lines 133-148)

Signature

async def translate_request(
    self,
    request: openai_api_protocol.CompletionRequest,
    request_id: str,
) -> AsyncGenerator[openai_api_protocol.CompletionResponse, Any]:

Import

from mlc_llm.router import Router

# translate_request is an instance method on Router
router = Router(model="your-model", ...)
# Called as: router.translate_request(request, request_id)

I/O Contract

Inputs

Name | Type | Required | Description
request | openai_api_protocol.CompletionRequest | Yes | The OpenAI-compatible completion request to route. The prompt field should already be tokenized (list of ints) by the caller (handle_completion).
request_id | str | Yes | A unique identifier for this request, used for tracking across microserving calls. In disaggregated mode, this is set as the user field of the request.

Outputs

Name | Type | Description
(yields) | AsyncGenerator[openai_api_protocol.CompletionResponse, Any] | An async generator that yields CompletionResponse objects. In streaming mode, multiple response chunks are yielded. In non-streaming mode, a single response is yielded. A None yield signals preemption (the request was interrupted and should be retried).

Usage Examples

Basic Usage

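import asyncio

from mlc_llm.router import Router
from mlc_llm.protocol.openai_api_protocol import CompletionRequest

router = Router(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    hosts=["127.0.0.1", "127.0.0.1"],
    ports=[8080, 8081],
    num_gpus=[1, 1],
    router_mode="disagg",
)

request = CompletionRequest(
    model="Llama-2-7b-chat",
    # Per the I/O contract, translate_request expects a prompt that the caller
    # (handle_completion) has already tokenized into a list of ints. The raw
    # string here is a readable placeholder for that tokenized prompt.
    prompt="The capital of France is",
    max_tokens=50,
    stream=True,
)
request_id = "cmpl-abc123"


async def main() -> None:
    # `async for` is only valid inside a coroutine, so consume the generator here.
    async for response in router.translate_request(request, request_id):
        if response is None:
            # Preemption signal -- caller should retry
            break
        print(response.choices[0].text, end="")


asyncio.run(main())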

Custom Router Subclass

from typing import Any, AsyncGenerator
from mlc_llm.router import Router
from mlc_llm.protocol import openai_api_protocol

class PriorityRouter(Router):
    """A custom router that picks a routing strategy per request."""

    async def translate_request(
        self,
        request: openai_api_protocol.CompletionRequest,
        request_id: str,
    ) -> AsyncGenerator[openai_api_protocol.CompletionResponse, Any]:
        # Custom routing logic based on request properties
        if request.user and request.user.startswith("priority-"):
            # Priority requests bypass the disaggregated protocol and go
            # through the built-in round-robin handler. Routing to a dedicated
            # endpoint pool would need additional endpoint-selection logic.
            async for response in self._handle_completion_round_robin(request):
                yield response
        else:
            # Fall back to disaggregated routing
            async for response in self._handle_completion_disagg(
                request, request_id
            ):
                yield response

Related Pages

Implements Principle
