
Implementation:Mlc ai Mlc llm Router Serve

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Distributed_Serving
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete MLC-LLM tool for launching a coordinated deployment of multiple engine endpoints behind a single router entry point.

Description

The serve() function is the top-level entry point for starting a complete disaggregated or round-robin serving deployment. It performs three major steps:

  1. Instantiates the router: Creates a Router (or custom subclass) with the specified model, engine hosts/ports/GPUs, caching, and routing mode. This triggers engine server startup and NVSHMEM initialization.
  2. Registers the API endpoint: Creates a FastAPI application with a POST /v1/completions endpoint that handles both streaming and non-streaming OpenAI-compatible completion requests. The endpoint generates unique request IDs, delegates to router.handle_completion(), and formats responses as server-sent events (streaming) or aggregated JSON (non-streaming).
  3. Starts the HTTP server: Launches uvicorn to serve the FastAPI application on the specified router host and port, making the system ready to accept requests.

The function also configures CORS middleware and error handling for BadRequestError exceptions.
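The streaming half of step 2 can be illustrated with a small sketch. This is not the actual MLC-LLM implementation; the request-ID format, chunk shape, and helper names are assumptions about how an OpenAI-compatible completions endpoint typically frames server-sent events.

```python
import json
import uuid


def new_request_id() -> str:
    # A unique ID per request, generated before delegating to the
    # router (the "cmpl-" prefix is an illustrative assumption).
    return f"cmpl-{uuid.uuid4().hex}"


def to_sse(chunk: dict) -> str:
    # Frame one JSON chunk as a single server-sent event.
    return f"data: {json.dumps(chunk)}\n\n"


def sse_stream(chunks):
    # Yield each completion chunk as an SSE event, then the
    # OpenAI-style [DONE] terminator that ends the stream.
    for chunk in chunks:
        yield to_sse(chunk)
    yield "data: [DONE]\n\n"


events = list(sse_stream([{"choices": [{"text": "Hello"}]},
                          {"choices": [{"text": " world"}]}]))
```

Non-streaming requests skip this framing and return a single aggregated JSON body instead.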

Usage

Use serve() when you want to deploy a complete disaggregated serving system with a single function call. This is typically invoked from the MLC-LLM CLI or from application code that needs to start a production-ready inference server.

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/interface/router.py (Lines 17-129)

Signature

def serve(
    model: str,
    model_lib: Optional[str],
    router_host: str,
    router_port: int,
    endpoint_hosts: List[str],
    endpoint_ports: List[int],
    endpoint_num_gpus: List[int],
    enable_prefix_cache: bool,
    router_mode: Literal["disagg", "round-robin"] = "round-robin",
    pd_balance_factor: float = 0.0,
    router_type: Type[Router] = Router,
) -> None:

Import

from mlc_llm.interface.router import serve

I/O Contract

Inputs

Name Type Required Description
model str Yes Path or identifier of the model to serve. Passed through to the Router constructor and each engine endpoint.
model_lib Optional[str] Yes Path to the compiled model library. The argument itself must be passed, but its value may be None, in which case the library is auto-detected.
router_host str Yes Hostname or IP address for the router's HTTP server (e.g., "0.0.0.0").
router_port int Yes Port number for the router's HTTP server (e.g., 8000).
endpoint_hosts List[str] Yes List of hostnames/IPs for each backend engine endpoint.
endpoint_ports List[int] Yes List of port numbers for each backend engine endpoint. Must be the same length as endpoint_hosts.
endpoint_num_gpus List[int] Yes Number of GPUs for each backend engine endpoint. Must be the same length as endpoint_hosts.
enable_prefix_cache bool Yes Whether to enable radix-tree prefix caching on each engine.
router_mode Literal["disagg", "round-robin"] No Routing strategy. Defaults to "round-robin".
pd_balance_factor float No Controls prefill/decode work split in disaggregated mode. Defaults to 0.0.
router_type Type[Router] No The router class to instantiate. Defaults to Router. Pass a custom subclass to inject custom routing logic.

Outputs

Name Type Description
(none) None This function does not return. It blocks indefinitely running the uvicorn HTTP server. The server is stopped by terminating the process (e.g., Ctrl+C or SIGTERM).

Usage Examples

Basic Usage

from mlc_llm.interface.router import serve

# Launch a disaggregated serving deployment with 2 engine endpoints:
# - Engine 0 (prefill): GPU 0, port 8080
# - Engine 1 (decode): GPU 1, port 8081
# - Router gateway: port 8000
serve(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    model_lib=None,
    router_host="0.0.0.0",
    router_port=8000,
    endpoint_hosts=["127.0.0.1", "127.0.0.1"],
    endpoint_ports=[8080, 8081],
    endpoint_num_gpus=[1, 1],
    enable_prefix_cache=False,
    router_mode="disagg",
)
# This call blocks forever -- the server is now running.
# Send requests to http://0.0.0.0:8000/v1/completions
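Once serve() is running, clients talk to the router over the OpenAI completions wire format. The sketch below builds such a request with only the standard library; the field names follow the OpenAI /v1/completions schema, the model string must match what was served, and the actual HTTP call is commented out so the snippet stands alone without a running server.

```python
import json
import urllib.request

payload = {
    "model": "dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    "prompt": "What is disaggregated serving?",
    "max_tokens": 64,
    "stream": False,  # set True to receive server-sent events instead
}

req = urllib.request.Request(
    "http://0.0.0.0:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Requires the serve() deployment above to be running:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["text"])
```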

Custom Router Deployment

from mlc_llm.interface.router import serve
from mlc_llm.router import Router

class MyCustomRouter(Router):
    """Custom router with application-specific routing logic."""

    async def translate_request(self, request, request_id):
        # Custom dispatch logic here
        async for response in self._handle_completion_disagg(
            request, request_id
        ):
            yield response

# Deploy with the custom router
serve(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    model_lib=None,
    router_host="0.0.0.0",
    router_port=8000,
    endpoint_hosts=["127.0.0.1", "127.0.0.1", "127.0.0.1"],
    endpoint_ports=[8080, 8081, 8082],
    endpoint_num_gpus=[2, 2, 2],
    enable_prefix_cache=True,
    router_mode="disagg",
    pd_balance_factor=0.0,
    router_type=MyCustomRouter,
)

Related Pages

Implements Principle

Environment Links
