# Implementation: mlc-ai/mlc-llm Router `serve()`
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Distributed_Serving |
| Last Updated | 2026-02-09 00:00 GMT |
## Overview
Concrete tool for launching a coordinated deployment of multiple engine endpoints behind a single router entry point, provided by MLC-LLM.
## Description

The `serve()` function is the top-level entry point for starting a complete disaggregated or round-robin serving deployment. It performs three major steps:

1. **Instantiates the router**: Creates a `Router` (or custom subclass) with the specified model, engine hosts/ports/GPUs, caching, and routing mode. This triggers engine server startup and NVSHMEM initialization.
2. **Registers the API endpoint**: Creates a FastAPI application with a `POST /v1/completions` endpoint that handles both streaming and non-streaming OpenAI-compatible completion requests. The endpoint generates unique request IDs, delegates to `router.handle_completion()`, and formats responses as server-sent events (streaming) or aggregated JSON (non-streaming).
3. **Starts the HTTP server**: Launches uvicorn to serve the FastAPI application on the specified router host and port, making the system ready to accept requests.

The function also configures CORS middleware and error handling for `BadRequestError` exceptions.
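In round-robin mode the router simply cycles requests across the configured endpoints in order. A minimal, self-contained sketch of that selection logic (the names `RoundRobinPicker` and `pick` are illustrative, not the actual MLC-LLM API):

```python
from itertools import cycle
from typing import List


class RoundRobinPicker:
    """Cycles through backend endpoint URLs in order (illustrative only)."""

    def __init__(self, hosts: List[str], ports: List[int]) -> None:
        if len(hosts) != len(ports):
            raise ValueError("endpoint_hosts and endpoint_ports must match in length")
        # cycle() wraps around to the first endpoint after the last one
        self._endpoints = cycle([f"http://{h}:{p}" for h, p in zip(hosts, ports)])

    def pick(self) -> str:
        # Each call returns the next endpoint in round-robin order.
        return next(self._endpoints)


picker = RoundRobinPicker(["127.0.0.1", "127.0.0.1"], [8080, 8081])
print([picker.pick() for _ in range(3)])
# → ['http://127.0.0.1:8080', 'http://127.0.0.1:8081', 'http://127.0.0.1:8080']
```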
## Usage
Use serve() when you want to deploy a complete disaggregated serving system with a single function call. This is typically invoked from the MLC-LLM CLI or from application code that needs to start a production-ready inference server.
## Code Reference

### Source Location

- Repository: MLC-LLM
- File: `python/mlc_llm/interface/router.py` (Lines 17-129)
### Signature

```python
def serve(
    model: str,
    model_lib: Optional[str],
    router_host: str,
    router_port: int,
    endpoint_hosts: List[str],
    endpoint_ports: List[int],
    endpoint_num_gpus: List[int],
    enable_prefix_cache: bool,
    router_mode: Literal["disagg", "round-robin"] = "round-robin",
    pd_balance_factor: float = 0.0,
    router_type: Type[Router] = Router,
) -> None:
```
### Import

```python
from mlc_llm.interface.router import serve
```
## I/O Contract

### Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| model | `str` | Yes | Path or identifier of the model to serve. Passed through to the `Router` constructor and each engine endpoint. |
| model_lib | `Optional[str]` | Yes | Path to the compiled model library. Can be `None` for auto-detection. |
| router_host | `str` | Yes | Hostname or IP address for the router's HTTP server (e.g., `"0.0.0.0"`). |
| router_port | `int` | Yes | Port number for the router's HTTP server (e.g., 8000). |
| endpoint_hosts | `List[str]` | Yes | List of hostnames/IPs for each backend engine endpoint. |
| endpoint_ports | `List[int]` | Yes | List of port numbers for each backend engine endpoint. Must be the same length as `endpoint_hosts`. |
| endpoint_num_gpus | `List[int]` | Yes | Number of GPUs for each backend engine endpoint. Must be the same length as `endpoint_hosts`. |
| enable_prefix_cache | `bool` | Yes | Whether to enable radix-tree prefix caching on each engine. |
| router_mode | `Literal["disagg", "round-robin"]` | No | Routing strategy. Defaults to `"round-robin"`. |
| pd_balance_factor | `float` | No | Controls the prefill/decode work split in disaggregated mode. Defaults to 0.0. |
| router_type | `Type[Router]` | No | The router class to instantiate. Defaults to `Router`; pass a custom subclass to inject custom routing logic. |
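The three per-endpoint lists travel together, so a caller can validate them before invoking `serve()` rather than failing mid-startup. A small stdlib-only sketch (the helper name `check_endpoint_args` is hypothetical, not part of MLC-LLM):

```python
from typing import List


def check_endpoint_args(
    hosts: List[str], ports: List[int], num_gpus: List[int]
) -> None:
    """Raise early if the per-endpoint lists disagree (hypothetical helper)."""
    if not (len(hosts) == len(ports) == len(num_gpus)):
        raise ValueError(
            f"endpoint lists must have equal length, got "
            f"{len(hosts)} hosts, {len(ports)} ports, {len(num_gpus)} GPU counts"
        )
    # Two engines cannot share the same host:port pair.
    if len(set(zip(hosts, ports))) != len(hosts):
        raise ValueError("duplicate host:port pair among endpoints")


check_endpoint_args(["127.0.0.1", "127.0.0.1"], [8080, 8081], [1, 1])  # OK
```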
### Outputs

| Name | Type | Description |
|---|---|---|
| (none) | `None` | This function does not return. It blocks indefinitely while running the uvicorn HTTP server; stop it by terminating the process (e.g., Ctrl+C or SIGTERM). |
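Because the call blocks, tests and process supervisors typically run it in a child process and terminate that process to stop the server. A stdlib-only sketch of the pattern, using a stand-in blocking function (`run_forever` and `launch_and_stop` are placeholders, not MLC-LLM APIs):

```python
import multiprocessing
import time


def run_forever() -> None:
    """Placeholder for a blocking server loop such as serve()."""
    while True:
        time.sleep(0.1)


def launch_and_stop(run_for: float = 0.3) -> bool:
    """Start the blocking loop in a child process, then terminate it."""
    proc = multiprocessing.Process(target=run_forever, daemon=True)
    proc.start()
    time.sleep(run_for)   # the server would be accepting requests here
    proc.terminate()      # equivalent to sending SIGTERM
    proc.join(timeout=5)
    return not proc.is_alive()


if __name__ == "__main__":
    assert launch_and_stop()
```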
## Usage Examples

### Basic Usage

```python
from mlc_llm.interface.router import serve

# Launch a disaggregated serving deployment with 2 engine endpoints:
# - Engine 0 (prefill): GPU 0, port 8080
# - Engine 1 (decode): GPU 1, port 8081
# - Router gateway: port 8000
serve(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    model_lib=None,
    router_host="0.0.0.0",
    router_port=8000,
    endpoint_hosts=["127.0.0.1", "127.0.0.1"],
    endpoint_ports=[8080, 8081],
    endpoint_num_gpus=[1, 1],
    enable_prefix_cache=False,
    router_mode="disagg",
)
# This call blocks forever -- the server is now running.
# Send requests to http://0.0.0.0:8000/v1/completions
```
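Streaming responses from the `/v1/completions` endpoint arrive as server-sent events: each chunk is a JSON payload on a `data:` line, and the stream ends with `data: [DONE]`. A minimal stdlib parser sketch for that wire format (assumes the standard OpenAI-style SSE framing; `parse_sse_chunks` is an illustrative helper, not an MLC-LLM API):

```python
import json
from typing import Iterable, List


def parse_sse_chunks(lines: Iterable[str]) -> List[dict]:
    """Collect the JSON payloads from an OpenAI-style SSE stream."""
    chunks = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":           # sentinel marking end of stream
            break
        chunks.append(json.loads(payload))
    return chunks


stream = [
    'data: {"choices": [{"text": "Hello"}]}',
    "",
    'data: {"choices": [{"text": " world"}]}',
    "",
    "data: [DONE]",
]
text = "".join(c["choices"][0]["text"] for c in parse_sse_chunks(stream))
print(text)  # → Hello world
```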
### Custom Router Deployment

```python
from mlc_llm.interface.router import serve
from mlc_llm.router import Router


class MyCustomRouter(Router):
    """Custom router with application-specific routing logic."""

    async def translate_request(self, request, request_id):
        # Custom dispatch logic here
        async for response in self._handle_completion_disagg(
            request, request_id
        ):
            yield response


# Deploy with the custom router
serve(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    model_lib=None,
    router_host="0.0.0.0",
    router_port=8000,
    endpoint_hosts=["127.0.0.1", "127.0.0.1", "127.0.0.1"],
    endpoint_ports=[8080, 8081, 8082],
    endpoint_num_gpus=[2, 2, 2],
    enable_prefix_cache=True,
    router_mode="disagg",
    pd_balance_factor=0.0,
    router_type=MyCustomRouter,
)
```