
Implementation:Mlc ai Mlc llm Router Serve

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Distributed_Serving
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete MLC-LLM tool for launching a coordinated deployment of multiple engine endpoints behind a single router entry point.

Description

The serve() function is the top-level entry point for starting a complete disaggregated or round-robin serving deployment. It performs three major steps:

  1. Instantiates the router: Creates a Router (or custom subclass) with the specified model, engine hosts/ports/GPUs, caching, and routing mode. This triggers engine server startup and NVSHMEM initialization.
  2. Registers the API endpoint: Creates a FastAPI application with a POST /v1/completions endpoint that handles both streaming and non-streaming OpenAI-compatible completion requests. The endpoint generates unique request IDs, delegates to router.handle_completion(), and formats responses as server-sent events (streaming) or aggregated JSON (non-streaming).
  3. Starts the HTTP server: Launches uvicorn to serve the FastAPI application on the specified router host and port, making the system ready to accept requests.

The function also configures CORS middleware and error handling for BadRequestError exceptions.
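The streaming half of step 2 can be illustrated with a small sketch. This is not the actual MLC-LLM implementation; the request-ID format, chunk shape, and helper names are assumptions about how an OpenAI-compatible completions endpoint typically frames server-sent events.

```python
import json
import uuid


def new_request_id() -> str:
    # A unique ID per request, generated before delegating to the
    # router (the "cmpl-" prefix is an illustrative assumption).
    return f"cmpl-{uuid.uuid4().hex}"


def to_sse(chunk: dict) -> str:
    # Frame one JSON chunk as a single server-sent event.
    return f"data: {json.dumps(chunk)}\n\n"


def sse_stream(chunks):
    # Yield each completion chunk as an SSE event, then the
    # OpenAI-style [DONE] terminator that ends the stream.
    for chunk in chunks:
        yield to_sse(chunk)
    yield "data: [DONE]\n\n"


events = list(sse_stream([{"choices": [{"text": "Hello"}]},
                          {"choices": [{"text": " world"}]}]))
```

Non-streaming requests skip this framing and return a single aggregated JSON body instead.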

Usage

Use serve() when you want to deploy a complete disaggregated serving system with a single function call. This is typically invoked from the MLC-LLM CLI or from application code that needs to start a production-ready inference server.

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/interface/router.py (Lines 17-129)

Signature

def serve(
    model: str,
    model_lib: Optional[str],
    router_host: str,
    router_port: int,
    endpoint_hosts: List[str],
    endpoint_ports: List[int],
    endpoint_num_gpus: List[int],
    enable_prefix_cache: bool,
    router_mode: Literal["disagg", "round-robin"] = "round-robin",
    pd_balance_factor: float = 0.0,
    router_type: Type[Router] = Router,
) -> None:

Import

from mlc_llm.interface.router import serve

I/O Contract

Inputs

Name Type Required Description
model str Yes Path or identifier of the model to serve. Passed through to the Router constructor and each engine endpoint.
model_lib Optional[str] Yes Path to the compiled model library. The argument itself must be passed, but its value may be None, in which case the library is auto-detected.
router_host str Yes Hostname or IP address for the router's HTTP server (e.g., "0.0.0.0").
router_port int Yes Port number for the router's HTTP server (e.g., 8000).
endpoint_hosts List[str] Yes List of hostnames/IPs for each backend engine endpoint.
endpoint_ports List[int] Yes List of port numbers for each backend engine endpoint. Must be the same length as endpoint_hosts.
endpoint_num_gpus List[int] Yes Number of GPUs for each backend engine endpoint. Must be the same length as endpoint_hosts.
enable_prefix_cache bool Yes Whether to enable radix-tree prefix caching on each engine.
router_mode Literal["disagg", "round-robin"] No Routing strategy. Defaults to "round-robin".
pd_balance_factor float No Controls prefill/decode work split in disaggregated mode. Defaults to 0.0.
router_type Type[Router] No The router class to instantiate. Defaults to Router. Pass a custom subclass to inject custom routing logic.

Outputs

Name Type Description
(none) None This function does not return. It blocks indefinitely running the uvicorn HTTP server. The server is stopped by terminating the process (e.g., Ctrl+C or SIGTERM).

Usage Examples

Basic Usage

from mlc_llm.interface.router import serve

# Launch a disaggregated serving deployment with 2 engine endpoints:
# - Engine 0 (prefill): GPU 0, port 8080
# - Engine 1 (decode): GPU 1, port 8081
# - Router gateway: port 8000
serve(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    model_lib=None,
    router_host="0.0.0.0",
    router_port=8000,
    endpoint_hosts=["127.0.0.1", "127.0.0.1"],
    endpoint_ports=[8080, 8081],
    endpoint_num_gpus=[1, 1],
    enable_prefix_cache=False,
    router_mode="disagg",
)
# This call blocks forever -- the server is now running.
# Send requests to http://0.0.0.0:8000/v1/completions
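Once serve() is running, clients talk to the router over the OpenAI completions wire format. The sketch below builds such a request with only the standard library; the field names follow the OpenAI /v1/completions schema, the model string must match what was served, and the actual HTTP call is commented out so the snippet stands alone without a running server.

```python
import json
import urllib.request

payload = {
    "model": "dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    "prompt": "What is disaggregated serving?",
    "max_tokens": 64,
    "stream": False,  # set True to receive server-sent events instead
}

req = urllib.request.Request(
    "http://0.0.0.0:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Requires the serve() deployment above to be running:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["text"])
```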

Custom Router Deployment

from mlc_llm.interface.router import serve
from mlc_llm.router import Router

class MyCustomRouter(Router):
    """Custom router with application-specific routing logic."""

    async def translate_request(self, request, request_id):
        # Custom dispatch logic here
        async for response in self._handle_completion_disagg(
            request, request_id
        ):
            yield response

# Deploy with the custom router
serve(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    model_lib=None,
    router_host="0.0.0.0",
    router_port=8000,
    endpoint_hosts=["127.0.0.1", "127.0.0.1", "127.0.0.1"],
    endpoint_ports=[8080, 8081, 8082],
    endpoint_num_gpus=[2, 2, 2],
    enable_prefix_cache=True,
    router_mode="disagg",
    pd_balance_factor=0.0,
    router_type=MyCustomRouter,
)

Related Pages

Implements Principle

Environment Links
