Implementation: mlc-ai/mlc-llm Serve
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Serving, Web_Services |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool provided by MLC-LLM for launching an HTTP server that wraps an inference engine and accepts REST API requests.
Description
The `serve` function is the main entry point for launching the MLC-LLM REST API server. It performs the following steps, in order:
- Engine Creation: Instantiates an `AsyncMLCEngine` with the specified model, device, mode, and engine configuration. The engine config is constructed from the individual parameters passed to `serve()` and includes memory utilization, batching limits, speculative decoding settings, and prefix cache configuration.
- Server Context Setup: Creates a `ServerContext` context manager and registers the model/engine pair, enabling endpoint handlers to look up the correct engine for incoming requests.
- FastAPI Application Assembly: Creates a FastAPI application and configures it with:
  - CORS middleware using the provided `allow_credentials`, `allow_origins`, `allow_methods`, and `allow_headers` parameters.
  - The OpenAI-compatible API router (`/v1/chat/completions`, `/v1/completions`, `/v1/models`).
  - A metrics router for monitoring endpoints.
  - A microserving router for disaggregated serving.
  - A debug router (only when `enable_debug=True`).
  - A global exception handler for `BadRequestError`.
- Server Start: Launches Uvicorn to serve the FastAPI application on the specified host and port.
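The engine-registration step above follows an ordinary context-manager pattern. A minimal stdlib-only sketch (the real `ServerContext` lives inside MLC-LLM; the registry dict and function names here are illustrative stand-ins):

```python
from contextlib import contextmanager

# Illustrative stand-in for ServerContext: a registry mapping model id -> engine.
_engines = {}

@contextmanager
def server_context():
    """Hold engine registrations for the server's lifetime; clear on shutdown."""
    try:
        yield _engines
    finally:
        _engines.clear()

def lookup_engine(model_id):
    """What an endpoint handler would call to route a request to its engine."""
    return _engines.get(model_id)

with server_context() as engines:
    engines["my-model"] = object()  # register the model/engine pair
    assert lookup_engine("my-model") is not None
# Once the context exits, the registry is empty again.
assert lookup_engine("my-model") is None
```

The context-manager shape guarantees cleanup even if the server loop raises, which is why registration is tied to the server's lifetime rather than done ad hoc.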
Usage
Use this function to start the MLC-LLM REST API server from Python code or via the CLI (`mlc_llm serve`). It is the primary deployment mechanism for exposing an MLC-LLM model as an OpenAI-compatible HTTP API.
Code Reference
Source Location
- Repository: MLC-LLM
- File: `python/mlc_llm/interface/serve.py` (lines 23-110)
Signature
```python
def serve(
    model: str,
    device: str,
    model_lib: Optional[str],
    mode: Literal["local", "interactive", "server"],
    enable_debug: bool,
    additional_models: List[Union[str, Tuple[str, str]]],
    tensor_parallel_shards: Optional[int],
    pipeline_parallel_stages: Optional[int],
    opt: Optional[str],
    max_num_sequence: Optional[int],
    max_total_sequence_length: Optional[int],
    max_single_sequence_length: Optional[int],
    prefill_chunk_size: Optional[int],
    sliding_window_size: Optional[int],
    attention_sink_size: Optional[int],
    max_history_size: Optional[int],
    gpu_memory_utilization: Optional[float],
    speculative_mode: Literal["disable", "small_draft", "eagle", "medusa"],
    spec_draft_length: Optional[int],
    spec_tree_width: Optional[int],
    prefix_cache_mode: Literal["disable", "radix"],
    prefix_cache_max_num_recycling_seqs: Optional[int],
    prefill_mode: Literal["hybrid", "chunked"],
    enable_tracing: bool,
    host: str,
    port: int,
    allow_credentials: bool,
    allow_origins: Any,
    allow_methods: Any,
    allow_headers: Any,
) -> None:
    """Serve the model with the specified configuration."""
```
Import
```python
from mlc_llm.interface.serve import serve
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | `str` | Yes | Path to the model directory or HuggingFace model identifier. |
| device | `str` | Yes | Target device string, e.g., `"cuda"`, `"cuda:0"`, `"metal"`, `"auto"`. |
| model_lib | `Optional[str]` | No | Path to a pre-compiled model library. If `None`, JIT compilation is used. |
| mode | `Literal["local", "interactive", "server"]` | Yes | Engine mode preset controlling default batch size and sequence length limits. |
| enable_debug | `bool` | Yes | Whether to enable debug endpoints and allow `debug_config` in requests. |
| additional_models | `List[Union[str, Tuple[str, str]]]` | Yes | List of additional models to serve (for multi-model or speculative decoding setups). |
| tensor_parallel_shards | `Optional[int]` | No | Number of tensor parallelism shards. |
| pipeline_parallel_stages | `Optional[int]` | No | Number of pipeline parallelism stages. |
| opt | `Optional[str]` | No | Optimization flags for JIT compilation. |
| max_num_sequence | `Optional[int]` | No | Maximum concurrent sequence count (batch size). |
| max_total_sequence_length | `Optional[int]` | No | Maximum total tokens in the KV cache. |
| max_single_sequence_length | `Optional[int]` | No | Maximum length of a single sequence. |
| prefill_chunk_size | `Optional[int]` | No | Maximum tokens in a single prefill step. |
| sliding_window_size | `Optional[int]` | No | Sliding window attention size. |
| attention_sink_size | `Optional[int]` | No | Number of attention sink tokens. |
| max_history_size | `Optional[int]` | No | Maximum RNN state history for rollback. |
| gpu_memory_utilization | `Optional[float]` | No | Fraction of GPU memory to use (0 to 1). |
| speculative_mode | `Literal["disable", "small_draft", "eagle", "medusa"]` | Yes | Speculative decoding mode. |
| spec_draft_length | `Optional[int]` | No | Speculative draft length. |
| spec_tree_width | `Optional[int]` | No | Width of the speculative decoding tree. |
| prefix_cache_mode | `Literal["disable", "radix"]` | Yes | Prefix caching strategy. |
| prefix_cache_max_num_recycling_seqs | `Optional[int]` | No | Maximum number of recycling sequences in the prefix cache. |
| prefill_mode | `Literal["hybrid", "chunked"]` | Yes | Prefill strategy. |
| enable_tracing | `bool` | Yes | Whether to enable event tracing for requests. |
| host | `str` | Yes | Host address to bind the server (e.g., `"0.0.0.0"` or `"127.0.0.1"`). |
| port | `int` | Yes | Port number to bind the server. |
| allow_credentials | `bool` | Yes | Whether to allow credentials in CORS requests. |
| allow_origins | `Any` | Yes | Allowed origins for CORS (e.g., `["*"]` or a list of specific origins). |
| allow_methods | `Any` | Yes | Allowed HTTP methods for CORS (e.g., `["*"]`). |
| allow_headers | `Any` | Yes | Allowed headers for CORS (e.g., `["*"]`). |
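Per its `List[Union[str, Tuple[str, str]]]` annotation, each `additional_models` entry takes one of two shapes: a bare model path, or a `(model, model_lib)` pair. A small illustration (the paths are hypothetical):

```python
# Each entry is either a model path/identifier, or a (model, model_lib) pair.
additional_models = [
    "dist/models/draft-model-q4f16_1",                      # model only; library is JIT-compiled
    ("dist/models/eagle-head", "dist/libs/eagle-head.so"),  # model plus pre-compiled library
]

# Normalizing to (model, model_lib_or_None) pairs, as a consumer might:
normalized = [
    (entry, None) if isinstance(entry, str) else entry
    for entry in additional_models
]
```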
Outputs
| Name | Type | Description |
|---|---|---|
| (none) | `None` | The function blocks indefinitely, running the HTTP server until the process is terminated. It does not return a value. |
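Because the call blocks, embedding it in a larger program usually means running it in a child process. A sketch of that pattern, with a stand-in for the blocking call (starting a real engine requires a model and device):

```python
import multiprocessing
import time

def serve_forever():
    # Stand-in for a blocking server entry point such as serve(); never returns.
    while True:
        time.sleep(1)

if __name__ == "__main__":
    proc = multiprocessing.Process(target=serve_forever, daemon=True)
    proc.start()
    # ... interact with the server over HTTP here ...
    proc.terminate()  # shut the server down when done
    proc.join()
```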
Usage Examples
Basic Usage
```python
from mlc_llm.interface.serve import serve

# Start a local server on port 8000
serve(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    model_lib=None,
    mode="local",
    enable_debug=False,
    additional_models=[],
    tensor_parallel_shards=None,
    pipeline_parallel_stages=None,
    opt=None,
    max_num_sequence=None,
    max_total_sequence_length=None,
    max_single_sequence_length=None,
    prefill_chunk_size=None,
    sliding_window_size=None,
    attention_sink_size=None,
    max_history_size=None,
    gpu_memory_utilization=None,
    speculative_mode="disable",
    spec_draft_length=None,
    spec_tree_width=None,
    prefix_cache_mode="radix",
    prefix_cache_max_num_recycling_seqs=None,
    prefill_mode="hybrid",
    enable_tracing=False,
    host="127.0.0.1",
    port=8000,
    allow_credentials=False,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)
```
CLI Equivalent
```shell
# Launch the REST server via the MLC-LLM CLI
mlc_llm serve dist/models/Llama-2-7b-chat-hf-q4f16_1 \
    --device cuda \
    --mode server \
    --host 0.0.0.0 \
    --port 8080
```
Testing the Running Server
```shell
# Query the models endpoint
curl http://127.0.0.1:8000/v1/models

# Send a chat completion request
curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "dist/models/Llama-2-7b-chat-hf-q4f16_1",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```
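The same chat request can be issued from Python with only the standard library. The URL and model name below mirror the curl example above; building the request does not require a running server, only sending it does:

```python
import json
from urllib import request

def build_chat_request(base_url, model, content):
    """Build an OpenAI-style chat completion request for the running server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://127.0.0.1:8000",
    "dist/models/Llama-2-7b-chat-hf-q4f16_1",
    "Hello!",
)
# Sending requires the server to be up:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```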