Implementation: mlc-ai/mlc-llm Serve
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Serving, Web_Services |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool provided by MLC-LLM for launching an HTTP server that wraps an inference engine and accepts REST API requests.
Description
The `serve` function is the main entry point for launching the MLC-LLM REST API server. It performs the following steps, in order:
- Engine Creation: Instantiates an `AsyncMLCEngine` with the specified model, device, mode, and engine configuration. The engine config is constructed from the individual parameters passed to `serve()` and includes memory utilization, batching limits, speculative decoding settings, and prefix cache configuration.
- Server Context Setup: Creates a `ServerContext` context manager and registers the model/engine pair, enabling endpoint handlers to look up the correct engine for incoming requests.
- FastAPI Application Assembly: Creates a FastAPI application and configures it with:
  - CORS middleware using the provided `allow_credentials`, `allow_origins`, `allow_methods`, and `allow_headers` parameters.
  - The OpenAI-compatible API router (`/v1/chat/completions`, `/v1/completions`, `/v1/models`).
  - A metrics router for monitoring endpoints.
  - A microserving router for disaggregated serving.
  - A debug router (only when `enable_debug=True`).
  - A global exception handler for `BadRequestError`.
- Server Start: Launches Uvicorn to serve the FastAPI application on the specified host and port.
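The engine-registration step above follows an ordinary context-manager pattern. A minimal stdlib-only sketch (the real `ServerContext` lives inside MLC-LLM; the registry dict and function names here are illustrative stand-ins):

```python
from contextlib import contextmanager

# Illustrative stand-in for ServerContext: a registry mapping model id -> engine.
_engines = {}

@contextmanager
def server_context():
    """Hold engine registrations for the server's lifetime; clear on shutdown."""
    try:
        yield _engines
    finally:
        _engines.clear()

def lookup_engine(model_id):
    """What an endpoint handler would call to route a request to its engine."""
    return _engines.get(model_id)

with server_context() as engines:
    engines["my-model"] = object()  # register the model/engine pair
    assert lookup_engine("my-model") is not None
# Once the context exits, the registry is empty again.
assert lookup_engine("my-model") is None
```

The context-manager shape guarantees cleanup even if the server loop raises, which is why registration is tied to the server's lifetime rather than done ad hoc.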
Usage
Use this function to start the MLC-LLM REST API server from Python code or via the CLI (`mlc_llm serve`). It is the primary deployment mechanism for exposing an MLC-LLM model as an OpenAI-compatible HTTP API.
Code Reference
Source Location
- Repository: MLC-LLM
- File: `python/mlc_llm/interface/serve.py` (lines 23-110)
Signature
```python
def serve(
    model: str,
    device: str,
    model_lib: Optional[str],
    mode: Literal["local", "interactive", "server"],
    enable_debug: bool,
    additional_models: List[Union[str, Tuple[str, str]]],
    tensor_parallel_shards: Optional[int],
    pipeline_parallel_stages: Optional[int],
    opt: Optional[str],
    max_num_sequence: Optional[int],
    max_total_sequence_length: Optional[int],
    max_single_sequence_length: Optional[int],
    prefill_chunk_size: Optional[int],
    sliding_window_size: Optional[int],
    attention_sink_size: Optional[int],
    max_history_size: Optional[int],
    gpu_memory_utilization: Optional[float],
    speculative_mode: Literal["disable", "small_draft", "eagle", "medusa"],
    spec_draft_length: Optional[int],
    spec_tree_width: Optional[int],
    prefix_cache_mode: Literal["disable", "radix"],
    prefix_cache_max_num_recycling_seqs: Optional[int],
    prefill_mode: Literal["hybrid", "chunked"],
    enable_tracing: bool,
    host: str,
    port: int,
    allow_credentials: bool,
    allow_origins: Any,
    allow_methods: Any,
    allow_headers: Any,
) -> None:
    """Serve the model with the specified configuration."""
```
Import
```python
from mlc_llm.interface.serve import serve
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | `str` | Yes | Path to the model directory or HuggingFace model identifier. |
| device | `str` | Yes | Target device string, e.g., `"cuda"`, `"cuda:0"`, `"metal"`, `"auto"`. |
| model_lib | `Optional[str]` | No | Path to a pre-compiled model library. If `None`, JIT compilation is used. |
| mode | `Literal["local", "interactive", "server"]` | Yes | Engine mode preset controlling default batch size and sequence length limits. |
| enable_debug | `bool` | Yes | Whether to enable debug endpoints and allow `debug_config` in requests. |
| additional_models | `List[Union[str, Tuple[str, str]]]` | Yes | List of additional models to serve (for multi-model or speculative decoding setups). |
| tensor_parallel_shards | `Optional[int]` | No | Number of tensor parallelism shards. |
| pipeline_parallel_stages | `Optional[int]` | No | Number of pipeline parallelism stages. |
| opt | `Optional[str]` | No | Optimization flags for JIT compilation. |
| max_num_sequence | `Optional[int]` | No | Maximum concurrent sequence count (batch size). |
| max_total_sequence_length | `Optional[int]` | No | Maximum total tokens in the KV cache. |
| max_single_sequence_length | `Optional[int]` | No | Maximum length of a single sequence. |
| prefill_chunk_size | `Optional[int]` | No | Maximum tokens in a single prefill step. |
| sliding_window_size | `Optional[int]` | No | Sliding window attention size. |
| attention_sink_size | `Optional[int]` | No | Number of attention sink tokens. |
| max_history_size | `Optional[int]` | No | Maximum RNN state history for rollback. |
| gpu_memory_utilization | `Optional[float]` | No | Fraction of GPU memory to use (0 to 1). |
| speculative_mode | `Literal["disable", "small_draft", "eagle", "medusa"]` | Yes | Speculative decoding mode. |
| spec_draft_length | `Optional[int]` | No | Speculative draft length. |
| spec_tree_width | `Optional[int]` | No | Width of the speculative decoding tree. |
| prefix_cache_mode | `Literal["disable", "radix"]` | Yes | Prefix caching strategy. |
| prefix_cache_max_num_recycling_seqs | `Optional[int]` | No | Maximum number of recycling sequences in the prefix cache. |
| prefill_mode | `Literal["hybrid", "chunked"]` | Yes | Prefill strategy. |
| enable_tracing | `bool` | Yes | Whether to enable event tracing for requests. |
| host | `str` | Yes | Host address to bind the server (e.g., `"0.0.0.0"` or `"127.0.0.1"`). |
| port | `int` | Yes | Port number to bind the server. |
| allow_credentials | `bool` | Yes | Whether to allow credentials in CORS requests. |
| allow_origins | `Any` | Yes | Allowed origins for CORS (e.g., `["*"]` or a list of specific origins). |
| allow_methods | `Any` | Yes | Allowed HTTP methods for CORS (e.g., `["*"]`). |
| allow_headers | `Any` | Yes | Allowed headers for CORS (e.g., `["*"]`). |
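Per its `List[Union[str, Tuple[str, str]]]` annotation, each `additional_models` entry takes one of two shapes: a bare model path, or a `(model, model_lib)` pair. A small illustration (the paths are hypothetical):

```python
# Each entry is either a model path/identifier, or a (model, model_lib) pair.
additional_models = [
    "dist/models/draft-model-q4f16_1",                      # model only; library is JIT-compiled
    ("dist/models/eagle-head", "dist/libs/eagle-head.so"),  # model plus pre-compiled library
]

# Normalizing to (model, model_lib_or_None) pairs, as a consumer might:
normalized = [
    (entry, None) if isinstance(entry, str) else entry
    for entry in additional_models
]
```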
Outputs
| Name | Type | Description |
|---|---|---|
| (none) | `None` | The function blocks indefinitely, running the HTTP server until the process is terminated. It does not return a value. |
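Because the call blocks, embedding it in a larger program usually means running it in a child process. A sketch of that pattern, with a stand-in for the blocking call (starting a real engine requires a model and device):

```python
import multiprocessing
import time

def serve_forever():
    # Stand-in for a blocking server entry point such as serve(); never returns.
    while True:
        time.sleep(1)

if __name__ == "__main__":
    proc = multiprocessing.Process(target=serve_forever, daemon=True)
    proc.start()
    # ... interact with the server over HTTP here ...
    proc.terminate()  # shut the server down when done
    proc.join()
```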
Usage Examples
Basic Usage
```python
from mlc_llm.interface.serve import serve

# Start a local server on port 8000
serve(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    model_lib=None,
    mode="local",
    enable_debug=False,
    additional_models=[],
    tensor_parallel_shards=None,
    pipeline_parallel_stages=None,
    opt=None,
    max_num_sequence=None,
    max_total_sequence_length=None,
    max_single_sequence_length=None,
    prefill_chunk_size=None,
    sliding_window_size=None,
    attention_sink_size=None,
    max_history_size=None,
    gpu_memory_utilization=None,
    speculative_mode="disable",
    spec_draft_length=None,
    spec_tree_width=None,
    prefix_cache_mode="radix",
    prefix_cache_max_num_recycling_seqs=None,
    prefill_mode="hybrid",
    enable_tracing=False,
    host="127.0.0.1",
    port=8000,
    allow_credentials=False,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)
```
CLI Equivalent
```shell
# Launch the REST server via the MLC-LLM CLI
mlc_llm serve dist/models/Llama-2-7b-chat-hf-q4f16_1 \
    --device cuda \
    --mode server \
    --host 0.0.0.0 \
    --port 8080
```
Testing the Running Server
```shell
# Query the models endpoint
curl http://127.0.0.1:8000/v1/models

# Send a chat completion request
curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "dist/models/Llama-2-7b-chat-hf-q4f16_1",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```
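The same chat request can be issued from Python with only the standard library. The URL and model name below mirror the curl example above; building the request does not require a running server, only sending it does:

```python
import json
from urllib import request

def build_chat_request(base_url, model, content):
    """Build an OpenAI-style chat completion request for the running server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://127.0.0.1:8000",
    "dist/models/Llama-2-7b-chat-hf-q4f16_1",
    "Hello!",
)
# Sending requires the server to be up:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```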