
Implementation:Mlc ai Mlc llm Serve

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Serving, Web_Services
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by MLC-LLM, for launching an HTTP server that wraps an inference engine and accepts REST API requests.

Description

The serve function is the main entry point for launching the MLC-LLM REST API server. It performs the following steps in order:

  1. Engine Creation: Instantiates an AsyncMLCEngine with the specified model, device, mode, and engine configuration. The engine config is constructed from the individual parameters passed to serve() and includes memory utilization, batching limits, speculative decoding settings, and prefix cache configuration.
  2. Server Context Setup: Creates a ServerContext context manager and registers the model/engine pair, enabling endpoint handlers to look up the correct engine for incoming requests.
  3. FastAPI Application Assembly: Creates a FastAPI application and configures it with:
    • CORS middleware using the provided allow_credentials, allow_origins, allow_methods, and allow_headers parameters.
    • OpenAI-compatible API router (/v1/chat/completions, /v1/completions, /v1/models).
    • Metrics router for monitoring endpoints.
    • Microserving router for disaggregated serving.
    • Debug router (conditionally, only when enable_debug=True).
    • A global exception handler for BadRequestError.
  4. Server Start: Launches Uvicorn to serve the FastAPI application on the specified host and port.

Usage

Use this function to start the MLC-LLM REST API server from Python code or via the CLI (mlc_llm serve). It is the primary deployment mechanism for exposing an MLC-LLM model as an OpenAI-compatible HTTP API.

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/interface/serve.py (Lines 23-110)

Signature

def serve(
    model: str,
    device: str,
    model_lib: Optional[str],
    mode: Literal["local", "interactive", "server"],
    enable_debug: bool,
    additional_models: List[Union[str, Tuple[str, str]]],
    tensor_parallel_shards: Optional[int],
    pipeline_parallel_stages: Optional[int],
    opt: Optional[str],
    max_num_sequence: Optional[int],
    max_total_sequence_length: Optional[int],
    max_single_sequence_length: Optional[int],
    prefill_chunk_size: Optional[int],
    sliding_window_size: Optional[int],
    attention_sink_size: Optional[int],
    max_history_size: Optional[int],
    gpu_memory_utilization: Optional[float],
    speculative_mode: Literal["disable", "small_draft", "eagle", "medusa"],
    spec_draft_length: Optional[int],
    spec_tree_width: Optional[int],
    prefix_cache_mode: Literal["disable", "radix"],
    prefix_cache_max_num_recycling_seqs: Optional[int],
    prefill_mode: Literal["hybrid", "chunked"],
    enable_tracing: bool,
    host: str,
    port: int,
    allow_credentials: bool,
    allow_origins: Any,
    allow_methods: Any,
    allow_headers: Any,
) -> None:
    """Serve the model with the specified configuration."""

Import

from mlc_llm.interface.serve import serve

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | Path to the model directory or HuggingFace model identifier. |
| device | str | Yes | Target device string, e.g., "cuda", "cuda:0", "metal", "auto". |
| model_lib | Optional[str] | No | Path to a pre-compiled model library. If None, JIT compilation is used. |
| mode | Literal["local", "interactive", "server"] | Yes | Engine mode preset controlling default batch size and sequence length limits. |
| enable_debug | bool | Yes | Whether to enable debug endpoints and allow debug_config in requests. |
| additional_models | List[Union[str, Tuple[str, str]]] | Yes | List of additional models to serve (for multi-model or speculative decoding setups). |
| tensor_parallel_shards | Optional[int] | No | Number of tensor parallelism shards. |
| pipeline_parallel_stages | Optional[int] | No | Number of pipeline parallelism stages. |
| opt | Optional[str] | No | Optimization flags for JIT compilation. |
| max_num_sequence | Optional[int] | No | Maximum concurrent sequence count (batch size). |
| max_total_sequence_length | Optional[int] | No | Maximum total tokens in KV cache. |
| max_single_sequence_length | Optional[int] | No | Maximum length of a single sequence. |
| prefill_chunk_size | Optional[int] | No | Maximum tokens in a single prefill step. |
| sliding_window_size | Optional[int] | No | Sliding window attention size. |
| attention_sink_size | Optional[int] | No | Number of attention sink tokens. |
| max_history_size | Optional[int] | No | Maximum RNN state history for rollback. |
| gpu_memory_utilization | Optional[float] | No | Fraction of GPU memory to use (0 to 1). |
| speculative_mode | Literal["disable", "small_draft", "eagle", "medusa"] | Yes | Speculative decoding mode. |
| spec_draft_length | Optional[int] | No | Speculative draft length. |
| spec_tree_width | Optional[int] | No | Width of the speculative decoding tree. |
| prefix_cache_mode | Literal["disable", "radix"] | Yes | Prefix caching strategy. |
| prefix_cache_max_num_recycling_seqs | Optional[int] | No | Maximum recycling sequences in prefix cache. |
| prefill_mode | Literal["hybrid", "chunked"] | Yes | Prefill strategy. |
| enable_tracing | bool | Yes | Whether to enable event tracing for requests. |
| host | str | Yes | Host address to bind the server (e.g., "0.0.0.0" or "127.0.0.1"). |
| port | int | Yes | Port number to bind the server. |
| allow_credentials | bool | Yes | Whether to allow credentials in CORS requests. |
| allow_origins | Any | Yes | Allowed origins for CORS (e.g., ["*"] or a list of specific origins). |
| allow_methods | Any | Yes | Allowed HTTP methods for CORS (e.g., ["*"]). |
| allow_headers | Any | Yes | Allowed headers for CORS (e.g., ["*"]). |

Outputs

| Name | Type | Description |
|---|---|---|
| (none) | None | The function blocks indefinitely, running the HTTP server until the process is terminated. It does not return a value. |
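Because the call blocks, callers that need to keep working (tests, orchestration scripts) usually launch the server in a child process instead of calling serve() directly. A minimal sketch using the CLI; `launch_server` is a hypothetical helper, and a compiled model plus an installed mlc_llm are assumed:

```python
import subprocess

def launch_server(model_path: str, port: int = 8000) -> subprocess.Popen:
    """Start `mlc_llm serve` in a child process so the caller is not blocked."""
    return subprocess.Popen(
        ["mlc_llm", "serve", model_path,
         "--host", "127.0.0.1", "--port", str(port)],
    )

# Requires mlc_llm to be installed and a compiled model on disk:
# proc = launch_server("dist/models/Llama-2-7b-chat-hf-q4f16_1")
# ... query http://127.0.0.1:8000/v1/models while it runs ...
# proc.terminate(); proc.wait()
```

Terminating the child process is the intended shutdown path, matching the contract above: the server runs until the process is killed.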

Usage Examples

Basic Usage

from mlc_llm.interface.serve import serve

# Start a local server on port 8000
serve(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    model_lib=None,
    mode="local",
    enable_debug=False,
    additional_models=[],
    tensor_parallel_shards=None,
    pipeline_parallel_stages=None,
    opt=None,
    max_num_sequence=None,
    max_total_sequence_length=None,
    max_single_sequence_length=None,
    prefill_chunk_size=None,
    sliding_window_size=None,
    attention_sink_size=None,
    max_history_size=None,
    gpu_memory_utilization=None,
    speculative_mode="disable",
    spec_draft_length=None,
    spec_tree_width=None,
    prefix_cache_mode="radix",
    prefix_cache_max_num_recycling_seqs=None,
    prefill_mode="hybrid",
    enable_tracing=False,
    host="127.0.0.1",
    port=8000,
    allow_credentials=False,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

CLI Equivalent

# Launch the REST server via the MLC-LLM CLI
mlc_llm serve dist/models/Llama-2-7b-chat-hf-q4f16_1 \
    --device cuda \
    --mode server \
    --host 0.0.0.0 \
    --port 8080

Testing the Running Server

# Query the models endpoint
curl http://127.0.0.1:8000/v1/models

# Send a chat completion request
curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "dist/models/Llama-2-7b-chat-hf-q4f16_1",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
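Because the endpoints are OpenAI-compatible, the official openai Python client can also be pointed at the local server instead of curl. A sketch, assuming the openai package is installed; `ask` is a hypothetical wrapper, and the api_key value is arbitrary since the local server does not check it:

```python
def ask(prompt: str, base_url: str = "http://127.0.0.1:8000/v1") -> str:
    """Send one chat message to a running MLC-LLM server and return the reply."""
    from openai import OpenAI
    # The client requires an api_key, but the local server ignores it.
    client = OpenAI(base_url=base_url, api_key="none")
    resp = client.chat.completions.create(
        model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Requires a running server started as shown above:
# print(ask("Hello!"))
```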

Related Pages

Implements Principle

Environment and Heuristic Links
