Workflow:Mlc ai Mlc llm REST API Serving
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Serving, REST_API, OpenAI_Compatibility |
| Last Updated | 2026-02-09 20:00 GMT |
Overview
End-to-end process for deploying a compiled MLC-LLM model as an OpenAI-compatible REST API server with streaming support, continuous batching, and optional speculative decoding.
Description
This workflow covers launching an MLC-LLM model as a production-capable REST server. The server exposes OpenAI-compatible endpoints for chat completions and text completions, supports streaming responses via Server-Sent Events, handles concurrent requests through continuous batching, and provides monitoring via metrics endpoints. Advanced features include speculative decoding for faster generation, prefix caching for repeated prompts, tensor parallelism for multi-GPU inference, and CORS configuration for web frontends.
Key outputs:
- Running HTTP server with OpenAI-compatible API endpoints
- /v1/chat/completions and /v1/completions endpoints
- /v1/models listing endpoint
- /metrics Prometheus-compatible monitoring endpoint
Usage
Execute this workflow when you need to serve a compiled model over HTTP for integration with applications, web services, or any OpenAI-compatible client. This is the standard deployment path for server-side LLM inference, suitable for both development (local mode) and production (server mode) scenarios. Clients can use the official OpenAI SDK, raw HTTP requests, or frameworks like LangChain to interact with the server.
Execution Steps
Step 1: Prepare model artifacts
Ensure the model is available for serving, either as a pre-quantized HuggingFace model (using HF:// protocol for automatic download) or as locally compiled artifacts from the Model Compilation workflow. The model must include quantized weights, an mlc-chat-config.json, and optionally a pre-compiled model library. If no pre-compiled library is provided, MLC-LLM will perform just-in-time (JIT) compilation on first load.
Key considerations:
- Pre-quantized models from mlc-ai HuggingFace organization work out of the box
- JIT compilation adds startup latency but eliminates the need for manual compilation
- For production, pre-compiling the model library is recommended for faster startup
Step 2: Configure server mode
Select the appropriate server execution mode based on deployment scenario. The mode controls concurrency limits and memory allocation strategy. local mode uses conservative settings suitable for development, interactive mode limits to a single concurrent request, and server mode maximizes GPU memory utilization and request concurrency for production workloads.
Key considerations:
- server mode aggressively allocates KV cache memory for maximum throughput
- local mode is the default and appropriate for single-user development
- interactive mode provides the lowest latency for single-request scenarios
Step 3: Launch the server
Start the FastAPI-based REST server with the configured model, device, and mode settings. The server binds to the specified host and port and begins accepting HTTP requests. Additional configuration includes CORS settings for browser access, debug endpoints for tracing, and engine overrides for fine-tuning batch sizes, KV cache allocation, and GPU memory utilization.
Key considerations:
- Default binding is 127.0.0.1:8000 (localhost only)
- Set host to 0.0.0.0 for network-accessible deployment
- Enable CORS if serving web browser clients
- For multi-GPU, specify tensor_parallel_shards in overrides
Step 4: Configure advanced features
Optionally enable advanced serving features. Speculative decoding (small draft model, Eagle, or Medusa) accelerates generation by predicting multiple tokens per step. Prefix caching (radix mode) reuses KV cache across requests sharing common prefixes such as system prompts. Event tracing captures per-request timing data in Chrome Trace format for performance analysis.
Key considerations:
- Speculative decoding requires a compatible draft model or Medusa/Eagle heads
- Prefix caching is most effective with repeated system prompts across requests
- Tracing adds overhead and should be disabled in production
Step 5: Integrate client applications
Connect client applications to the running server using the OpenAI-compatible API. Clients can send chat completion requests with message histories, configure generation parameters (temperature, top_p, max_tokens, stop sequences), and receive responses in either streaming or non-streaming mode. The server also supports function calling (tool use) for structured outputs.
Key considerations:
- The official OpenAI Python/Node.js SDKs work by pointing base_url to the MLC server
- Streaming responses use Server-Sent Events (SSE) protocol
- Multiple models can be served simultaneously using the additional-models flag
- LangChain, LlamaIndex, and other frameworks integrate via the OpenAI-compatible interface