Workflow: mlc-ai/mlc-llm REST API Serving

Knowledge Sources	MLC-LLM, MLC-LLM Docs, REST API Guide
Domains	LLMs, Model_Serving, REST_API, OpenAI_Compatibility
Last Updated	2026-02-09 20:00 GMT

Overview

End-to-end process for deploying a compiled MLC-LLM model as an OpenAI-compatible REST API server with streaming support, continuous batching, and optional speculative decoding.

Description

This workflow covers launching an MLC-LLM model as a production-capable REST server. The server exposes OpenAI-compatible endpoints for chat completions and text completions, supports streaming responses via Server-Sent Events, handles concurrent requests through continuous batching, and provides monitoring via metrics endpoints. Advanced features include speculative decoding for faster generation, prefix caching for repeated prompts, tensor parallelism for multi-GPU inference, and CORS configuration for web frontends.

Key outputs:

Running HTTP server with OpenAI-compatible API endpoints
/v1/chat/completions and /v1/completions endpoints
/v1/models listing endpoint
/metrics Prometheus-compatible monitoring endpoint

Usage

Execute this workflow when you need to serve a compiled model over HTTP for integration with applications, web services, or any OpenAI-compatible client. This is the standard deployment path for server-side LLM inference, suitable for both development (local mode) and production (server mode) scenarios. Clients can use the official OpenAI SDK, raw HTTP requests, or frameworks like LangChain to interact with the server.

Execution Steps

Step 1: Prepare model artifacts

Ensure the model is available for serving, either as a pre-quantized Hugging Face model (referenced with the HF:// prefix for automatic download) or as locally compiled artifacts from the Model Compilation workflow. The model must include quantized weights and an mlc-chat-config.json file; a pre-compiled model library is optional. If no pre-compiled library is provided, MLC-LLM performs just-in-time (JIT) compilation on first load.

Key considerations:

Pre-quantized models from the mlc-ai Hugging Face organization work out of the box
JIT compilation adds startup latency but eliminates the need for manual compilation
For production, pre-compiling the model library is recommended for faster startup
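
The artifact requirements above can be checked before the server is started. The following is a minimal sketch, not part of MLC-LLM: the helper name check_model_reference is hypothetical, and it relies only on two facts from this step, the HF:// prefix and the required mlc-chat-config.json file. It does not inspect weight files or the model library, since their exact names are not specified here.

```python
from pathlib import Path


def check_model_reference(model: str) -> str:
    """Classify a model reference before handing it to the server.

    Hypothetical pre-flight helper (not an MLC-LLM API).
    Returns "remote" for HF:// references and "local" for a directory
    that contains mlc-chat-config.json.
    """
    if model.startswith("HF://"):
        # Remote references are downloaded by MLC-LLM itself; nothing to check locally.
        return "remote"

    model_dir = Path(model)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"model directory not found: {model_dir}")

    config_path = model_dir / "mlc-chat-config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"missing required file: {config_path}")

    return "local"
```

A call such as check_model_reference("HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC") returns "remote" without touching the network; a local directory without the config file raises FileNotFoundError.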

Step 2: Configure server mode

Select the server execution mode that matches the deployment scenario. The mode controls concurrency limits and the memory allocation strategy: local mode uses conservative settings suitable for development, interactive mode limits the server to a single concurrent request, and server mode maximizes GPU memory utilization and request concurrency for production workloads.

Key considerations:

server mode aggressively allocates KV cache memory for maximum throughput
local mode is the default and appropriate for single-user development
interactive mode provides the lowest latency for single-request scenarios

Step 3: Launch the server

Start the FastAPI-based REST server with the configured model, device, and mode settings. The server binds to the specified host and port and begins accepting HTTP requests. Additional configuration includes CORS settings for browser access, debug endpoints for tracing, and engine overrides for fine-tuning batch sizes, KV cache allocation, and GPU memory utilization.

Key considerations:

Default binding is 127.0.0.1:8000 (localhost only)
Set host to 0.0.0.0 for network-accessible deployment
Enable CORS if serving web browser clients
For multi-GPU inference, set tensor_parallel_shards in the engine overrides
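
The launch settings from Steps 2 and 3 can be assembled into a command line as sketched below. This is an illustration under stated assumptions: build_serve_command is a hypothetical helper, and the flag names (--mode, --host, --port, --device, --overrides) and the semicolon-separated key=value override format reflect the mlc_llm serve CLI as commonly documented; confirm them against mlc_llm serve --help for the installed version. CORS and debug flags are left out.

```python
from typing import Dict, List, Optional

VALID_MODES = ("local", "interactive", "server")


def build_serve_command(
    model: str,
    mode: str = "local",
    host: str = "127.0.0.1",
    port: int = 8000,
    device: Optional[str] = None,
    overrides: Optional[Dict[str, object]] = None,
) -> List[str]:
    """Build an argv list for launching the server (hypothetical helper).

    Flag names are assumptions about the mlc_llm serve CLI, not verified here.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")

    cmd = ["mlc_llm", "serve", model, "--mode", mode, "--host", host, "--port", str(port)]
    if device is not None:
        cmd += ["--device", device]
    if overrides:
        # Assumed format: "key1=value1;key2=value2"
        joined = ";".join(f"{key}={value}" for key, value in overrides.items())
        cmd += ["--overrides", joined]
    return cmd
```

For a network-accessible two-GPU deployment, build_serve_command(model, mode="server", host="0.0.0.0", overrides={"tensor_parallel_shards": 2}) yields a list that can be passed to subprocess.run. Returning a list rather than a shell string avoids quoting problems with the override value.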

Step 4: Configure advanced features

Optionally enable advanced serving features. Speculative decoding (small draft model, Eagle, or Medusa) accelerates generation by proposing several tokens per step, which the main model then verifies. Prefix caching (radix mode) reuses the KV cache across requests that share a common prefix, such as a system prompt. Event tracing captures per-request timing data in Chrome Trace format for performance analysis.

Key considerations:

Speculative decoding requires a compatible draft model or Medusa/Eagle heads
Prefix caching is most effective with repeated system prompts across requests
Tracing adds overhead and should be disabled in production
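
The optional features above map naturally onto extra launch arguments. The sketch below is an assumption-laden illustration: advanced_feature_args is a hypothetical helper, and the flag names (--speculative-mode, --prefix-cache-mode, --enable-tracing) and their accepted values are recalled from the mlc_llm serve CLI rather than taken from this page, so verify them with mlc_llm serve --help. How the draft model itself is supplied is not covered.

```python
from typing import List

# Assumed value sets; check the installed CLI before relying on them.
SPECULATIVE_MODES = ("disable", "small_draft", "eagle", "medusa")
PREFIX_CACHE_MODES = ("disable", "radix")


def advanced_feature_args(
    speculative_mode: str = "disable",
    prefix_cache_mode: str = "radix",
    enable_tracing: bool = False,
) -> List[str]:
    """Return extra CLI arguments for the optional serving features.

    Hypothetical helper; the flag names are assumptions.
    """
    if speculative_mode not in SPECULATIVE_MODES:
        raise ValueError(f"unknown speculative mode: {speculative_mode!r}")
    if prefix_cache_mode not in PREFIX_CACHE_MODES:
        raise ValueError(f"unknown prefix cache mode: {prefix_cache_mode!r}")

    args = [
        "--speculative-mode", speculative_mode,
        "--prefix-cache-mode", prefix_cache_mode,
    ]
    if enable_tracing:
        # Tracing adds overhead, so it stays off unless explicitly requested.
        args.append("--enable-tracing")
    return args
```

Validating the values up front turns a typo into an immediate error instead of a failed server start.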

Step 5: Integrate client applications

Connect client applications to the running server using the OpenAI-compatible API. Clients can send chat completion requests with message histories, configure generation parameters (temperature, top_p, max_tokens, stop sequences), and receive responses in either streaming or non-streaming mode. The server also supports function calling (tool use) for structured outputs.

Key considerations:

The official OpenAI Python and Node.js SDKs work once base_url points at the MLC server
Streaming responses use Server-Sent Events (SSE) protocol
Multiple models can be served simultaneously using the additional-models flag
LangChain, LlamaIndex, and other frameworks integrate via the OpenAI-compatible interface
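
For clients that use raw HTTP rather than an SDK, the request and streaming flow can be sketched with the standard library alone. This is a minimal sketch assuming the server follows the OpenAI chat completions wire format (JSON body, SSE lines prefixed with "data:", and a final "data: [DONE]" marker). The function names build_chat_request and iter_stream_text are hypothetical, and the model name passed in must match whatever the server was launched with.

```python
import json
import urllib.request
from typing import Dict, Iterable, Iterator, List


def build_chat_request(
    base_url: str,
    model: str,
    messages: List[Dict[str, str]],
    stream: bool = True,
    **params: object,
) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions (hypothetical helper).

    Extra keyword arguments (temperature, top_p, max_tokens, stop, ...)
    are copied into the JSON body unchanged.
    """
    payload = {"model": model, "messages": messages, "stream": stream, **params}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from an SSE stream of chat completion chunks."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Against a running server, the response object returned by urllib.request.urlopen(request) can be passed directly to iter_stream_text, since it iterates line by line as bytes. With the OpenAI SDK the same flow reduces to setting base_url to the server address plus /v1 and calling the chat completions method with stream=True.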
