Implementation: mlc-ai/mlc-llm MLCEngine.__init__
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, LLM_Inference |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for initializing a synchronous inference engine that loads compiled model artifacts and provides a blocking API for LLM inference, provided by MLC-LLM.
Description
MLCEngine.__init__ constructs a synchronous LLM inference engine. It extends MLCEngineBase (passing "sync" as the engine kind) and attaches the Chat and Completion proxy objects that expose OpenAI-compatible interfaces. Internally, the base class performs model resolution, device detection, optional JIT compilation, tokenizer loading, background thread creation, and engine configuration. Once the constructor returns, the engine is fully initialized and ready to serve blocking inference requests through engine.chat.completions.create() or engine.completions.create().
The engine supports three preset modes: "local" (low concurrency, up to 4 sequences), "interactive" (single sequence at a time), and "server" (maximized GPU memory utilization for high throughput). These presets automatically configure max_num_sequence, max_total_sequence_length, and prefill_chunk_size.
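As a rough illustration, the preset behavior described above can be sketched as plain data plus a toy selection heuristic. This is not MLC-LLM code: the concurrency figures come from the description above, and the actual limits (max_num_sequence, max_total_sequence_length, prefill_chunk_size) are computed by the engine at initialization time from the model and GPU.

```python
# Illustrative sketch only: real values are derived by the engine at init time.
PRESET_HINTS = {
    "local": {
        "max_concurrent_sequences": 4,
        "goal": "low-concurrency local use",
    },
    "interactive": {
        "max_concurrent_sequences": 1,
        "goal": "lowest latency, one request at a time",
    },
    "server": {
        "max_concurrent_sequences": None,  # maximized from GPU memory
        "goal": "high throughput via maximized GPU memory utilization",
    },
}

def suggest_mode(expected_concurrency: int, serving: bool = False) -> str:
    """Toy heuristic for picking a preset from the expected workload."""
    if serving:
        return "server"
    return "interactive" if expected_concurrency <= 1 else "local"
```

For example, `suggest_mode(1)` picks `"interactive"`, while a batch script issuing a few parallel requests would land on `"local"`.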
Usage
Use MLCEngine for programmatic Python inference when you need a simple blocking API. It is ideal for scripts, notebooks, batch processing, and applications where you do not need async concurrency. For async usage (e.g., in web servers), use AsyncMLCEngine instead.
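For comparison, the async variant might be used as sketched below. This assumes mlc_llm is installed with a compiled model available, and that AsyncMLCEngine accepts the same constructor arguments as MLCEngine; the import is kept inside the coroutine so the sketch stays self-contained.

```python
import asyncio

async def main() -> None:
    # Assumes mlc_llm is installed and the model path below exists.
    from mlc_llm.serve import AsyncMLCEngine

    engine = AsyncMLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")
    response = await engine.chat.completions.create(
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)
    engine.terminate()

# Drive it with: asyncio.run(main())
```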
Code Reference
Source Location
- Repository: MLC-LLM
- File: python/mlc_llm/serve/engine.py (lines 1460-1478)
Signature
```python
class MLCEngine(engine_base.MLCEngineBase):
    def __init__(
        self,
        model: str,
        device: Union[str, Device] = "auto",
        *,
        model_lib: Optional[str] = None,
        mode: Literal["local", "interactive", "server"] = "local",
        engine_config: Optional[EngineConfig] = None,
        enable_tracing: bool = False,
    ) -> None:
```
Import
```python
from mlc_llm.serve import MLCEngine
# or
from mlc_llm.serve.engine import MLCEngine
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | A path to mlc-chat-config.json, an MLC model directory, or a Hugging Face repository link pointing to an MLC-compiled model. |
| device | Union[str, Device] | No | The device used to deploy the model (e.g., "cuda", "cuda:0", "metal", "vulkan"). Defaults to "auto", which auto-detects available GPUs. |
| model_lib | Optional[str] | No | The full path to the compiled model library file (e.g., a .so file). If not provided, the engine searches for a matching library or triggers JIT compilation. |
| mode | Literal["local", "interactive", "server"] | No | The engine mode that determines automatic configuration of batch sizes and sequence lengths. Defaults to "local". |
| engine_config | Optional[EngineConfig] | No | Additional configurable arguments for the MLC engine (e.g., max_num_sequence, tensor_parallel_shards). |
| enable_tracing | bool | No | Whether to enable event logging for request tracing. Defaults to False. |
Outputs
| Name | Type | Description |
|---|---|---|
| (instance) | MLCEngine | A fully initialized synchronous engine instance with .chat.completions and .completions interfaces ready for use. |
Usage Examples
Basic Usage
```python
from mlc_llm.serve import MLCEngine

# Initialize the engine with a model path
engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

# Use chat completions
response = engine.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Clean up
engine.terminate()
```
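Because terminate() must run even when inference raises, a small context-manager wrapper helps. The helper below is a generic sketch (managed_engine is not part of the MLC-LLM API) that works for any object exposing a terminate() method:

```python
from contextlib import contextmanager

@contextmanager
def managed_engine(factory, *args, **kwargs):
    """Build an engine via `factory` and guarantee terminate() on exit.

    In practice `factory` would be MLCEngine; any object with a
    terminate() method fits, so the pattern is shown library-agnostically.
    """
    engine = factory(*args, **kwargs)
    try:
        yield engine
    finally:
        engine.terminate()
```

With it, the example above becomes `with managed_engine(MLCEngine, model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC") as engine:` and cleanup happens automatically.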
Advanced Usage with Engine Config
```python
from mlc_llm.serve import MLCEngine
from mlc_llm.serve.config import EngineConfig

# Initialize with server mode and custom config
engine = MLCEngine(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    device="cuda:0",
    mode="server",
    engine_config=EngineConfig(
        max_num_sequence=16,
        tensor_parallel_shards=1,
    ),
    enable_tracing=True,
)

# Streaming chat completion
for chunk in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    stream=True,
    max_tokens=512,
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

engine.terminate()
```
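When the streamed output needs to be kept rather than printed, a small helper can assemble the deltas into one string. This is illustrative code, assuming only the chunk shape used in the loop above (choices[0].delta.content, which may be None on some chunks):

```python
def collect_stream(chunks) -> str:
    """Join non-empty delta fragments from a streamed chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g., the final one) may carry no content
            parts.append(delta)
    return "".join(parts)
```

Passing the generator returned by `engine.chat.completions.create(..., stream=True)` to this helper yields the full response text once the stream is exhausted.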