
Implementation: mlc-ai/mlc-llm MLCEngine.__init__

From Leeroopedia


Knowledge Sources

  • Domains: Deep_Learning, LLM_Inference
  • Last Updated: 2026-02-09 00:00 GMT

Overview

MLCEngine.__init__ is MLC-LLM's entry point for initializing a synchronous inference engine: it loads compiled model artifacts and provides a blocking API for LLM inference.

Description

MLCEngine.__init__ constructs a synchronous LLM inference engine. It extends MLCEngineBase (passing "sync" as the engine kind) and attaches the Chat and Completion proxy objects that expose OpenAI-compatible interfaces. Internally, the base class performs model resolution, device detection, optional JIT compilation, tokenizer loading, background thread creation, and engine configuration. Once the constructor returns, the engine is fully initialized and ready to serve blocking inference requests through engine.chat.completions.create() or engine.completions.create().

The engine supports three preset modes: "local" (low concurrency, up to 4 sequences), "interactive" (single sequence at a time), and "server" (maximized GPU memory utilization for high throughput). These presets automatically configure max_num_sequence, max_total_sequence_length, and prefill_chunk_size.
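The preset behavior described above can be sketched in plain Python. Only the max_num_sequence values for "local" (up to 4 sequences) and "interactive" (a single sequence) come from this page; the dictionary layout and the None placeholder for "server" are illustrative assumptions, not MLC-LLM's actual implementation (the real presets also set max_total_sequence_length and prefill_chunk_size).

```python
# Illustrative mapping from mode preset to engine settings.
# Grounded values: "local" allows up to 4 concurrent sequences and
# "interactive" allows one. "server" derives its limits from available
# GPU memory at runtime, modeled here as None (hypothetical placeholder).
MODE_PRESETS = {
    "local": {"max_num_sequence": 4},
    "interactive": {"max_num_sequence": 1},
    "server": {"max_num_sequence": None},
}

def resolve_preset(mode: str) -> dict:
    """Return the preset settings for a mode, rejecting unknown modes."""
    if mode not in MODE_PRESETS:
        raise ValueError(f"unknown mode: {mode!r}")
    return MODE_PRESETS[mode]

print(resolve_preset("interactive"))  # {'max_num_sequence': 1}
```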

Usage

Use MLCEngine for programmatic Python inference when you need a simple blocking API. It is ideal for scripts, notebooks, batch processing, and applications where you do not need async concurrency. For async usage (e.g., in web servers), use AsyncMLCEngine instead.
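To see why the blocking API does not suit async servers, consider this self-contained sketch. blocking_generate is a hypothetical stand-in for an engine call, not an MLC-LLM API: a blocking call stalls an event loop unless it is pushed onto a worker thread, which is exactly the boilerplate AsyncMLCEngine avoids.

```python
import asyncio
import time

def blocking_generate(prompt: str) -> str:
    # Stand-in for a blocking engine.chat.completions.create() call;
    # hypothetical placeholder, not MLC-LLM code.
    time.sleep(0.05)
    return f"reply:{prompt}"

async def handle_request(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays
    # responsive; AsyncMLCEngine makes this workaround unnecessary.
    return await asyncio.to_thread(blocking_generate, prompt)

async def main() -> list:
    # Requests overlap instead of queuing behind one blocking call.
    return await asyncio.gather(handle_request("a"), handle_request("b"))

print(asyncio.run(main()))  # ['reply:a', 'reply:b']
```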

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/serve/engine.py (lines 1460-1478)

Signature

class MLCEngine(engine_base.MLCEngineBase):
    def __init__(
        self,
        model: str,
        device: Union[str, Device] = "auto",
        *,
        model_lib: Optional[str] = None,
        mode: Literal["local", "interactive", "server"] = "local",
        engine_config: Optional[EngineConfig] = None,
        enable_tracing: bool = False,
    ) -> None:

Import

from mlc_llm.serve import MLCEngine
# or
from mlc_llm.serve.engine import MLCEngine

I/O Contract

Inputs

  • model (str, required) — A path to mlc-chat-config.json, an MLC model directory, or a Hugging Face repository link pointing to an MLC-compiled model.
  • device (Union[str, Device], optional; default "auto") — The device used to deploy the model (e.g., "cuda", "cuda:0", "metal", "vulkan"). "auto" auto-detects available GPUs.
  • model_lib (Optional[str], optional) — The full path to the compiled model library file (e.g., a .so file). If not provided, the engine searches for a matching library or triggers JIT compilation.
  • mode (Literal["local", "interactive", "server"], optional; default "local") — The engine mode that determines automatic configuration of batch sizes and sequence lengths.
  • engine_config (Optional[EngineConfig], optional) — Additional configurable arguments for the MLC engine (e.g., max_num_sequence, tensor_parallel_shards).
  • enable_tracing (bool, optional; default False) — Whether to enable event logging for request tracing.
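The device strings accepted by the device parameter follow a "kind" or "kind:index" shape. The helper below is a hypothetical illustration of that format, not MLC-LLM's actual device parser; an omitted index is treated here as device 0.

```python
def parse_device(device: str = "auto") -> tuple:
    # Hypothetical helper (not MLC-LLM code) illustrating the
    # "kind" or "kind:index" device strings the `device` parameter
    # accepts, e.g. "cuda", "cuda:0", "metal", "vulkan".
    kind, _, index = device.partition(":")
    return kind, int(index) if index else 0

print(parse_device("cuda:1"))  # ('cuda', 1)
print(parse_device("metal"))   # ('metal', 0)
```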

Outputs

  • (instance) (MLCEngine) — A fully initialized synchronous engine instance with .chat.completions and .completions interfaces ready for use.

Usage Examples

Basic Usage

from mlc_llm.serve import MLCEngine

# Initialize the engine with a model path
engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

# Use chat completions
response = engine.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Clean up
engine.terminate()

Advanced Usage with Engine Config

from mlc_llm.serve import MLCEngine
from mlc_llm.serve.config import EngineConfig

# Initialize with server mode and custom config
engine = MLCEngine(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    device="cuda:0",
    mode="server",
    engine_config=EngineConfig(
        max_num_sequence=16,
        tensor_parallel_shards=1,
    ),
    enable_tracing=True,
)

# Streaming chat completion
for chunk in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    stream=True,
    max_tokens=512,
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

engine.terminate()
