
Implementation: mlc-ai/mlc-llm MLCEngine.__init__

From Leeroopedia


Knowledge Sources

  • Domains: Deep_Learning, LLM_Inference
  • Last Updated: 2026-02-09 00:00 GMT

Overview

MLCEngine.__init__ is MLC-LLM's entry point for initializing a synchronous inference engine: it loads compiled model artifacts and provides a blocking API for LLM inference.

Description

MLCEngine.__init__ constructs a synchronous LLM inference engine. It extends MLCEngineBase (passing "sync" as the engine kind) and attaches the Chat and Completion proxy objects that expose OpenAI-compatible interfaces. Internally, the base class performs model resolution, device detection, optional JIT compilation, tokenizer loading, background thread creation, and engine configuration. Once the constructor returns, the engine is fully initialized and ready to serve blocking inference requests through engine.chat.completions.create() or engine.completions.create().

The engine supports three preset modes: "local" (low concurrency, up to 4 sequences), "interactive" (single sequence at a time), and "server" (maximized GPU memory utilization for high throughput). These presets automatically configure max_num_sequence, max_total_sequence_length, and prefill_chunk_size.
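The preset behavior described above can be sketched in plain Python. Only the max_num_sequence values for "local" (up to 4 sequences) and "interactive" (a single sequence) come from this page; the dictionary layout and the None placeholder for "server" are illustrative assumptions, not MLC-LLM's actual implementation (the real presets also set max_total_sequence_length and prefill_chunk_size).

```python
# Illustrative mapping from mode preset to engine settings.
# Grounded values: "local" allows up to 4 concurrent sequences and
# "interactive" allows one. "server" derives its limits from available
# GPU memory at runtime, modeled here as None (hypothetical placeholder).
MODE_PRESETS = {
    "local": {"max_num_sequence": 4},
    "interactive": {"max_num_sequence": 1},
    "server": {"max_num_sequence": None},
}

def resolve_preset(mode: str) -> dict:
    """Return the preset settings for a mode, rejecting unknown modes."""
    if mode not in MODE_PRESETS:
        raise ValueError(f"unknown mode: {mode!r}")
    return MODE_PRESETS[mode]

print(resolve_preset("interactive"))  # {'max_num_sequence': 1}
```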

Usage

Use MLCEngine for programmatic Python inference when you need a simple blocking API. It is ideal for scripts, notebooks, batch processing, and applications where you do not need async concurrency. For async usage (e.g., in web servers), use AsyncMLCEngine instead.
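To see why the blocking API does not suit async servers, consider this self-contained sketch. blocking_generate is a hypothetical stand-in for an engine call, not an MLC-LLM API: a blocking call stalls an event loop unless it is pushed onto a worker thread, which is exactly the boilerplate AsyncMLCEngine avoids.

```python
import asyncio
import time

def blocking_generate(prompt: str) -> str:
    # Stand-in for a blocking engine.chat.completions.create() call;
    # hypothetical placeholder, not MLC-LLM code.
    time.sleep(0.05)
    return f"reply:{prompt}"

async def handle_request(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays
    # responsive; AsyncMLCEngine makes this workaround unnecessary.
    return await asyncio.to_thread(blocking_generate, prompt)

async def main() -> list:
    # Requests overlap instead of queuing behind one blocking call.
    return await asyncio.gather(handle_request("a"), handle_request("b"))

print(asyncio.run(main()))  # ['reply:a', 'reply:b']
```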

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/serve/engine.py (lines 1460-1478)

Signature

class MLCEngine(engine_base.MLCEngineBase):
    def __init__(
        self,
        model: str,
        device: Union[str, Device] = "auto",
        *,
        model_lib: Optional[str] = None,
        mode: Literal["local", "interactive", "server"] = "local",
        engine_config: Optional[EngineConfig] = None,
        enable_tracing: bool = False,
    ) -> None:

Import

from mlc_llm.serve import MLCEngine
# or
from mlc_llm.serve.engine import MLCEngine

I/O Contract

Inputs

  • model (str, required) — A path to mlc-chat-config.json, an MLC model directory, or a Hugging Face repository link pointing to an MLC-compiled model.
  • device (Union[str, Device], optional; default "auto") — The device used to deploy the model (e.g., "cuda", "cuda:0", "metal", "vulkan"). "auto" auto-detects available GPUs.
  • model_lib (Optional[str], optional) — The full path to the compiled model library file (e.g., a .so file). If not provided, the engine searches for a matching library or triggers JIT compilation.
  • mode (Literal["local", "interactive", "server"], optional; default "local") — The engine mode that determines automatic configuration of batch sizes and sequence lengths.
  • engine_config (Optional[EngineConfig], optional) — Additional configurable arguments for the MLC engine (e.g., max_num_sequence, tensor_parallel_shards).
  • enable_tracing (bool, optional; default False) — Whether to enable event logging for request tracing.
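The device strings accepted by the device parameter follow a "kind" or "kind:index" shape. The helper below is a hypothetical illustration of that format, not MLC-LLM's actual device parser; an omitted index is treated here as device 0.

```python
def parse_device(device: str = "auto") -> tuple:
    # Hypothetical helper (not MLC-LLM code) illustrating the
    # "kind" or "kind:index" device strings the `device` parameter
    # accepts, e.g. "cuda", "cuda:0", "metal", "vulkan".
    kind, _, index = device.partition(":")
    return kind, int(index) if index else 0

print(parse_device("cuda:1"))  # ('cuda', 1)
print(parse_device("metal"))   # ('metal', 0)
```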

Outputs

  • (instance) (MLCEngine) — A fully initialized synchronous engine instance with .chat.completions and .completions interfaces ready for use.

Usage Examples

Basic Usage

from mlc_llm.serve import MLCEngine

# Initialize the engine with a model path
engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

# Use chat completions
response = engine.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)

# Clean up
engine.terminate()

Advanced Usage with Engine Config

from mlc_llm.serve import MLCEngine
from mlc_llm.serve.config import EngineConfig

# Initialize with server mode and custom config
engine = MLCEngine(
    model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC",
    device="cuda:0",
    mode="server",
    engine_config=EngineConfig(
        max_num_sequence=16,
        tensor_parallel_shards=1,
    ),
    enable_tracing=True,
)

# Streaming chat completion
for chunk in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    stream=True,
    max_tokens=512,
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

engine.terminate()
