Implementation: mlc-ai/mlc-llm `MLCEngineBase.__init__`
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Serving, Systems_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Base constructor that initializes the MLC-LLM inference engine, including its advanced serving optimizations.
Description
MLCEngineBase.__init__ is the base constructor shared by both AsyncMLCEngine (asynchronous) and MLCEngine (synchronous). It performs the complete engine initialization sequence:
- Config Validation: Checks that the `engine_config` fields do not conflict with the top-level `model`, `model_lib`, and `mode` arguments. Validates that `kv_cache_page_size` is 16.
- Model Resolution: Parses model paths and additional models into `ModelInfo` objects. If a pre-compiled model library is provided, it is verified to exist on disk. Otherwise, JIT compilation is triggered via `mlc_llm.interface.jit.jit()`.
- Device Detection: If the device is given as a string (e.g., `"cuda"`, `"auto"`), it is resolved to a `tvm.runtime.Device` object via automatic GPU detection.
- Model Config Loading: Reads `mlc-chat-config.json` for each model to populate model configuration dictionaries and extract the conversation template.
- Engine State Creation: Initializes an `EngineState` object that manages async/sync stream callbacks and optional event tracing.
- C++ Engine Instantiation: Creates the underlying threaded C++ engine via TVM's global function registry (`mlc.serve.create_threaded_engine`). Binds FFI functions for `add_request`, `abort_request`, `run_background_loop`, `run_background_stream_back_loop`, `reload`, `init_threaded_engine`, `exit_background_loop`, `create_request`, `get_complete_engine_config`, `reset`, and `debug_call_func_on_all_worker`.
- Tokenizer Loading: Initializes a `Tokenizer` from the primary model path.
- Background Threads: Starts two daemon threads: one for the main inference loop and one for the stream-back loop that delivers generated tokens to the Python layer.
- Engine Reload: Serializes the finalized `EngineConfig` to JSON and passes it to the C++ engine's `reload` function, which allocates the KV cache, sets up the scheduler, and prepares the model for inference.
- Final Config Query: Retrieves the actual engine configuration (with auto-inferred values filled in) via `get_complete_engine_config`, and computes `max_input_sequence_length` as the minimum of `max_single_sequence_length` and `max_total_sequence_length`.
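The final-config step above reduces to a one-line derivation. The helper below is illustrative (not part of the MLC-LLM API); the field names follow the description:

```python
# Illustrative helper (not the real MLC-LLM API): derives the input budget
# exactly as described in the "Final Config Query" step above.
def derive_max_input_sequence_length(
    max_single_sequence_length: int,
    max_total_sequence_length: int,
) -> int:
    # The usable input length is capped by both the per-sequence context
    # window and the total KV-cache capacity.
    return min(max_single_sequence_length, max_total_sequence_length)
```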
Usage
This constructor is not called directly by users. Instead, use AsyncMLCEngine for async serving or MLCEngine for synchronous usage. Both classes inherit from MLCEngineBase and pass either "async" or "sync" as the kind parameter, which determines the callback mechanism used for streaming results.
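The inheritance pattern can be sketched as follows. The class names and bodies here are simplified stand-ins, not the real `AsyncMLCEngine`/`MLCEngine`, which layer full OpenAI-style chat APIs on top of the shared base:

```python
# Simplified sketch of the subclass pattern (stand-in classes, not the
# real MLC-LLM implementations).
class EngineBaseSketch:
    def __init__(self, kind: str, model: str) -> None:
        assert kind in ("async", "sync")
        self.kind = kind    # selects the streaming-callback mechanism
        self.model = model

class AsyncEngineSketch(EngineBaseSketch):
    def __init__(self, model: str) -> None:
        super().__init__(kind="async", model=model)

class SyncEngineSketch(EngineBaseSketch):
    def __init__(self, model: str) -> None:
        super().__init__(kind="sync", model=model)
```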
Code Reference
Source Location
- Repository: MLC-LLM
- File: `python/mlc_llm/serve/engine_base.py` (lines 568-652)
Signature
```python
class MLCEngineBase:
    def __init__(
        self,
        kind: Literal["async", "sync"],
        model: str,
        device: Union[str, tvm.runtime.Device],
        model_lib: Optional[str],
        mode: Literal["local", "interactive", "server"],
        engine_config: Optional[EngineConfig],
        enable_tracing: bool,
    ) -> None:
```
Import
```python
from mlc_llm.serve.engine_base import MLCEngineBase

# Or, more commonly, use the subclasses:
from mlc_llm.serve.engine import AsyncMLCEngine, MLCEngine
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| `kind` | `Literal["async", "sync"]` | Yes | Whether the engine operates in asynchronous or synchronous mode. Determines the callback mechanism for streaming results. |
| `model` | `str` | Yes | Path to the model directory containing `mlc-chat-config.json`, an MLC model identifier, or a HuggingFace repository URL. |
| `device` | `Union[str, tvm.runtime.Device]` | Yes | Target device for inference. Accepts string identifiers (`"cuda"`, `"cuda:0"`, `"metal"`, `"auto"`) or a `tvm.runtime.Device` object. |
| `model_lib` | `Optional[str]` | No | Path to the pre-compiled model library file (e.g., `.so`). If `None`, JIT compilation is triggered automatically. |
| `mode` | `Literal["local", "interactive", "server"]` | Yes | Engine mode preset. `"local"` sets the batch size to 4, `"interactive"` sets it to 1, and `"server"` auto-infers maximum capacity. |
| `engine_config` | `Optional[EngineConfig]` | No | Additional engine configuration. If `None`, a default `EngineConfig()` is created. Explicit fields override mode-based defaults. |
| `enable_tracing` | `bool` | Yes | Whether to enable the event trace recorder for Chrome-compatible tracing output. |
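As a rough illustration of how a device string such as `"cuda:0"` decomposes into a device type and index, a hypothetical parser might look like the sketch below. The real resolution goes through TVM's runtime and includes automatic GPU detection for `"auto"`; this stand-in only shows the string shape:

```python
from typing import Tuple

# Hypothetical parser (not the real MLC-LLM/TVM code): splits a device
# string into (device_type, device_id), defaulting the id to 0.
def parse_device_spec(spec: str) -> Tuple[str, int]:
    dev_type, _, dev_id = spec.partition(":")
    return dev_type, int(dev_id) if dev_id else 0
```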
Outputs
| Name | Type | Description |
|---|---|---|
| (initialized instance) | `MLCEngineBase` | The constructor initializes the engine in place. Key attributes set on the instance include `conv_template` (conversation template), `model_config_dicts` (raw model configs), `state` (`EngineState`), `tokenizer` (`Tokenizer`), `engine_config` (finalized `EngineConfig`), and `max_input_sequence_length` (`int`). |
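The attributes listed in the table can be summarized as a plain container; the field types below are simplified stand-ins for the real `Tokenizer`/`EngineState`/`EngineConfig` objects:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Illustrative container mirroring the attributes the constructor sets;
# types are simplified stand-ins, not the real MLC-LLM classes.
@dataclass
class InitializedEngineAttrs:
    conv_template: Any               # conversation template from mlc-chat-config.json
    model_config_dicts: List[Dict]   # raw per-model configuration dictionaries
    state: Any                       # EngineState managing stream callbacks
    tokenizer: Any                   # Tokenizer loaded from the primary model path
    engine_config: Any               # finalized EngineConfig after reload
    max_input_sequence_length: int   # min of single- and total-sequence limits
```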
Usage Examples
Basic Usage via AsyncMLCEngine
```python
import asyncio

from mlc_llm.serve.engine import AsyncMLCEngine

# AsyncMLCEngine calls MLCEngineBase.__init__ with kind="async"
engine = AsyncMLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="local",
)

# Use the engine for async chat completions
async def main():
    response = await engine.chat.completions.create(
        messages=[{"role": "user", "content": "Hello!"}],
        model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    )
    print(response.choices[0].message.content)
    engine.terminate()

asyncio.run(main())
```
With Advanced Configuration
```python
from mlc_llm.serve.config import EngineConfig
from mlc_llm.serve.engine import AsyncMLCEngine

# Configure speculative decoding and prefix caching
engine = AsyncMLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="server",
    engine_config=EngineConfig(
        gpu_memory_utilization=0.90,
        max_num_sequence=32,
        speculative_mode="eagle",
        spec_draft_length=4,
        prefix_cache_mode="radix",
        prefill_mode="hybrid",
    ),
    enable_tracing=True,
)
```
Synchronous Engine Usage
```python
from mlc_llm.serve.engine import MLCEngine

# MLCEngine calls MLCEngineBase.__init__ with kind="sync"
engine = MLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="interactive",
)

response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
)
print(response.choices[0].message.content)

engine.terminate()
```