Implementation: mlc-ai/mlc-llm `MLCEngineBase.__init__`
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Serving, Systems_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Base constructor that initializes the MLC-LLM inference engine, including its advanced serving optimizations.
Description
MLCEngineBase.__init__ is the base constructor shared by both AsyncMLCEngine (asynchronous) and MLCEngine (synchronous). It performs the complete engine initialization sequence:
- Config Validation: Checks that the `engine_config` fields do not conflict with the top-level `model`, `model_lib`, and `mode` arguments. Validates that `kv_cache_page_size` is 16.
- Model Resolution: Parses model paths and additional models into `ModelInfo` objects. If a pre-compiled model library is provided, it is verified to exist on disk. Otherwise, JIT compilation is triggered via `mlc_llm.interface.jit.jit()`.
- Device Detection: If the device is given as a string (e.g., `"cuda"`, `"auto"`), it is resolved to a `tvm.runtime.Device` object via automatic GPU detection.
- Model Config Loading: Reads `mlc-chat-config.json` for each model to populate model configuration dictionaries and extract the conversation template.
- Engine State Creation: Initializes an `EngineState` object that manages async/sync stream callbacks and optional event tracing.
- C++ Engine Instantiation: Creates the underlying threaded C++ engine via TVM's global function registry (`mlc.serve.create_threaded_engine`). Binds FFI functions for `add_request`, `abort_request`, `run_background_loop`, `run_background_stream_back_loop`, `reload`, `init_threaded_engine`, `exit_background_loop`, `create_request`, `get_complete_engine_config`, `reset`, and `debug_call_func_on_all_worker`.
- Tokenizer Loading: Initializes a `Tokenizer` from the primary model path.
- Background Threads: Starts two daemon threads: one for the main inference loop and one for the stream-back loop that delivers generated tokens to the Python layer.
- Engine Reload: Serializes the finalized `EngineConfig` to JSON and passes it to the C++ engine's `reload` function, which allocates the KV cache, sets up the scheduler, and prepares the model for inference.
- Final Config Query: Retrieves the actual engine configuration (with auto-inferred values filled in) via `get_complete_engine_config`, and computes `max_input_sequence_length` as the minimum of `max_single_sequence_length` and `max_total_sequence_length`.
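The final-config step above reduces to a one-line derivation. The helper below is illustrative (not part of the MLC-LLM API); the field names follow the description:

```python
# Illustrative helper (not the real MLC-LLM API): derives the input budget
# exactly as described in the "Final Config Query" step above.
def derive_max_input_sequence_length(
    max_single_sequence_length: int,
    max_total_sequence_length: int,
) -> int:
    # The usable input length is capped by both the per-sequence context
    # window and the total KV-cache capacity.
    return min(max_single_sequence_length, max_total_sequence_length)
```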
Usage
This constructor is not called directly by users. Instead, use AsyncMLCEngine for async serving or MLCEngine for synchronous usage. Both classes inherit from MLCEngineBase and pass either "async" or "sync" as the kind parameter, which determines the callback mechanism used for streaming results.
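The inheritance pattern can be sketched as follows. The class names and bodies here are simplified stand-ins, not the real `AsyncMLCEngine`/`MLCEngine`, which layer full OpenAI-style chat APIs on top of the shared base:

```python
# Simplified sketch of the subclass pattern (stand-in classes, not the
# real MLC-LLM implementations).
class EngineBaseSketch:
    def __init__(self, kind: str, model: str) -> None:
        assert kind in ("async", "sync")
        self.kind = kind    # selects the streaming-callback mechanism
        self.model = model

class AsyncEngineSketch(EngineBaseSketch):
    def __init__(self, model: str) -> None:
        super().__init__(kind="async", model=model)

class SyncEngineSketch(EngineBaseSketch):
    def __init__(self, model: str) -> None:
        super().__init__(kind="sync", model=model)
```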
Code Reference
Source Location
- Repository: MLC-LLM
- File: `python/mlc_llm/serve/engine_base.py` (lines 568-652)
Signature
```python
class MLCEngineBase:
    def __init__(
        self,
        kind: Literal["async", "sync"],
        model: str,
        device: Union[str, tvm.runtime.Device],
        model_lib: Optional[str],
        mode: Literal["local", "interactive", "server"],
        engine_config: Optional[EngineConfig],
        enable_tracing: bool,
    ) -> None:
```
Import
```python
from mlc_llm.serve.engine_base import MLCEngineBase

# Or, more commonly, use the subclasses:
from mlc_llm.serve.engine import AsyncMLCEngine, MLCEngine
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| `kind` | `Literal["async", "sync"]` | Yes | Whether the engine operates in asynchronous or synchronous mode. Determines the callback mechanism for streaming results. |
| `model` | `str` | Yes | Path to the model directory containing `mlc-chat-config.json`, an MLC model identifier, or a HuggingFace repository URL. |
| `device` | `Union[str, tvm.runtime.Device]` | Yes | Target device for inference. Accepts string identifiers (`"cuda"`, `"cuda:0"`, `"metal"`, `"auto"`) or a `tvm.runtime.Device` object. |
| `model_lib` | `Optional[str]` | No | Path to the pre-compiled model library file (e.g., `.so`). If `None`, JIT compilation is triggered automatically. |
| `mode` | `Literal["local", "interactive", "server"]` | Yes | Engine mode preset. `"local"` sets the batch size to 4, `"interactive"` sets it to 1, and `"server"` auto-infers maximum capacity. |
| `engine_config` | `Optional[EngineConfig]` | No | Additional engine configuration. If `None`, a default `EngineConfig()` is created. Explicit fields override mode-based defaults. |
| `enable_tracing` | `bool` | Yes | Whether to enable the event trace recorder for Chrome-compatible tracing output. |
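As a rough illustration of how a device string such as `"cuda:0"` decomposes into a device type and index, a hypothetical parser might look like the sketch below. The real resolution goes through TVM's runtime and includes automatic GPU detection for `"auto"`; this stand-in only shows the string shape:

```python
from typing import Tuple

# Hypothetical parser (not the real MLC-LLM/TVM code): splits a device
# string into (device_type, device_id), defaulting the id to 0.
def parse_device_spec(spec: str) -> Tuple[str, int]:
    dev_type, _, dev_id = spec.partition(":")
    return dev_type, int(dev_id) if dev_id else 0
```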
Outputs
| Name | Type | Description |
|---|---|---|
| (initialized instance) | `MLCEngineBase` | The constructor initializes the engine in place. Key attributes set on the instance include `conv_template` (conversation template), `model_config_dicts` (raw model configs), `state` (`EngineState`), `tokenizer` (`Tokenizer`), `engine_config` (finalized `EngineConfig`), and `max_input_sequence_length` (`int`). |
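The attributes listed in the table can be summarized as a plain container; the field types below are simplified stand-ins for the real `Tokenizer`/`EngineState`/`EngineConfig` objects:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Illustrative container mirroring the attributes the constructor sets;
# types are simplified stand-ins, not the real MLC-LLM classes.
@dataclass
class InitializedEngineAttrs:
    conv_template: Any               # conversation template from mlc-chat-config.json
    model_config_dicts: List[Dict]   # raw per-model configuration dictionaries
    state: Any                       # EngineState managing stream callbacks
    tokenizer: Any                   # Tokenizer loaded from the primary model path
    engine_config: Any               # finalized EngineConfig after reload
    max_input_sequence_length: int   # min of single- and total-sequence limits
```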
Usage Examples
Basic Usage via AsyncMLCEngine
```python
import asyncio

from mlc_llm.serve.engine import AsyncMLCEngine

# AsyncMLCEngine calls MLCEngineBase.__init__ with kind="async"
engine = AsyncMLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="local",
)

# Use the engine for async chat completions
async def main():
    response = await engine.chat.completions.create(
        messages=[{"role": "user", "content": "Hello!"}],
        model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    )
    print(response.choices[0].message.content)
    engine.terminate()

asyncio.run(main())
```
With Advanced Configuration
```python
from mlc_llm.serve.config import EngineConfig
from mlc_llm.serve.engine import AsyncMLCEngine

# Configure speculative decoding and prefix caching
engine = AsyncMLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="server",
    engine_config=EngineConfig(
        gpu_memory_utilization=0.90,
        max_num_sequence=32,
        speculative_mode="eagle",
        spec_draft_length=4,
        prefix_cache_mode="radix",
        prefill_mode="hybrid",
    ),
    enable_tracing=True,
)
```
Synchronous Engine Usage
```python
from mlc_llm.serve.engine import MLCEngine

# MLCEngine calls MLCEngineBase.__init__ with kind="sync"
engine = MLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="interactive",
)

response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
)
print(response.choices[0].message.content)

engine.terminate()
```