
Implementation: mlc-ai/mlc-llm MLCEngineBase.__init__

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Serving, Systems_Engineering
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tool for initializing the MLC-LLM inference engine together with its advanced serving optimizations.

Description

MLCEngineBase.__init__ is the base constructor shared by both AsyncMLCEngine (asynchronous) and MLCEngine (synchronous). It performs the complete engine initialization sequence:

  1. Config Validation: Checks that the engine_config fields do not conflict with the top-level model, model_lib, and mode arguments. Validates that kv_cache_page_size is 16.
  2. Model Resolution: Parses model paths and additional models into ModelInfo objects. If a pre-compiled model library is provided, it is verified to exist on disk. Otherwise, JIT compilation is triggered via mlc_llm.interface.jit.jit().
  3. Device Detection: If the device is given as a string (e.g., "cuda", "auto"), it is resolved to a tvm.runtime.Device object via automatic GPU detection.
  4. Model Config Loading: Reads mlc-chat-config.json for each model to populate model configuration dictionaries and extract the conversation template.
  5. Engine State Creation: Initializes an EngineState object that manages async/sync stream callbacks and optional event tracing.
  6. C++ Engine Instantiation: Creates the underlying threaded C++ engine via TVM's global function registry (mlc.serve.create_threaded_engine). Binds FFI functions for add_request, abort_request, run_background_loop, run_background_stream_back_loop, reload, init_threaded_engine, exit_background_loop, create_request, get_complete_engine_config, reset, and debug_call_func_on_all_worker.
  7. Tokenizer Loading: Initializes a Tokenizer from the primary model path.
  8. Background Threads: Starts two daemon threads -- one for the main inference loop and one for the stream-back loop that delivers generated tokens to the Python layer.
  9. Engine Reload: Serializes the finalized EngineConfig to JSON and passes it to the C++ engine's reload function, which allocates KV cache, sets up the scheduler, and prepares the model for inference.
  10. Final Config Query: Retrieves the actual engine configuration (with auto-inferred values filled in) via get_complete_engine_config, and computes max_input_sequence_length as the minimum of max_single_sequence_length and max_total_sequence_length.
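The initialization sequence above can be sketched in miniature with plain Python. This is an illustrative stand-in, not the real MLC-LLM code: the class and config names (SketchEngine, SketchEngineConfig) are hypothetical, but the checks mirror the documented behavior -- the kv_cache_page_size == 16 validation (step 1), string device resolution (step 3), and the max_input_sequence_length computation (step 10).

```python
# Hypothetical sketch of the MLCEngineBase.__init__ sequence; names are
# illustrative and not part of the MLC-LLM API.
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class SketchEngineConfig:
    kv_cache_page_size: int = 16
    max_single_sequence_length: int = 4096
    max_total_sequence_length: int = 8192


class SketchEngine:
    def __init__(
        self,
        kind: Literal["async", "sync"],
        model: str,
        device: str,
        engine_config: Optional[SketchEngineConfig] = None,
    ) -> None:
        # Step 1: config validation (mirrors the kv_cache_page_size check).
        config = engine_config or SketchEngineConfig()
        if config.kv_cache_page_size != 16:
            raise ValueError("KV cache page size is required to be 16.")
        # Step 3: resolve a string device such as "auto" to a concrete one
        # (the real code returns a tvm.runtime.Device via GPU detection).
        self.device = "cuda" if device == "auto" else device
        # Step 10: max input length is the minimum of the single-sequence
        # and total-sequence limits.
        self.max_input_sequence_length = min(
            config.max_single_sequence_length,
            config.max_total_sequence_length,
        )


engine = SketchEngine(kind="sync", model="dist/models/demo", device="auto")
print(engine.max_input_sequence_length)  # min(4096, 8192) -> 4096
```

The sketch omits model resolution, FFI binding, tokenizer loading, and the background threads, which require the TVM runtime and a compiled model library.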

Usage

This constructor is not called directly by users. Instead, use AsyncMLCEngine for async serving or MLCEngine for synchronous usage. Both classes inherit from MLCEngineBase and pass either "async" or "sync" as the kind parameter, which determines the callback mechanism used for streaming results.
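The subclassing pattern can be sketched as follows. The class names here (SketchBase, SketchAsyncEngine, SketchSyncEngine) are hypothetical stand-ins; the point is only how thin subclasses fix the kind argument of a shared base constructor, as AsyncMLCEngine and MLCEngine do.

```python
# Hypothetical sketch of the subclass pattern; not the real engine classes.
from typing import Literal


class SketchBase:
    def __init__(self, kind: Literal["async", "sync"], model: str) -> None:
        # In the real engine, `kind` selects the streaming-callback
        # mechanism wired into EngineState.
        self.kind = kind
        self.model = model


class SketchAsyncEngine(SketchBase):
    def __init__(self, model: str) -> None:
        super().__init__(kind="async", model=model)


class SketchSyncEngine(SketchBase):
    def __init__(self, model: str) -> None:
        super().__init__(kind="sync", model=model)


print(SketchAsyncEngine("demo").kind)  # async
```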

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/serve/engine_base.py (Lines 568-652)

Signature

class MLCEngineBase:
    def __init__(
        self,
        kind: Literal["async", "sync"],
        model: str,
        device: Union[str, tvm.runtime.Device],
        model_lib: Optional[str],
        mode: Literal["local", "interactive", "server"],
        engine_config: Optional[EngineConfig],
        enable_tracing: bool,
    ) -> None:

Import

from mlc_llm.serve.engine_base import MLCEngineBase
# Or, more commonly, use the subclasses:
from mlc_llm.serve.engine import AsyncMLCEngine, MLCEngine

I/O Contract

Inputs

  • kind (Literal["async", "sync"], required): Whether the engine operates in asynchronous or synchronous mode. Determines the callback mechanism for streaming results.
  • model (str, required): Path to the model directory containing mlc-chat-config.json, or an MLC model identifier, or a HuggingFace repository URL.
  • device (Union[str, tvm.runtime.Device], required): Target device for inference. Accepts string identifiers ("cuda", "cuda:0", "metal", "auto") or a tvm.runtime.Device object.
  • model_lib (Optional[str], optional): Path to the pre-compiled model library file (e.g., .so). If None, JIT compilation is triggered automatically.
  • mode (Literal["local", "interactive", "server"], required): Engine mode preset. "local" sets batch size to 4, "interactive" sets batch size to 1, "server" auto-infers maximum capacity.
  • engine_config (Optional[EngineConfig], optional): Additional engine configuration. If None, a default EngineConfig() is created. Explicit fields override mode-based defaults.
  • enable_tracing (bool, required): Whether to enable the event trace recorder for Chrome-compatible tracing output.
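The mode presets listed above can be captured in a small lookup. The helper name below is hypothetical, but the semantics follow the table: "local" presets a batch size of 4, "interactive" presets 1, and "server" leaves it to be auto-inferred from GPU capacity.

```python
# Illustrative mapping of the documented `mode` presets; the helper name
# is hypothetical and not part of the MLC-LLM API.
from typing import Literal, Optional


def preset_max_num_sequence(
    mode: Literal["local", "interactive", "server"],
) -> Optional[int]:
    """Return the preset batch size, or None when it is auto-inferred."""
    presets = {"local": 4, "interactive": 1, "server": None}
    return presets[mode]


print(preset_max_num_sequence("local"))  # 4
```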

Outputs

  • (initialized instance) (MLCEngineBase): The constructor initializes the engine in place. Key attributes set on the instance include conv_template (conversation template), model_config_dicts (raw model configs), state (EngineState), tokenizer (Tokenizer), engine_config (finalized EngineConfig), and max_input_sequence_length (int).

Usage Examples

Basic Usage via AsyncMLCEngine

from mlc_llm.serve.engine import AsyncMLCEngine

# AsyncMLCEngine calls MLCEngineBase.__init__ with kind="async"
engine = AsyncMLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="local",
)

# Use the engine for async chat completions
import asyncio

async def main():
    response = await engine.chat.completions.create(
        messages=[{"role": "user", "content": "Hello!"}],
        model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    )
    print(response.choices[0].message.content)
    engine.terminate()

asyncio.run(main())

With Advanced Configuration

from mlc_llm.serve.engine import AsyncMLCEngine
from mlc_llm.serve.config import EngineConfig

# Configure speculative decoding and prefix caching
engine = AsyncMLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="server",
    engine_config=EngineConfig(
        gpu_memory_utilization=0.90,
        max_num_sequence=32,
        speculative_mode="eagle",
        spec_draft_length=4,
        prefix_cache_mode="radix",
        prefill_mode="hybrid",
    ),
    enable_tracing=True,
)

Synchronous Engine Usage

from mlc_llm.serve.engine import MLCEngine

# MLCEngine calls MLCEngineBase.__init__ with kind="sync"
engine = MLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="interactive",
)

response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
)
print(response.choices[0].message.content)
engine.terminate()

Related Pages

Implements Principle

Environment and Heuristic Links
