Implementation:Mlc ai Mlc llm Sync Engine

Knowledge Sources	Mlc_ai_Mlc_llm
Domains	Serving Engine, Text Generation, LLM Inference
Last Updated	2026-02-09 19:00 GMT

Overview

A synchronous Python wrapper around the MLC LLM C++ inference engine that provides a simple request-based interface for text generation. It is intended primarily for testing and debugging.

Description

The sync_engine module provides SyncMLCEngine, a synchronous (blocking) interface to the MLC LLM serving engine. Unlike the production async engine, this implementation directly wraps the C++ engine without multi-threading or OpenAI API compatibility, making it simpler but less suitable for production serving.

Key Components:

_create_tvm_module: A helper function that instantiates a TVM module from a registered global function creator and extracts the named FFI (Foreign Function Interface) methods into a dictionary for convenient access.
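The pattern behind this helper can be sketched without TVM installed. The snippet below is a minimal illustration, not the module's actual code: `_GLOBAL_FUNCS`, `FakeModule`, `create_module`, and the `"demo.create_engine"` key are hypothetical stand-ins for TVM's global function registry and a real runtime module.

```python
from typing import Any, Callable, Dict, Optional, Sequence

# Hypothetical stand-in for TVM's registry of global functions.
_GLOBAL_FUNCS: Dict[str, Callable[..., Any]] = {}


class FakeModule:
    """Stand-in for a runtime module whose methods are looked up by name."""

    def __init__(self) -> None:
        self._steps = 0

    def __getitem__(self, name: str) -> Callable[..., Any]:
        # Resolve "step" to the bound method self._step, and so on.
        return getattr(self, "_" + name)

    def _step(self) -> None:
        self._steps += 1

    def _json_metrics(self) -> str:
        return '{"steps": %d}' % self._steps


def create_module(
    creator: str,
    ffi_funcs: Sequence[str],
    creator_args: Optional[list] = None,
) -> Dict[str, Callable[..., Any]]:
    """Instantiate a module via a registered creator and pull out named methods."""
    module = _GLOBAL_FUNCS[creator](*(creator_args or []))
    return {name: module[name] for name in ffi_funcs}


# Register the fake creator, then build the method dictionary.
_GLOBAL_FUNCS["demo.create_engine"] = FakeModule
ffi = create_module("demo.create_engine", ["step", "json_metrics"])
ffi["step"]()
ffi["step"]()
metrics_json = ffi["json_metrics"]()
```

Because every entry in the dictionary is bound to the same module instance, state changed through one entry (here, the step counter) is visible through the others.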

SyncMLCEngine: The main engine class that manages the full lifecycle of LLM inference:

- Initialization: Validates the engine configuration, parses model paths and libraries, detects the device (CUDA/Metal/etc.), loads model configs, creates the C++ engine via TVM FFI, initializes the tokenizer, and optionally sets up event trace recording. The FFI exposes the methods init, add_request, abort_request, step, reset, json_metrics, get_request_stream_callback, set_request_stream_callback, and create_request.

- generate: The primary batch generation method. Accepts one or more prompts (strings, token ID lists, or multi-modal data lists) and generation configs. It:

  1. Saves the current stream callback
  2. Installs a custom callback that accumulates generated tokens using TextStreamer for detokenization
  3. Creates and adds requests to the engine
  4. Runs the step loop until all generations complete
  5. Restores the original callback
  6. Returns output texts and optional logprob strings

- create_request: Creates a request object from input data and generation config, delegating to the C++ engine's request factory.

- add_request: Submits a request to the engine for processing.

- abort_request: Cancels an in-progress generation by request ID.

- step: Executes a single engine step, which may involve prefilling new requests or decoding existing ones, and triggers callbacks for finished tokens.

- reset: Clears all engine state including running data and metrics.

- metrics: Returns engine performance metrics as an EngineMetrics object.

Usage

Use SyncMLCEngine for testing, debugging, and simple batch inference scenarios where synchronous blocking behavior is acceptable. It is not recommended for production serving due to the lack of async/multi-threaded request handling. The engine supports configurable modes ("local", "interactive", "server") that control engine parameters such as KV cache sizes.

Code Reference

Source Location

Repository: Mlc_ai_Mlc_llm
File: python/mlc_llm/serve/sync_engine.py

Signature

class SyncMLCEngine:
    def __init__(
        self,
        model: str,
        device: Union[str, tvm.runtime.Device] = "auto",
        *,
        model_lib: Optional[str] = None,
        mode: Literal["local", "interactive", "server"] = "local",
        engine_config: Optional[EngineConfig] = None,
        enable_tracing: bool = False,
        request_stream_callback: Optional[Callable[[List[data.RequestStreamOutput]], None]] = None,
    )

    def generate(
        self,
        prompts: Union[str, List[str], List[int], List[List[int]], List[List[data.Data]]],
        generation_config: Union[GenerationConfig, List[GenerationConfig]],
    ) -> Tuple[List[List[str]], List[Optional[List[List[str]]]]]

    def create_request(
        self, request_id: str, inputs: Union[data.Data, List[data.Data]],
        generation_config: GenerationConfig,
    ) -> Request

    def add_request(self, request: Request) -> None
    def abort_request(self, request_id: str) -> None
    def step(self) -> None
    def reset(self) -> None
    def metrics(self) -> EngineMetrics

Import

from mlc_llm.serve.sync_engine import SyncMLCEngine

I/O Contract

__init__

Parameter	Type	Default	Description
model	str	(required)	Path or identifier for the model
device	Union[str, tvm.runtime.Device]	"auto"	Device to run inference on
model_lib	Optional[str]	None	Path to compiled model library
mode	Literal["local", "interactive", "server"]	"local"	Engine operating mode
engine_config	Optional[EngineConfig]	None	Additional engine configuration
enable_tracing	bool	False	Enable event trace recording
request_stream_callback	Optional[Callable]	None	Callback for streaming results

generate

Parameter	Type	Description
prompts	Union[str, List[str], List[int], List[List[int]], List[List[data.Data]]]	Input prompts (strings, token IDs, or multi-modal data)
generation_config	Union[GenerationConfig, List[GenerationConfig]]	Generation parameters (shared or per-prompt)

Return	Type	Description
output_texts	List[List[str]]	Generated texts per prompt; inner list length equals config.n
output_logprobs_str	List[Optional[List[List[str]]]]	One entry per prompt: a list (one per generated choice) of per-token logprob JSON strings, or None
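As a rough illustration of how the nested return values can be indexed, the snippet below uses hand-written placeholder strings shaped like the documented types. Nothing here is real model output, and the `rows` name is only for illustration.

```python
from typing import List, Optional, Tuple

# Placeholder values shaped like the documented return types
# (two prompts, n=2 choices each); not real model output.
output_texts: List[List[str]] = [
    ["answer A1", "answer A2"],  # prompt 0
    ["answer B1", "answer B2"],  # prompt 1
]
# One entry per prompt; None stands for "no logprob strings for this prompt".
output_logprobs_str: List[Optional[List[List[str]]]] = [None, None]

# Flatten to (prompt_index, choice_index, text) for easy inspection.
rows: List[Tuple[int, int, str]] = [
    (prompt_idx, choice_idx, text)
    for prompt_idx, choices in enumerate(output_texts)
    for choice_idx, text in enumerate(choices)
]

for prompt_idx, choice_idx, text in rows:
    print(f"prompt {prompt_idx}, choice {choice_idx}: {text}")
```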

Stream Callback Signature

Parameter	Type	Description
delta_outputs	List[data.RequestStreamOutput]	List of stream outputs, each containing request_id, delta token IDs, extra prefix strings, logprobs, and finish reason

Generation Flow

The generate method follows this sequence:

1. Convert input prompts to data.Data objects (TextData or TokenData)
2. Create per-prompt TextStreamer instances for incremental detokenization
3. Install a local callback that accumulates generated text via the streamers
4. Submit all requests to the engine via add_request
5. Loop calling step() until all generations are finished
6. Restore the original stream callback
7. Return accumulated output texts and logprobs
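The sequence above can be sketched with a toy engine that needs neither a model nor TVM. This is an illustrative approximation under stated assumptions, not the real implementation: `ToyEngine`, the `Delta` tuple, and the word-echoing "generation" are hypothetical, detokenization is skipped, and the `try`/`finally` around the loop is an addition of this sketch.

```python
from typing import Callable, Dict, List, Tuple

# Simplified stand-in for a stream output: (request_id, text piece, finished).
Delta = Tuple[str, str, bool]


class ToyEngine:
    """Hypothetical engine that emits one word per request per step."""

    def __init__(self) -> None:
        self.callback: Callable[[List[Delta]], None] = lambda deltas: None
        self._pending: Dict[str, List[str]] = {}

    def add_request(self, request_id: str, prompt: str) -> None:
        # "Generation" here just echoes the prompt's words back.
        self._pending[request_id] = prompt.split() or [""]

    def step(self) -> None:
        deltas: List[Delta] = []
        for request_id in list(self._pending):
            words = self._pending[request_id]
            word = words.pop(0)
            finished = not words
            deltas.append((request_id, word, finished))
            if finished:
                del self._pending[request_id]
        if deltas:
            self.callback(deltas)


def generate(engine: ToyEngine, prompts: List[str]) -> List[str]:
    pieces: List[List[str]] = [[] for _ in prompts]
    num_finished = 0

    def on_delta(deltas: List[Delta]) -> None:
        nonlocal num_finished
        for request_id, piece, finished in deltas:
            pieces[int(request_id)].append(piece)
            num_finished += int(finished)

    saved = engine.callback        # save the current callback
    engine.callback = on_delta     # install the accumulating callback
    try:
        for i, prompt in enumerate(prompts):
            engine.add_request(str(i), prompt)   # submit every request
        while num_finished < len(prompts):       # step until all are done
            engine.step()
    finally:
        engine.callback = saved    # restore the original callback
    return [" ".join(p) for p in pieces]


engine = ToyEngine()
original_callback = engine.callback
texts = generate(engine, ["hello there world", "hi"])
```

Using the prompt index as the request ID lets the callback route each delta back to the right accumulator, and restoring the saved callback leaves the engine usable by step-based callers afterwards.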

Usage Examples

from mlc_llm.protocol.generation_config import GenerationConfig
from mlc_llm.serve import data
from mlc_llm.serve.sync_engine import SyncMLCEngine

# Create the synchronous engine
engine = SyncMLCEngine(
    model="/path/to/model",
    device="cuda",
    mode="local",
)

# Generate text for a single prompt
output_texts, output_logprobs = engine.generate(
    prompts="What is machine learning?",
    generation_config=GenerationConfig(
        temperature=0.7,
        top_p=0.9,
        max_tokens=256,
    ),
)
print(output_texts[0][0])

# Generate text for multiple prompts; with n=2, each output_texts[i]
# holds two completions for prompt i
output_texts, _ = engine.generate(
    prompts=["Hello world", "Tell me a joke"],
    generation_config=GenerationConfig(temperature=1.0, n=2),
)

# Using the step-based interface with a custom callback
gen_config = GenerationConfig(max_tokens=64)
done = False

def my_callback(delta_outputs):
    global done
    for output in delta_outputs:
        request_id, stream_outputs = output.unpack()
        for stream_output in stream_outputs:
            print(f"Request {request_id}: {stream_output.delta_token_ids}")
            # Assumes a non-None finish_reason marks the end of generation
            if stream_output.finish_reason is not None:
                done = True

engine._ffi["set_request_stream_callback"](my_callback)
request = engine.create_request("req_0", [data.TextData("Hi")], gen_config)
engine.add_request(request)
while not done:
    engine.step()

Related Pages

Implementation:Mlc_ai_Mlc_llm_Gemma2_Model
