Implementation:Mlc ai Mlc llm Sync Engine
| Knowledge Sources | |
|---|---|
| Domains | Serving Engine, Text Generation, LLM Inference |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
A synchronous Python wrapper around the MLC LLM C++ inference engine. It provides a simple request-based interface for text generation and is primarily used for testing and debugging.
Description
The sync_engine module provides SyncMLCEngine, a synchronous (blocking) interface to the MLC LLM serving engine. Unlike the production async engine, this implementation directly wraps the C++ engine without multi-threading or OpenAI API compatibility, making it simpler but less suitable for production serving.
Key Components:
- _create_tvm_module: A helper function that instantiates a TVM module from a registered global function creator and extracts the named FFI (Foreign Function Interface) methods into a dictionary for convenient access.
- SyncMLCEngine: The main engine class that manages the full lifecycle of LLM inference:
  - Initialization: Validates the engine configuration, parses model paths and libraries, detects the device (CUDA/Metal/etc.), loads model configs, creates the C++ engine via TVM FFI, initializes the tokenizer, and optionally sets up event trace recording. The FFI exposes methods: init, add_request, abort_request, step, reset, json_metrics, get/set_request_stream_callback, and create_request.
  - generate: The primary batch generation method. Accepts one or more prompts (strings, token ID lists, or multi-modal data lists) and generation configs. It:
    - Saves the current stream callback
    - Installs a custom callback that accumulates generated tokens using TextStreamer for detokenization
    - Creates and adds requests to the engine
    - Runs the step loop until all generations complete
    - Restores the original callback
    - Returns output texts and optional logprob strings
  - create_request: Creates a request object from input data and generation config, delegating to the C++ engine's request factory.
  - add_request: Submits a request to the engine for processing.
  - abort_request: Cancels an in-progress generation by request ID.
  - step: Executes a single engine step, which may involve prefilling new requests or decoding existing ones, and triggers callbacks for finished tokens.
  - reset: Clears all engine state including running data and metrics.
  - metrics: Returns engine performance metrics as an EngineMetrics object.
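The method-extraction step described for _create_tvm_module can be pictured with a small stand-in. The sketch below does not use TVM: `FakeModule` and `extract_methods` are hypothetical names introduced for illustration, and the only assumption carried over is that a module can be indexed by function name.

```python
from typing import Callable, Dict, List


class FakeModule:
    """Hypothetical stand-in for a TVM module: functions are looked up by name."""

    def __init__(self) -> None:
        self._steps = 0
        self._funcs: Dict[str, Callable] = {
            "step": self._step,
            "reset": self._reset,
        }

    def _step(self) -> int:
        self._steps += 1
        return self._steps

    def _reset(self) -> None:
        self._steps = 0

    def __getitem__(self, name: str) -> Callable:
        return self._funcs[name]


def extract_methods(module, names: List[str]) -> Dict[str, Callable]:
    """Collect the named functions of a module into a plain dictionary."""
    return {name: module[name] for name in names}


ffi = extract_methods(FakeModule(), ["step", "reset"])
ffi["step"]()
second = ffi["step"]()       # dictionary entries share the module's state
ffi["reset"]()
after_reset = ffi["step"]()  # counting starts again after reset
```

Keeping the functions in a dictionary means call sites look like `ffi["step"]()`, which matches the `engine._ffi[...]` access pattern used elsewhere on this page.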
Usage
Use SyncMLCEngine for testing, debugging, and simple batch inference scenarios where synchronous blocking behavior is acceptable. It is not recommended for production serving due to the lack of async/multi-threaded request handling. The engine supports configurable modes ("local", "interactive", "server") that control engine parameters such as KV cache sizes.
Code Reference
Source Location
- Repository: Mlc_ai_Mlc_llm
- File: python/mlc_llm/serve/sync_engine.py
Signature
class SyncMLCEngine:
    def __init__(
        self,
        model: str,
        device: Union[str, tvm.runtime.Device] = "auto",
        *,
        model_lib: Optional[str] = None,
        mode: Literal["local", "interactive", "server"] = "local",
        engine_config: Optional[EngineConfig] = None,
        enable_tracing: bool = False,
        request_stream_callback: Optional[Callable[[List[data.RequestStreamOutput]], None]] = None,
    )
    def generate(
        self,
        prompts: Union[str, List[str], List[int], List[List[int]], List[List[data.Data]]],
        generation_config: Union[GenerationConfig, List[GenerationConfig]],
    ) -> Tuple[List[List[str]], List[Optional[List[List[str]]]]]
    def create_request(
        self, request_id: str, inputs: Union[data.Data, List[data.Data]],
        generation_config: GenerationConfig,
    ) -> Request
    def add_request(self, request: Request) -> None
    def abort_request(self, request_id: str) -> None
    def step(self) -> None
    def reset(self) -> None
    def metrics(self) -> EngineMetrics
Import
from mlc_llm.serve.sync_engine import SyncMLCEngine
I/O Contract
__init__
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | (required) | Path or identifier for the model |
| device | Union[str, tvm.runtime.Device] | "auto" | Device to run inference on |
| model_lib | Optional[str] | None | Path to compiled model library |
| mode | Literal["local", "interactive", "server"] | "local" | Engine operating mode |
| engine_config | Optional[EngineConfig] | None | Additional engine configuration |
| enable_tracing | bool | False | Enable event trace recording |
| request_stream_callback | Optional[Callable] | None | Callback for streaming results |
generate
| Parameter | Type | Description |
|---|---|---|
| prompts | Union[str, List[str], List[int], List[List[int]], List[List[data.Data]]] | Input prompts (strings, token IDs, or multi-modal data) |
| generation_config | Union[GenerationConfig, List[GenerationConfig]] | Generation parameters (shared or per-prompt) |
| Return | Type | Description |
|---|---|---|
| output_texts | List[List[str]] | Generated texts per prompt; inner list length equals config.n |
| output_logprobs_str | List[Optional[List[List[str]]]] | Logprob JSON strings per token per prompt, or None |
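The nesting of the two return values is easy to misread, so here is a minimal sketch of walking them. It uses made-up sample values rather than a real engine call, `iter_choices` is a hypothetical helper, and it assumes that the first inner list of each logprob entry is indexed by choice, as the type suggests.

```python
from typing import Iterator, List, Optional, Tuple


def iter_choices(
    output_texts: List[List[str]],
    output_logprobs: List[Optional[List[List[str]]]],
) -> Iterator[Tuple[int, int, str, Optional[List[str]]]]:
    """Yield (prompt index, choice index, text, per-token logprob strings or None)."""
    for i, (texts, logprobs) in enumerate(zip(output_texts, output_logprobs)):
        for j, text in enumerate(texts):
            yield i, j, text, None if logprobs is None else logprobs[j]


# Made-up values shaped like a result for two prompts with n=2 and no logprobs.
sample_texts = [["a1", "a2"], ["b1", "b2"]]
sample_logprobs = [None, None]
rows = list(iter_choices(sample_texts, sample_logprobs))
```

Each row pairs one generated text with its prompt and choice indices, so `output_texts[i][j]` is choice `j` for prompt `i`.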
Stream Callback Signature
| Parameter | Type | Description |
|---|---|---|
| delta_outputs | List[data.RequestStreamOutput] | List of stream outputs, each containing request_id, delta token IDs, extra prefix strings, logprobs, and finish reason |
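A callback with this signature typically accumulates deltas per request. The sketch below illustrates that pattern with hypothetical stand-in classes (`FakeStreamOutput`, `FakeSingleOutput`) instead of the real data.RequestStreamOutput; the `unpack()` shape mirrors the usage example on this page, and the field names are assumptions based on the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FakeSingleOutput:
    """Hypothetical stand-in for one generation's delta within a stream output."""
    delta_token_ids: List[int]
    finish_reason: Optional[str] = None


@dataclass
class FakeStreamOutput:
    """Hypothetical stand-in for data.RequestStreamOutput."""
    request_id: str
    stream_outputs: List[FakeSingleOutput] = field(default_factory=list)

    def unpack(self) -> Tuple[str, List[FakeSingleOutput]]:
        return self.request_id, self.stream_outputs


tokens: Dict[str, List[List[int]]] = {}   # request_id -> token IDs per choice
finished: Dict[str, List[bool]] = {}      # request_id -> finished flag per choice


def on_delta(delta_outputs: List[FakeStreamOutput]) -> None:
    """Append each delta to its request and record which choices have finished."""
    for output in delta_outputs:
        request_id, stream_outputs = output.unpack()
        per_choice = tokens.setdefault(request_id, [[] for _ in stream_outputs])
        done = finished.setdefault(request_id, [False] * len(stream_outputs))
        for i, single in enumerate(stream_outputs):
            per_choice[i].extend(single.delta_token_ids)
            if single.finish_reason is not None:
                done[i] = True


# Simulate two engine steps delivering deltas for the same request.
on_delta([FakeStreamOutput("req_0", [FakeSingleOutput([1, 2])])])
on_delta([FakeStreamOutput("req_0", [FakeSingleOutput([3], finish_reason="stop")])])
```

Tracking a finished flag per choice is what lets a caller decide when to stop calling step().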
Generation Flow
The generate method follows this sequence:
- Convert input prompts to data.Data objects (TextData or TokenData)
- Create per-prompt TextStreamer instances for incremental detokenization
- Install a local callback that accumulates generated text via the streamers
- Submit all requests to the engine via add_request
- Loop calling step() until all generations are finished
- Restore the original stream callback
- Return accumulated output texts and logprobs
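The sequence above can be sketched against a toy engine. This is not the MLC LLM implementation: `FakeEngine` is a hypothetical stand-in that echoes each prompt back one word per step, the `(request_id, text, finished)` delta shape is invented for the example, and detokenization is skipped entirely. Only the save, install, step, and restore pattern is meant to carry over.

```python
from typing import Callable, Dict, List, Optional, Tuple

Delta = Tuple[str, str, bool]  # (request_id, delta_text, finished), hypothetical shape


class FakeEngine:
    """Hypothetical stand-in engine: each step emits one word per pending request."""

    def __init__(self) -> None:
        self.callback: Optional[Callable[[List[Delta]], None]] = None
        self._pending: Dict[str, List[str]] = {}

    def add_request(self, request_id: str, prompt: str) -> None:
        # "Generation" here just echoes the prompt back word by word.
        self._pending[request_id] = prompt.split() or [""]

    def step(self) -> None:
        deltas: List[Delta] = []
        for request_id in list(self._pending):
            words = self._pending[request_id]
            word = words.pop(0)
            if not words:
                del self._pending[request_id]
            deltas.append((request_id, word, not words))
        if self.callback is not None:
            self.callback(deltas)


def generate(engine: FakeEngine, prompts: List[str]) -> List[str]:
    original = engine.callback            # save the current callback
    outputs = [""] * len(prompts)
    remaining = len(prompts)

    def accumulate(deltas: List[Delta]) -> None:
        nonlocal remaining
        for request_id, text, finished in deltas:
            idx = int(request_id)
            outputs[idx] += (" " if outputs[idx] else "") + text
            if finished:
                remaining -= 1

    engine.callback = accumulate          # install the local callback
    try:
        for i, prompt in enumerate(prompts):
            engine.add_request(str(i), prompt)
        while remaining > 0:              # step until every request finishes
            engine.step()
    finally:
        engine.callback = original        # restore the original callback
    return outputs


engine = FakeEngine()
results = generate(engine, ["hello world", "one two three"])
```

The `try`/`finally` is a choice made in this sketch so that the caller's callback is restored even if a step raises; the source does not state how the real method handles that case.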
Usage Examples
from mlc_llm.serve.sync_engine import SyncMLCEngine
from mlc_llm.protocol.generation_config import GenerationConfig

# Create the synchronous engine
engine = SyncMLCEngine(
    model="/path/to/model",
    device="cuda",
    mode="local",
)

# Generate text for a single prompt
output_texts, output_logprobs = engine.generate(
    prompts="What is machine learning?",
    generation_config=GenerationConfig(
        temperature=0.7,
        top_p=0.9,
        max_tokens=256,
    ),
)
print(output_texts[0][0])

# Generate text for multiple prompts (n=2 yields two outputs per prompt)
output_texts, _ = engine.generate(
    prompts=["Hello world", "Tell me a joke"],
    generation_config=GenerationConfig(temperature=1.0, n=2),
)
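
# Using the step-based interface with a custom callback
from mlc_llm.serve import data

gen_config = GenerationConfig(max_tokens=32)
done = False

def my_callback(delta_outputs):
    global done
    for output in delta_outputs:
        request_id, stream_outputs = output.unpack()
        for stream_output in stream_outputs:
            print(f"Request {request_id}: {stream_output.delta_token_ids}")
            # Assumes finish_reason is set once the (single) generation ends
            if stream_output.finish_reason is not None:
                done = True

# _ffi is internal; the request_stream_callback constructor argument is the public route
engine._ffi["set_request_stream_callback"](my_callback)
request = engine.create_request("req_0", [data.TextData("Hi")], gen_config)
engine.add_request(request)
while not done:
    engine.step()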