
Implementation:Mlc ai Mlc llm Sync Engine



Knowledge Sources
Domains: Serving Engine, Text Generation, LLM Inference
Last Updated: 2026-02-09 19:00 GMT

Overview

A synchronous Python wrapper around the MLC LLM C++ inference engine. It provides a simple request-based interface for text generation and is used primarily for testing and debugging.

Description

The sync_engine module provides SyncMLCEngine, a synchronous (blocking) interface to the MLC LLM serving engine. Unlike the production async engine, this implementation directly wraps the C++ engine without multi-threading or OpenAI API compatibility, making it simpler but less suitable for production serving.

Key Components:

  • _create_tvm_module: A helper function that instantiates a TVM module from a registered global function creator and extracts the named FFI (Foreign Function Interface) methods into a dictionary for convenient access.
  • SyncMLCEngine: The main engine class that manages the full lifecycle of LLM inference:
    • Initialization: Validates the engine configuration, parses model paths and libraries, detects the device (CUDA, Metal, etc.), loads model configs, creates the C++ engine via TVM FFI, initializes the tokenizer, and optionally sets up event trace recording. The FFI exposes the methods init, add_request, abort_request, step, reset, json_metrics, get_request_stream_callback, set_request_stream_callback, and create_request.
    • generate: The primary batch generation method. Accepts one or more prompts (strings, token ID lists, or multi-modal data lists) and generation configs. It:
      1. Saves the current stream callback
      2. Installs a custom callback that accumulates generated tokens using TextStreamer for detokenization
      3. Creates and adds requests to the engine
      4. Runs the step loop until all generations complete
      5. Restores the original callback
      6. Returns output texts and optional logprob strings
    • create_request: Creates a request object from input data and generation config, delegating to the C++ engine's request factory.
    • add_request: Submits a request to the engine for processing.
    • abort_request: Cancels an in-progress generation by request ID.
    • step: Executes a single engine step, which may prefill new requests or decode existing ones, and invokes the stream callback with the newly generated tokens.
    • reset: Clears all engine state including running data and metrics.
    • metrics: Returns engine performance metrics as an EngineMetrics object.
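
The helper pattern described for _create_tvm_module (look up named methods once and keep them in a dictionary) can be illustrated without TVM. The following is a minimal sketch, assuming a stand-in module object; FakeModule and collect_methods are hypothetical names introduced for illustration and are not part of MLC LLM or TVM.

```python
from typing import Callable, Dict, List


class FakeModule:
    """Hypothetical stand-in for a TVM module; not part of MLC LLM."""

    def __init__(self) -> None:
        self._steps = 0

    def __getitem__(self, name: str) -> Callable:
        # Map method names onto ordinary bound methods for illustration only.
        methods = {
            "step": self._step,
            "reset": self._reset,
            "num_steps": self._num_steps,
        }
        return methods[name]

    def _step(self) -> None:
        self._steps += 1

    def _reset(self) -> None:
        self._steps = 0

    def _num_steps(self) -> int:
        return self._steps


def collect_methods(module: FakeModule, names: List[str]) -> Dict[str, Callable]:
    """Look up each named method once and cache it in a dictionary."""
    return {name: module[name] for name in names}


ffi = collect_methods(FakeModule(), ["step", "reset", "num_steps"])
ffi["step"]()
ffi["step"]()
steps_before_reset = ffi["num_steps"]()
ffi["reset"]()
steps_after_reset = ffi["num_steps"]()
```

Keeping the looked-up callables in a dictionary means later calls such as ffi["step"]() avoid repeating the name lookup and give the wrapper class one place to see which methods it depends on.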

Usage

Use SyncMLCEngine for testing, debugging, and simple batch inference scenarios where synchronous blocking behavior is acceptable. It is not recommended for production serving due to the lack of async/multi-threaded request handling. The engine supports configurable modes ("local", "interactive", "server") that control engine parameters such as KV cache sizes.

Code Reference

Source Location

Signature

class SyncMLCEngine:
    def __init__(
        self,
        model: str,
        device: Union[str, tvm.runtime.Device] = "auto",
        *,
        model_lib: Optional[str] = None,
        mode: Literal["local", "interactive", "server"] = "local",
        engine_config: Optional[EngineConfig] = None,
        enable_tracing: bool = False,
        request_stream_callback: Optional[Callable[[List[data.RequestStreamOutput]], None]] = None,
    )

    def generate(
        self,
        prompts: Union[str, List[str], List[int], List[List[int]], List[List[data.Data]]],
        generation_config: Union[GenerationConfig, List[GenerationConfig]],
    ) -> Tuple[List[List[str]], List[Optional[List[List[str]]]]]

    def create_request(
        self, request_id: str, inputs: Union[data.Data, List[data.Data]],
        generation_config: GenerationConfig,
    ) -> Request

    def add_request(self, request: Request) -> None
    def abort_request(self, request_id: str) -> None
    def step(self) -> None
    def reset(self) -> None
    def metrics(self) -> EngineMetrics

Import

from mlc_llm.serve.sync_engine import SyncMLCEngine

I/O Contract

__init__

Parameter | Type | Default | Description
--- | --- | --- | ---
model | str | (required) | Path or identifier for the model
device | Union[str, tvm.runtime.Device] | "auto" | Device to run inference on
model_lib | Optional[str] | None | Path to compiled model library
mode | Literal["local", "interactive", "server"] | "local" | Engine operating mode
engine_config | Optional[EngineConfig] | None | Additional engine configuration
enable_tracing | bool | False | Enable event trace recording
request_stream_callback | Optional[Callable] | None | Callback for streaming results

generate

Parameter | Type | Description
--- | --- | ---
prompts | Union[str, List[str], List[int], List[List[int]], List[List[data.Data]]] | Input prompts (strings, token IDs, or multi-modal data)
generation_config | Union[GenerationConfig, List[GenerationConfig]] | Generation parameters (shared or per-prompt)

Return | Type | Description
--- | --- | ---
output_texts | List[List[str]] | Generated texts per prompt; inner list length equals config.n
output_logprobs_str | List[Optional[List[List[str]]]] | Logprob JSON strings per token per prompt, or None
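
The nesting of the two return values is easier to see with data in hand. The sketch below does not call the engine; it uses hand-written placeholder values shaped like the documented return types for two prompts with n=2, and a hypothetical helper flatten_results that walks them.

```python
from typing import List, Optional, Tuple

# Hand-written sample shaped like the documented return value of generate
# for two prompts with n=2; the strings are placeholders, not model output.
output_texts: List[List[str]] = [
    ["text A1", "text A2"],
    ["text B1", "text B2"],
]
# None marks a prompt for which no logprobs are available.
output_logprobs_str: List[Optional[List[List[str]]]] = [
    None,
    [['{"placeholder": 1}'], ['{"placeholder": 2}', '{"placeholder": 3}']],
]


def flatten_results(
    texts: List[List[str]],
    logprobs: List[Optional[List[List[str]]]],
) -> List[Tuple[int, int, str, int]]:
    """Return (prompt index, choice index, text, number of logprob entries)."""
    rows = []
    for prompt_idx, choices in enumerate(texts):
        per_prompt = logprobs[prompt_idx]
        for choice_idx, text in enumerate(choices):
            num_entries = 0 if per_prompt is None else len(per_prompt[choice_idx])
            rows.append((prompt_idx, choice_idx, text, num_entries))
    return rows


rows = flatten_results(output_texts, output_logprobs_str)
for row in rows:
    print(row)
```

The outer index selects the prompt and the inner index selects one of the n generations for that prompt; the logprob entry for a prompt has to be checked for None before indexing into it.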

Stream Callback Signature

Parameter | Type | Description
--- | --- | ---
delta_outputs | List[data.RequestStreamOutput] | List of stream outputs, each containing request_id, delta token IDs, extra prefix strings, logprobs, and finish reason
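
A callback with this signature typically accumulates deltas per request. The following is a minimal sketch that runs without MLC LLM installed: FakeSingleOutput and FakeRequestStreamOutput are hypothetical stand-ins that mimic only the fields listed above, and the sketch assumes one generation per request (n=1).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FakeSingleOutput:
    """Hypothetical stand-in for the delta of one generation."""
    delta_token_ids: List[int]
    finish_reason: Optional[str] = None


@dataclass
class FakeRequestStreamOutput:
    """Hypothetical stand-in for data.RequestStreamOutput."""
    request_id: str
    stream_outputs: List[FakeSingleOutput]

    def unpack(self) -> Tuple[str, List[FakeSingleOutput]]:
        return self.request_id, self.stream_outputs


collected: Dict[str, List[int]] = {}
finished: Dict[str, str] = {}


def collecting_callback(delta_outputs: List[FakeRequestStreamOutput]) -> None:
    """Accumulate token IDs per request and record finish reasons."""
    for output in delta_outputs:
        request_id, stream_outputs = output.unpack()
        # Assumes n=1, so all deltas of a request go into one list.
        for stream_output in stream_outputs:
            collected.setdefault(request_id, []).extend(stream_output.delta_token_ids)
            if stream_output.finish_reason is not None:
                finished[request_id] = stream_output.finish_reason


# Simulate two callback invocations for the same request.
collecting_callback([FakeRequestStreamOutput("req_0", [FakeSingleOutput([1, 2])])])
collecting_callback([FakeRequestStreamOutput("req_0", [FakeSingleOutput([3], "stop")])])
```

Recording the finish reason separately gives the caller a simple way to decide when to stop calling step.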

Generation Flow

The generate method follows this sequence:

  1. Convert input prompts to data.Data objects (TextData or TokenData)
  2. Create per-prompt TextStreamer instances for incremental detokenization
  3. Install a local callback that accumulates generated text via the streamers
  4. Submit all requests to the engine via add_request
  5. Loop calling step() until all generations are finished
  6. Restore the original stream callback
  7. Return accumulated output texts and logprobs
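
The sequence above can be sketched with a toy engine. This is not the MLC LLM implementation: ToyEngine, its echo behaviour, the simplified (request_id, text, finished) delta format, and the try/finally around the loop are all assumptions made for illustration. Only the save, install, submit, step, restore ordering follows the list above.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (request_id, delta_text, finished) is a simplified, hypothetical delta format.
Delta = Tuple[int, str, bool]
Callback = Callable[[List[Delta]], None]


class ToyEngine:
    """Hypothetical in-memory engine that echoes prompt words back.

    Prompts are assumed to contain at least one word.
    """

    def __init__(self) -> None:
        self.callback: Optional[Callback] = None
        self.pending: Dict[int, List[str]] = {}

    def add_request(self, request_id: int, prompt: str) -> None:
        self.pending[request_id] = prompt.split()

    def step(self) -> None:
        # Emit one word per unfinished request, then report the deltas.
        deltas: List[Delta] = []
        for request_id in list(self.pending):
            words = self.pending[request_id]
            word = words.pop(0)
            done = not words
            if done:
                del self.pending[request_id]
            deltas.append((request_id, word, done))
        if self.callback is not None:
            self.callback(deltas)


def toy_generate(engine: ToyEngine, prompts: List[str]) -> List[str]:
    original_callback = engine.callback  # save the current callback
    outputs = [""] * len(prompts)
    num_finished = 0

    def accumulate(deltas: List[Delta]) -> None:
        nonlocal num_finished
        for request_id, text, done in deltas:
            outputs[request_id] += (" " if outputs[request_id] else "") + text
            num_finished += int(done)

    engine.callback = accumulate  # install the local callback
    try:
        for request_id, prompt in enumerate(prompts):
            engine.add_request(request_id, prompt)  # submit all requests
        while num_finished != len(prompts):  # step until all have finished
            engine.step()
    finally:
        engine.callback = original_callback  # restore the original callback
    return outputs


toy_engine = ToyEngine()
results = toy_generate(toy_engine, ["hello world", "one two three"])
```

Because the accumulating callback is installed only for the duration of the call, any callback the caller registered earlier is back in place once the call returns.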

Usage Examples

from mlc_llm.serve.sync_engine import SyncMLCEngine
from mlc_llm.protocol.generation_config import GenerationConfig

# Create the synchronous engine
engine = SyncMLCEngine(
    model="/path/to/model",
    device="cuda",
    mode="local",
)

# Generate text for a single prompt
output_texts, output_logprobs = engine.generate(
    prompts="What is machine learning?",
    generation_config=GenerationConfig(
        temperature=0.7,
        top_p=0.9,
        max_tokens=256,
    ),
)
print(output_texts[0][0])

# Generate text for multiple prompts
output_texts, _ = engine.generate(
    prompts=["Hello world", "Tell me a joke"],
    generation_config=GenerationConfig(temperature=1.0, n=2),
)

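# Using the step-based interface with a custom callback
from mlc_llm.serve import data

gen_config = GenerationConfig(temperature=0.7, max_tokens=64)
num_finished = 0  # generations that have reported a finish reason

def my_callback(delta_outputs):
    global num_finished
    for output in delta_outputs:
        request_id, stream_outputs = output.unpack()
        for stream_output in stream_outputs:
            print(f"Request {request_id}: {stream_output.delta_token_ids}")
            if stream_output.finish_reason is not None:
                num_finished += 1

engine._ffi["set_request_stream_callback"](my_callback)
request = engine.create_request("req_0", [data.TextData("Hi")], gen_config)
engine.add_request(request)
# Step until the single generation (n=1) has finished
while num_finished < 1:
    engine.step()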
