Implementation:Mlc ai Mlc llm MLCEngine Generate

Knowledge Sources	MLC-LLM
Domains	Deep_Learning, LLM_Inference
Last Updated	2026-02-09 00:00 GMT

Overview

A concrete tool, provided by MLC-LLM, for consuming LLM output as an incremental token stream instead of waiting for generation to complete.

Description

MLCEngine._generate is the internal synchronous text generation method that drives the streaming pipeline. It converts the input prompt into the engine's data representation, submits a request to the background engine, and then iteratively yields lists of CallbackStreamOutput objects as tokens are generated. Each yielded list contains one element per parallel generation (as specified by generation_config.n). The exception is the final chunk, which carries the request's usage statistics rather than generated text.
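The shape of the yielded chunks can be illustrated without a running engine. The sketch below uses a hypothetical stand-in dataclass, FakeStreamOutput, that mirrors the four CallbackStreamOutput fields described on this page. It assumes the final usage chunk is a single-element list, and the JSON payload is made up for illustration. Neither FakeStreamOutput nor is_usage_chunk is part of MLC-LLM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeStreamOutput:
    """Hypothetical stand-in mirroring the CallbackStreamOutput fields."""

    delta_text: str
    delta_logprob_json_strs: Optional[List[str]] = None
    finish_reason: Optional[str] = None
    request_final_usage_json_str: Optional[str] = None


def is_usage_chunk(chunk: List[FakeStreamOutput]) -> bool:
    """Return True if any element of the chunk carries usage statistics."""
    return any(o.request_final_usage_json_str is not None for o in chunk)


# Simulated stream for n=2: two delta chunks, then the usage chunk.
# The single-element usage chunk and its JSON payload are assumptions.
simulated_stream = [
    [FakeStreamOutput("Hel"), FakeStreamOutput("Good")],
    [
        FakeStreamOutput("lo", finish_reason="stop"),
        FakeStreamOutput("bye", finish_reason="length"),
    ],
    [FakeStreamOutput("", request_final_usage_json_str='{"completion_tokens": 4}')],
]

# Separate text-bearing chunks from the usage chunk.
delta_chunks = [c for c in simulated_stream if not is_usage_chunk(c)]
usage_chunks = [c for c in simulated_stream if is_usage_chunk(c)]
```

A consumer that indexes into each chunk by generation number should therefore check for the usage chunk first, or guard the index, before reading delta_text.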

The method manages the lifecycle of a single generation request: it creates the request via the FFI layer, initializes a synchronous output queue and per-stream TextStreamer instances, adds the request to the engine, and then enters a blocking loop that reads delta outputs from the queue. An ErrorCleanupScope ensures the request is properly aborted if an exception occurs during iteration.
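The lifecycle described above follows a common producer/consumer pattern. The following is a minimal sketch of that pattern using only the Python standard library; it is not the MLC-LLM implementation. ToyEngine, add_request, abort_request, and the None end-of-stream sentinel are hypothetical names chosen for illustration, and a plain try/except stands in for ErrorCleanupScope.

```python
import queue
import threading
from typing import Iterator, List


class ToyEngine:
    """Hypothetical stand-in for a background engine; not MLC-LLM code."""

    def __init__(self) -> None:
        self.aborted: List[str] = []

    def add_request(self, tokens: List[str], out: "queue.Queue") -> None:
        # A background thread pushes one delta per token, then a sentinel.
        def worker() -> None:
            for token in tokens:
                out.put(token)
            out.put(None)  # assumed end-of-stream marker

        threading.Thread(target=worker, daemon=True).start()

    def abort_request(self, request_id: str) -> None:
        self.aborted.append(request_id)

    def generate(self, request_id: str, tokens: List[str]) -> Iterator[str]:
        out: "queue.Queue" = queue.Queue()  # synchronous output queue
        self.add_request(tokens, out)
        try:
            while True:
                delta = out.get()  # blocks until the producer emits output
                if delta is None:
                    break
                yield delta
        except BaseException:
            # Reached on errors thrown into the generator and on early close().
            self.abort_request(request_id)
            raise


engine = ToyEngine()
full_text = "".join(engine.generate("req-1", ["Hello", ", ", "world"]))

# Abandon a second request early; closing the generator triggers the abort.
stream = engine.generate("req-2", ["a", "b", "c"])
first = next(stream)
stream.close()
```

The design point is that cleanup is tied to the generator itself: a request that runs to completion is never aborted, while one that is closed or fails mid-iteration is.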

This method is the low-level building block that higher-level interfaces (_chat_completion, _completion) delegate to for actual token generation.

Usage

_generate is an internal method (prefixed with underscore) and is not intended for direct external use. It is called by the engine's chat completion and text completion handlers. Understanding it is useful for debugging generation behavior, extending the engine, or building custom response processing pipelines.

Code Reference

Source Location

Repository: MLC-LLM
File: python/mlc_llm/serve/engine.py (lines 1856-1926)

Signature

def _generate(
    self,
    prompt: Union[str, List[int], List[Union[str, List[int], data.Data]]],
    generation_config: GenerationConfig,
    request_id: str,
) -> Iterator[List[engine_base.CallbackStreamOutput]]:

Import

from mlc_llm.serve.engine import MLCEngine

# _generate is an internal method accessed via an engine instance:
engine = MLCEngine(model="path/to/model")
# engine._generate(prompt, generation_config, request_id)

I/O Contract

Inputs

Name	Type	Required	Description
prompt	`Union[str, List[int], List[Union[str, List[int], data.Data]]]`	Yes	The input prompt, which can be a text string, a list of token IDs, or a list of mixed text strings, token ID lists, and `data.Data` instances (for multimodal inputs).
generation_config	`GenerationConfig`	Yes	Configuration controlling generation behavior, including `n` (number of parallel generations), `temperature`, `top_p`, `max_tokens`, `stop` sequences, and other sampling parameters.
request_id	`str`	Yes	A unique string identifier for this generation request, used for tracking, logging, and abort operations.

Outputs

Name	Type	Description
request_output	`Iterator[List[CallbackStreamOutput]]`	An iterator that yields lists of `CallbackStreamOutput` objects. Each `CallbackStreamOutput` contains: `delta_text` (incremental generated text), `delta_logprob_json_strs` (optional log probability data), `finish_reason` (one of `"stop"`, `"length"`, or `None`), and `request_final_usage_json_str` (usage stats on the final chunk).
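
One way to consume this output is to fold the stream into per-generation results. The sketch below is a hypothetical helper, collect_stream, that relies only on the four attribute names listed above. It is exercised with types.SimpleNamespace stand-ins rather than a real engine, and the usage keys in the sample JSON are illustrative assumptions, not a documented schema.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Optional, Tuple


def collect_stream(
    stream: Iterable[List[Any]], n: int
) -> Tuple[List[str], List[Optional[str]], Optional[Dict[str, Any]]]:
    """Fold streamed chunks into full texts, finish reasons, and usage."""
    texts = [""] * n
    finish_reasons: List[Optional[str]] = [None] * n
    usage: Optional[Dict[str, Any]] = None
    for chunk in stream:
        for i, output in enumerate(chunk):
            if output.request_final_usage_json_str is not None:
                # Usage arrives as a JSON string on the final chunk.
                usage = json.loads(output.request_final_usage_json_str)
                continue
            if i < n:
                texts[i] += output.delta_text
                if output.finish_reason is not None:
                    finish_reasons[i] = output.finish_reason
    return texts, finish_reasons, usage


def fake_output(text="", finish=None, usage=None):
    """Hypothetical stand-in exposing the CallbackStreamOutput attributes."""
    return SimpleNamespace(
        delta_text=text,
        delta_logprob_json_strs=None,
        finish_reason=finish,
        request_final_usage_json_str=usage,
    )


# Simulated stream for n=2; the usage keys are assumptions.
fake_stream = [
    [fake_output("Once"), fake_output("The")],
    [fake_output(" more", "stop"), fake_output(" end", "length")],
    [fake_output(usage='{"prompt_tokens": 5, "completion_tokens": 4}')],
]
texts, reasons, usage = collect_stream(fake_stream, n=2)
```

With a real engine, the first argument would be the iterator returned by engine._generate(...), and n would match generation_config.n.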

Usage Examples

Basic Usage

from mlc_llm.serve.engine import MLCEngine
from mlc_llm.protocol.generation_config import GenerationConfig

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

# Create a generation config
gen_config = GenerationConfig(
    temperature=0.7,
    top_p=0.95,
    max_tokens=128,
    n=1,
)

# Stream with the internal _generate method; always shut the engine down.
request_id = "req-001"
try:
    for output_list in engine._generate(
        prompt="Once upon a time",
        generation_config=gen_config,
        request_id=request_id,
    ):
        for output in output_list:
            if output.delta_text:
                print(output.delta_text, end="", flush=True)
            if output.finish_reason is not None:
                print(f"\n[Finished: {output.finish_reason}]")
            # Only the final chunk carries usage statistics.
            if output.request_final_usage_json_str is not None:
                print(f"[Usage: {output.request_final_usage_json_str}]")
finally:
    engine.terminate()

Multiple Parallel Generations

from mlc_llm.serve.engine import MLCEngine
from mlc_llm.protocol.generation_config import GenerationConfig

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

gen_config = GenerationConfig(
    temperature=1.0,
    max_tokens=64,
    n=3,  # Generate 3 parallel completions
)

# Accumulate the text of each parallel generation separately.
texts = [""] * gen_config.n
try:
    for output_list in engine._generate(
        prompt="The meaning of life is",
        generation_config=gen_config,
        request_id="req-parallel-001",
    ):
        for i, output in enumerate(output_list):
            # The final usage chunk carries no generated text, so the
            # delta_text check skips it; the bounds check is a safeguard.
            if i < len(texts) and output.delta_text:
                texts[i] += output.delta_text
finally:
    engine.terminate()

for i, text in enumerate(texts):
    print(f"Generation {i}: {text}")

Related Pages

Implements Principle

Principle:Mlc_ai_Mlc_llm_Streaming_Response_Processing
