Implementation:Mlc ai Mlc llm MLCEngine Generate
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, LLM_Inference |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for processing LLM outputs as incremental token streams rather than waiting for complete generation, provided by MLC-LLM.
Description
MLCEngine._generate is the internal synchronous text generation method that drives the streaming pipeline. It converts the input prompt into the engine's internal data representation, submits a request to the background engine, and then iteratively yields lists of CallbackStreamOutput objects as tokens are generated. Each yielded list contains one element per parallel generation (as specified by generation_config.n). The exception is the final chunk, which is a single-element list carrying the usage statistics.
The method manages the lifecycle of a single generation request: it creates the request via the FFI layer, initializes a synchronous output queue and per-stream TextStreamer instances, adds the request to the engine, and then enters a blocking loop that reads delta outputs from the queue. An ErrorCleanupScope ensures the request is properly aborted if an exception occurs during iteration.
This method is the low-level building block that higher-level interfaces (_chat_completion, _completion) delegate to for actual token generation.
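The lifecycle described above follows a familiar producer/consumer pattern, which can be sketched with the standard library alone. This is an illustration under stated assumptions, not MLC-LLM's code: `fake_engine`, `stream_chunks`, and the `aborted` list are hypothetical stand-ins for the background engine, the generator, and the abort call. The sketch also treats an early close by the consumer as an abort; whether the real ErrorCleanupScope does the same is not confirmed here.

```python
import queue
import threading
from typing import Iterator, List

_DONE = object()  # sentinel marking the end of the stream


def fake_engine(out: queue.Queue, deltas: List[List[str]]) -> None:
    """Hypothetical background engine: emits each delta, then the sentinel."""
    for delta in deltas:
        out.put(delta)
    out.put(_DONE)


def stream_chunks(
    deltas: List[List[str]], request_id: str, aborted: List[str]
) -> Iterator[List[str]]:
    """Yield deltas as they arrive; record an abort if iteration ends early."""
    out: queue.Queue = queue.Queue()
    threading.Thread(target=fake_engine, args=(out, deltas), daemon=True).start()
    completed = False
    try:
        while True:
            item = out.get()  # blocks until the producer delivers something
            if item is _DONE:
                completed = True
                return
            yield item
    finally:
        # Runs on normal exit, on exceptions, and when the consumer closes
        # the generator early; only the latter two count as an abort.
        if not completed:
            aborted.append(request_id)
```

Putting the cleanup in a `finally` block keeps the abort logic in one place, so the consumer does not need to remember to cancel the request when its own processing fails.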
Usage
_generate is an internal method (prefixed with underscore) and is not intended for direct external use. It is called by the engine's chat completion and text completion handlers. Understanding it is useful for debugging generation behavior, extending the engine, or building custom response processing pipelines.
Code Reference
Source Location
- Repository: MLC-LLM
- File: python/mlc_llm/serve/engine.py (lines 1856-1926)
Signature
def _generate(
    self,
    prompt: Union[str, List[int], List[Union[str, List[int], data.Data]]],
    generation_config: GenerationConfig,
    request_id: str,
) -> Iterator[List[engine_base.CallbackStreamOutput]]:
Import
from mlc_llm.serve.engine import MLCEngine
# _generate is an internal method accessed via an engine instance:
engine = MLCEngine(model="path/to/model")
# engine._generate(prompt, generation_config, request_id)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | Union[str, List[int], List[Union[str, List[int], data.Data]]] | Yes | The input prompt, which can be a text string, a list of token IDs, or a list of mixed text strings, token ID lists, and data.Data instances (for multimodal inputs). |
| generation_config | GenerationConfig | Yes | Configuration controlling generation behavior, including n (number of parallel generations), temperature, top_p, max_tokens, stop sequences, and other sampling parameters. |
| request_id | str | Yes | A unique string identifier for this generation request, used for tracking, logging, and abort operations. |
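As a rough aid for preparing these inputs, the sketch below generates a unique request ID and reports which of the three accepted prompt shapes a value has. Both `new_request_id` and `prompt_kind` are hypothetical helpers written for this page; they are not part of MLC-LLM, and the engine performs its own prompt conversion.

```python
import uuid
from typing import Any


def new_request_id(prefix: str = "req") -> str:
    """Return a unique request ID (the naming scheme is an assumption)."""
    return f"{prefix}-{uuid.uuid4().hex}"


def prompt_kind(prompt: Any) -> str:
    """Classify a prompt as 'text', 'token_ids', or 'mixed'."""
    if isinstance(prompt, str):
        return "text"
    if isinstance(prompt, list):
        # A list made up only of ints is a token ID list
        # (an empty list also lands here).
        if all(isinstance(tok, int) for tok in prompt):
            return "token_ids"
        # Anything else is treated as a mixed list of strings,
        # token ID lists, and multimodal data objects.
        return "mixed"
    raise TypeError(f"unsupported prompt type: {type(prompt).__name__}")
```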
Outputs
| Name | Type | Description |
|---|---|---|
| request_output | Iterator[List[CallbackStreamOutput]] | An iterator that yields lists of CallbackStreamOutput objects. Each CallbackStreamOutput contains: delta_text (incremental generated text), delta_logprob_json_strs (optional log probability data), finish_reason (one of "stop", "length", or None), and request_final_usage_json_str (usage stats on the final chunk). |
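One way to consume this output contract is to fold the stream into per-generation text, finish reasons, and parsed usage. The sketch below is self-contained and does not import MLC-LLM: `StreamOutput` is a hypothetical stand-in that carries only the four fields listed in the table, and `collect` is an illustrative helper. It assumes the usage chunk is a single-element list, as described above.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class StreamOutput:
    """Hypothetical stand-in mirroring the CallbackStreamOutput fields."""
    delta_text: str
    delta_logprob_json_strs: Optional[List[str]] = None
    finish_reason: Optional[str] = None
    request_final_usage_json_str: Optional[str] = None


def collect(
    chunks: Iterable[List[StreamOutput]], n: int
) -> Tuple[List[str], List[Optional[str]], Optional[dict]]:
    """Accumulate streamed chunks for n parallel generations."""
    texts = [""] * n
    reasons: List[Optional[str]] = [None] * n
    usage: Optional[dict] = None
    for outputs in chunks:
        # The final chunk carries usage statistics instead of text.
        if outputs and outputs[0].request_final_usage_json_str is not None:
            usage = json.loads(outputs[0].request_final_usage_json_str)
            continue
        for i, out in enumerate(outputs):
            texts[i] += out.delta_text
            if out.finish_reason is not None:
                reasons[i] = out.finish_reason
    return texts, reasons, usage
```

Because the helper accepts any iterable of chunk lists, it can be exercised with hand-built `StreamOutput` values in a unit test before being pointed at a real engine stream.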
Usage Examples
Basic Usage
from mlc_llm.serve.engine import MLCEngine
from mlc_llm.protocol.generation_config import GenerationConfig

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

# Create a generation config
gen_config = GenerationConfig(
    temperature=0.7,
    top_p=0.95,
    max_tokens=128,
    n=1,
)

# Use the internal _generate method for streaming
request_id = "req-001"
for output_list in engine._generate(
    prompt="Once upon a time",
    generation_config=gen_config,
    request_id=request_id,
):
    for output in output_list:
        if output.delta_text:
            print(output.delta_text, end="", flush=True)
        if output.finish_reason is not None:
            print(f"\n[Finished: {output.finish_reason}]")
        if output.request_final_usage_json_str is not None:
            print(f"[Usage: {output.request_final_usage_json_str}]")

engine.terminate()
Multiple Parallel Generations
from mlc_llm.serve.engine import MLCEngine
from mlc_llm.protocol.generation_config import GenerationConfig

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

gen_config = GenerationConfig(
    temperature=1.0,
    max_tokens=64,
    n=3,  # Generate 3 parallel completions
)

# Collect outputs for each parallel generation
texts = [""] * 3
for output_list in engine._generate(
    prompt="The meaning of life is",
    generation_config=gen_config,
    request_id="req-parallel-001",
):
    for i, output in enumerate(output_list):
        # The final usage chunk has a single element with empty delta_text
        if i < len(texts) and output.delta_text:
            texts[i] += output.delta_text

for i, text in enumerate(texts):
    print(f"Generation {i}: {text}")

engine.terminate()