
Implementation:Mlc ai Mlc llm MLCEngine Generate

From Leeroopedia


Knowledge Sources
Domains: Deep_Learning, LLM_Inference
Last Updated: 2026-02-09 00:00 GMT

Overview

MLCEngine._generate is a concrete tool, provided by MLC-LLM, for consuming LLM output as an incremental stream of tokens instead of waiting for generation to complete.

Description

MLCEngine._generate is the internal synchronous text generation method that drives the streaming pipeline. It converts the input prompt into the engine's input data representation, submits a request to the background engine, and then yields lists of CallbackStreamOutput objects as tokens are generated. Each yielded list contains one element per parallel generation (as specified by generation_config.n). The exception is the final chunk, which carries the usage statistics for the request.

The method manages the lifecycle of a single generation request: it creates the request via the FFI layer, initializes a synchronous output queue and per-stream TextStreamer instances, adds the request to the engine, and then enters a blocking loop that reads delta outputs from the queue. An ErrorCleanupScope ensures the request is properly aborted if an exception occurs during iteration.
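The lifecycle described above (a blocking read loop over an output queue, with cleanup if iteration fails) can be sketched with the standard library alone. This is a minimal illustration of the pattern, not the MLC-LLM implementation: stream_from_queue, run_demo, the None sentinel, and the on_error callback are all hypothetical names, with on_error standing in for the abort step that the text attributes to ErrorCleanupScope.

```python
import queue
import threading
from typing import Callable, Iterator, List


def stream_from_queue(
    output_queue: queue.Queue,
    on_error: Callable[[], None],
) -> Iterator[List[str]]:
    """Yield items from a queue until a None sentinel arrives.

    If an exception propagates through the loop, or the consumer closes
    the generator early, on_error runs once so the producer can be aborted.
    """
    try:
        while True:
            item = output_queue.get()  # blocks until the producer pushes
            if item is None:  # hypothetical end-of-stream sentinel
                return
            yield item
    except BaseException:
        on_error()
        raise


def run_demo() -> List[List[str]]:
    """Feed three deltas from a background thread and collect them."""
    q: queue.Queue = queue.Queue()

    def producer() -> None:
        for delta in (["Once"], [" upon"], [" a time"]):
            q.put(delta)
        q.put(None)

    threading.Thread(target=producer, daemon=True).start()
    return list(stream_from_queue(q, on_error=lambda: None))
```

Putting the cleanup in the generator itself means a caller that stops iterating early (for example by breaking out of the loop and closing the generator) still triggers the abort path.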

This method is the low-level building block that higher-level interfaces (_chat_completion, _completion) delegate to for actual token generation.

Usage

_generate is an internal method (prefixed with underscore) and is not intended for direct external use. It is called by the engine's chat completion and text completion handlers. Understanding it is useful for debugging generation behavior, extending the engine, or building custom response processing pipelines.

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/serve/engine.py (lines 1856-1926)

Signature

def _generate(
    self,
    prompt: Union[str, List[int], List[Union[str, List[int], data.Data]]],
    generation_config: GenerationConfig,
    request_id: str,
) -> Iterator[List[engine_base.CallbackStreamOutput]]:

Import

from mlc_llm.serve.engine import MLCEngine

# _generate is an internal method accessed via an engine instance:
engine = MLCEngine(model="path/to/model")
# engine._generate(prompt, generation_config, request_id)

I/O Contract

Inputs

Name | Type | Required | Description
prompt | Union[str, List[int], List[Union[str, List[int], data.Data]]] | Yes | The input prompt, which can be a text string, a list of token IDs, or a list of mixed text strings, token ID lists, and data.Data instances (for multimodal inputs).
generation_config | GenerationConfig | Yes | Configuration controlling generation behavior, including n (number of parallel generations), temperature, top_p, max_tokens, stop sequences, and other sampling parameters.
request_id | str | Yes | A unique string identifier for this generation request, used for tracking, logging, and abort operations.
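To make the three accepted prompt shapes concrete, the sketch below normalizes each of them into a flat list of parts. It is an illustration only: TextPart, TokenPart, and normalize_prompt are hypothetical stand-ins, not the data.Data classes or the conversion routine that MLC-LLM actually uses.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    """Hypothetical stand-in for a text input item."""
    text: str


@dataclass
class TokenPart:
    """Hypothetical stand-in for a token-ID input item."""
    token_ids: List[int]


Part = Union[TextPart, TokenPart]


def normalize_prompt(prompt) -> List[Part]:
    """Map a str, a list of ints, or a mixed list onto a list of parts."""
    if isinstance(prompt, str):
        return [TextPart(prompt)]
    if not isinstance(prompt, list) or not prompt:
        raise ValueError("prompt must be a string or a non-empty list")
    # A flat list of ints is a single token-ID prompt.
    if all(isinstance(item, int) for item in prompt):
        return [TokenPart(list(prompt))]
    parts: List[Part] = []
    for item in prompt:
        if isinstance(item, str):
            parts.append(TextPart(item))
        elif isinstance(item, list) and all(isinstance(t, int) for t in item):
            parts.append(TokenPart(list(item)))
        elif isinstance(item, (TextPart, TokenPart)):
            parts.append(item)  # already-constructed data item
        else:
            raise TypeError(f"unsupported prompt element: {item!r}")
    return parts
```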

Outputs

Name | Type | Description
request_output | Iterator[List[CallbackStreamOutput]] | An iterator that yields lists of CallbackStreamOutput objects. Each CallbackStreamOutput contains: delta_text (incremental generated text), delta_logprob_json_strs (optional log probability data), finish_reason (one of "stop", "length", or None), and request_final_usage_json_str (usage stats on the final chunk).
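One way to consume this output contract is to accumulate delta_text per stream and decode the usage string when it appears. The sketch below uses a stand-in dataclass, StreamOutput, that mirrors only the four fields listed above; it is not the real CallbackStreamOutput class, and the collect helper and the usage key in the example are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class StreamOutput:
    """Stand-in with the four fields named in the output contract."""
    delta_text: str
    delta_logprob_json_strs: Optional[List[str]] = None
    finish_reason: Optional[str] = None
    request_final_usage_json_str: Optional[str] = None


def collect(
    chunks: Iterable[List[StreamOutput]], n: int
) -> Tuple[List[str], List[Optional[str]], Optional[dict]]:
    """Accumulate text and finish reasons per stream, plus parsed usage."""
    texts = [""] * n
    reasons: List[Optional[str]] = [None] * n
    usage: Optional[dict] = None
    for chunk in chunks:
        for i, out in enumerate(chunk):
            if out.request_final_usage_json_str is not None:
                # Usage-only entry: decode it and skip text handling.
                usage = json.loads(out.request_final_usage_json_str)
                continue
            texts[i] += out.delta_text
            if out.finish_reason is not None:
                reasons[i] = out.finish_reason
    return texts, reasons, usage


# Example with n=2; the usage key shown here is illustrative only.
example = [
    [StreamOutput("Hel"), StreamOutput("Good")],
    [StreamOutput("lo", finish_reason="stop"),
     StreamOutput("bye", finish_reason="length")],
    [StreamOutput("", request_final_usage_json_str='{"completion_tokens": 4}')],
]
texts, reasons, usage = collect(example, n=2)
```

Checking the usage field before touching texts keeps the final chunk from being mistaken for ordinary output of the first stream.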

Usage Examples

Basic Usage

from mlc_llm.serve.engine import MLCEngine
from mlc_llm.protocol.generation_config import GenerationConfig

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

# Create a generation config
gen_config = GenerationConfig(
    temperature=0.7,
    top_p=0.95,
    max_tokens=128,
    n=1,
)

# Use the internal _generate method for streaming
request_id = "req-001"
try:
    for output_list in engine._generate(
        prompt="Once upon a time",
        generation_config=gen_config,
        request_id=request_id,
    ):
        for output in output_list:
            if output.delta_text:
                print(output.delta_text, end="", flush=True)
            if output.finish_reason is not None:
                print(f"\n[Finished: {output.finish_reason}]")
            # Only the final chunk carries usage statistics
            if output.request_final_usage_json_str is not None:
                print(f"[Usage: {output.request_final_usage_json_str}]")
finally:
    # Shut down the background engine even if streaming raises
    engine.terminate()

Multiple Parallel Generations

from mlc_llm.serve.engine import MLCEngine
from mlc_llm.protocol.generation_config import GenerationConfig

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

gen_config = GenerationConfig(
    temperature=1.0,
    max_tokens=64,
    n=3,  # Generate 3 parallel completions
)

# Collect outputs for each parallel generation
texts = [""] * gen_config.n
for output_list in engine._generate(
    prompt="The meaning of life is",
    generation_config=gen_config,
    request_id="req-parallel-001",
):
    # The final usage-only chunk may be shorter than n, hence the bounds check
    for i, output in enumerate(output_list):
        if i < len(texts) and output.delta_text:
            texts[i] += output.delta_text

for i, text in enumerate(texts):
    print(f"Generation {i}: {text}")

engine.terminate()

Related Pages

Implements Principle
