Implementation:Mlc ai Mlc llm ChatCompletion Create

Knowledge Sources	MLC-LLM
Domains	Deep_Learning, LLM_Inference
Last Updated	2026-02-09 00:00 GMT

Overview

A concrete tool, provided by MLC-LLM, that exposes an OpenAI-compatible chat completion interface for multi-turn conversations with system, user, and assistant message roles.

Description

ChatCompletion.create is the synchronous chat completion method exposed through MLCEngine.chat.completions.create(). It accepts a list of conversation messages and generation parameters, then delegates to the engine's internal _chat_completion method. The method constructs a ChatCompletionRequest protocol object, applies the model's conversation template to format the prompt, tokenizes the input, and invokes the underlying generation engine. Depending on the stream parameter, it returns either a complete ChatCompletionResponse or an Iterator of ChatCompletionStreamResponse chunks.

The method mirrors the OpenAI Chat Completion API specification, including support for function/tool calling, logprob reporting, logit bias injection, and structured response formats.

Usage

Use ChatCompletion.create for synchronous chat-based inference with MLCEngine. It is the primary interface for generating assistant responses from a sequence of conversation messages. Set stream=True to receive output incrementally, or leave stream at its default of False to receive a single complete response.

Code Reference

Source Location

Repository: MLC-LLM
File: python/mlc_llm/serve/engine.py (lines 369-442)

Signature

def create(
    self,
    *,
    messages: List[Dict[str, Any]],
    model: Optional[str] = None,
    frequency_penalty: Optional[float] = None,
    presence_penalty: Optional[float] = None,
    logprobs: bool = False,
    top_logprobs: int = 0,
    logit_bias: Optional[Dict[int, float]] = None,
    max_tokens: Optional[int] = None,
    n: int = 1,
    seed: Optional[int] = None,
    stop: Optional[Union[str, List[str]]] = None,
    stream: bool = False,
    stream_options: Optional[Dict[str, Any]] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    tools: Optional[List[Dict[str, Any]]] = None,
    tool_choice: Optional[Union[Literal["none", "auto"], Dict]] = None,
    user: Optional[str] = None,
    response_format: Optional[Dict[str, Any]] = None,
    request_id: Optional[str] = None,
    extra_body: Optional[Dict[str, Any]] = None,
) -> Union[
    Iterator[openai_api_protocol.ChatCompletionStreamResponse],
    openai_api_protocol.ChatCompletionResponse,
]:

Import

from mlc_llm.serve import MLCEngine

# Access via engine instance:
engine = MLCEngine(model="path/to/model")
engine.chat.completions.create(...)

I/O Contract

Inputs

Name	Type	Required	Description
messages	`List[Dict[str, Any]]`	Yes	A list of message dictionaries, each containing `"role"` (one of `"system"`, `"user"`, `"assistant"`, `"tool"`) and `"content"` (the message text or structured content).
model	`Optional[str]`	No	Model identifier. If `None`, uses the engine's loaded model.
frequency_penalty	`Optional[float]`	No	Penalizes tokens based on their frequency in the text so far. Range: `[-2.0, 2.0]`.
presence_penalty	`Optional[float]`	No	Penalizes tokens based on whether they have appeared in the text so far. Range: `[-2.0, 2.0]`.
logprobs	`bool`	No	Whether to return log probabilities of output tokens. Defaults to `False`.
top_logprobs	`int`	No	Number of most likely tokens to return log probabilities for at each position. Requires `logprobs=True`. Defaults to `0`.
logit_bias	`Optional[Dict[int, float]]`	No	A mapping from token IDs to bias values (`-100` to `100`) applied to logits before sampling.
max_tokens	`Optional[int]`	No	Maximum number of tokens to generate. If `None`, uses the model's default.
n	`int`	No	Number of chat completion choices to generate for each input message. Defaults to `1`.
seed	`Optional[int]`	No	Random seed for reproducible generation.
stop	`Optional[Union[str, List[str]]]`	No	One or more sequences where the model will stop generating further tokens.
stream	`bool`	No	If `True`, returns an iterator of partial message deltas. Defaults to `False`.
stream_options	`Optional[Dict[str, Any]]`	No	Additional options for streaming (e.g., `{"include_usage": True}` to get usage stats in stream).
temperature	`Optional[float]`	No	Sampling temperature. Higher values increase randomness. Range: `[0, 2]`.
top_p	`Optional[float]`	No	Nucleus sampling threshold. Only tokens comprising the top `p` probability mass are considered.
tools	`Optional[List[Dict[str, Any]]]`	No	A list of tool definitions the model may call, each describing a function with name, description, and parameters.
tool_choice	`Optional[Union[Literal["none", "auto"], Dict]]`	No	Controls whether the model calls a tool: `"none"` disables, `"auto"` lets the model decide, or a dict to force a specific tool.
user	`Optional[str]`	No	A unique identifier representing the end-user for abuse monitoring.
response_format	`Optional[Dict[str, Any]]`	No	Constrains the output format (e.g., `{"type": "json_object"}`).
request_id	`Optional[str]`	No	An optional request identifier. If not provided, a random UUID prefixed with `"chatcmpl-"` is generated.
extra_body	`Optional[Dict[str, Any]]`	No	Extra body options, such as `{"debug_config": {...}}` for debugging.
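
Several inputs in the table above carry documented ranges or dependencies. The following is a minimal sketch of a client-side pre-check, assuming you want to catch out-of-range values before calling create. The helper name check_sampling_params is hypothetical and is not part of MLC-LLM; the engine may apply its own validation with different behavior.

```python
from typing import Any, Dict, List


def check_sampling_params(params: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of problems; an empty list means none found."""
    problems: List[str] = []

    # Closed ranges copied from the Inputs table.
    ranges = {
        "frequency_penalty": (-2.0, 2.0),
        "presence_penalty": (-2.0, 2.0),
        "temperature": (0.0, 2.0),
    }
    for name, (low, high) in ranges.items():
        value = params.get(name)
        if value is not None and not low <= value <= high:
            problems.append(f"{name}={value} is outside [{low}, {high}]")

    # top_logprobs is documented as requiring logprobs=True.
    if params.get("top_logprobs", 0) > 0 and not params.get("logprobs", False):
        problems.append("top_logprobs > 0 requires logprobs=True")

    # Each logit bias value is documented as lying between -100 and 100.
    for token_id, bias in (params.get("logit_bias") or {}).items():
        if not -100 <= bias <= 100:
            problems.append(f"logit_bias[{token_id}]={bias} is outside [-100, 100]")

    return problems


# Usage: inspect the keyword arguments before passing them to create(**kwargs).
kwargs = {"temperature": 0.7, "logprobs": True, "top_logprobs": 2}
issues = check_sampling_params(kwargs)
```

Returning a list of messages instead of raising on the first problem lets the caller report every issue at once.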

Outputs

Name	Type	Description
response	`ChatCompletionResponse`	When `stream=False`: a complete response object containing `choices` (each with a `message`), `usage` statistics, and metadata.
stream_response	`Iterator[ChatCompletionStreamResponse]`	When `stream=True`: an iterator yielding partial response chunks, each containing a `delta` with incremental content.
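
When stream=True, the caller usually needs to reassemble the incremental delta content into the full reply. The sketch below shows one way to do that. The names collect_stream_text and fake_chunk are hypothetical; fake_chunk only builds stand-in objects with the choices[0].delta.content shape described above, so the example runs without a loaded model. The guard for chunks with no choices is an assumption, intended to cover usage-only chunks.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def collect_stream_text(chunks: Iterable[Any]) -> str:
    """Hypothetical helper: join the delta content of the first choice across chunks."""
    parts = []
    for chunk in chunks:
        # Assumption: some chunks (e.g. usage-only ones) may have no choices.
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:  # skip None and empty strings
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in shaped like a stream chunk, for illustration only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [fake_chunk("Hel"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("lo")]
full_text = collect_stream_text(demo_chunks)
```

With a real engine, the iterator returned by engine.chat.completions.create(..., stream=True) would take the place of demo_chunks.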

Usage Examples

Basic Usage

from mlc_llm.serve import MLCEngine

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

# Non-streaming chat completion
response = engine.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to compute Fibonacci numbers."},
    ],
    max_tokens=512,
    temperature=0.7,
)

print(response.choices[0].message.content)
engine.terminate()

Streaming Usage

from mlc_llm.serve import MLCEngine

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

# Streaming chat completion
for chunk in engine.chat.completions.create(
    messages=[
        {"role": "user", "content": "Explain the theory of relativity."},
    ],
    stream=True,
    max_tokens=256,
    temperature=0.5,
    top_p=0.9,
):
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()

engine.terminate()

Multi-Turn Conversation

from mlc_llm.serve import MLCEngine

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

conversation = [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is a derivative?"},
]

# First turn
response = engine.chat.completions.create(messages=conversation, max_tokens=200)
assistant_msg = response.choices[0].message.content
conversation.append({"role": "assistant", "content": assistant_msg})

# Second turn
conversation.append({"role": "user", "content": "Can you give an example?"})
response = engine.chat.completions.create(messages=conversation, max_tokens=200)
print(response.choices[0].message.content)

engine.terminate()
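
The tools and tool_choice inputs take plain dictionaries. The sketch below builds such a request payload, assuming the OpenAI-style function tool layout (a "type" of "function" plus a "function" entry with name, description, and JSON Schema parameters). The helpers make_function_tool and force_tool and the get_weather tool are hypothetical, chosen only for illustration. Whether a given model actually emits tool calls depends on the model and its conversation template.

```python
from typing import Any, Dict


def make_function_tool(name: str, description: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: build one tool definition in the OpenAI-style layout."""
    return {
        "type": "function",
        "function": {"name": name, "description": description, "parameters": parameters},
    }


def force_tool(name: str) -> Dict[str, Any]:
    """Hypothetical helper: build a tool_choice dict that names a specific function."""
    return {"type": "function", "function": {"name": name}}


# get_weather is an invented example tool, not something MLC-LLM provides.
weather_tool = make_function_tool(
    "get_weather",
    "Look up the current weather for a city.",
    {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
)

request_kwargs = {
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",  # or force_tool("get_weather") to name the tool explicitly
}

# With a loaded engine, the payload would be passed as:
# response = engine.chat.completions.create(**request_kwargs)
```

Keeping the payload as ordinary dictionaries means it can be inspected or serialized before the engine is involved.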

Related Pages

Implements Principle

Principle:Mlc_ai_Mlc_llm_Chat_Completion_Interface
