Implementation:Mlc ai Mlc llm ChatCompletion Create
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, LLM_Inference |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool, provided by MLC-LLM, that exposes an OpenAI-compatible chat completion interface for multi-turn conversations built from system, user, and assistant messages.
Description
ChatCompletion.create is the synchronous chat completion method exposed through MLCEngine.chat.completions.create(). It accepts a list of conversation messages and generation parameters, then delegates to the engine's internal _chat_completion method. The method constructs a ChatCompletionRequest protocol object, applies the model's conversation template to format the prompt, tokenizes the input, and invokes the underlying generation engine. Depending on the stream parameter, it returns either a complete ChatCompletionResponse or an Iterator of ChatCompletionStreamResponse chunks.
The method mirrors the OpenAI Chat Completion API specification, including support for function/tool calling, logprob reporting, logit bias injection, and structured response formats.
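The tool-calling support mentioned above takes tool definitions as plain dictionaries. The following is a minimal sketch of how such a request payload might be assembled, assuming the standard OpenAI "function" tool envelope. The helper `make_function_tool` and the `get_current_weather` tool are hypothetical names introduced only for illustration, and the engine call itself is left as a comment because it needs a loaded model.

```python
import json
from typing import Any, Dict


def make_function_tool(name: str, description: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a JSON Schema parameter spec in the OpenAI-style "function" tool envelope."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": parameters,
        },
    }


# Hypothetical tool, used only to illustrate the payload shape.
weather_tool = make_function_tool(
    name="get_current_weather",
    description="Look up the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
)

# Keyword arguments matching the create() signature documented on this page.
request_kwargs: Dict[str, Any] = {
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
    "max_tokens": 128,
}

# The payload is plain JSON-serializable data, so it can be inspected or logged.
print(json.dumps(request_kwargs, indent=2))

# With a loaded engine (not run here):
# response = engine.chat.completions.create(**request_kwargs)
```

Building the keyword arguments as a dictionary first keeps the request easy to log and to reuse across calls.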
Usage
Use ChatCompletion.create when performing synchronous chat-based inference with MLCEngine. This is the primary interface for generating assistant responses from a sequence of conversation messages. Set stream=True for incremental output delivery or leave it as False (default) for a single complete response.
Code Reference
Source Location
- Repository: MLC-LLM
- File: python/mlc_llm/serve/engine.py (lines 369-442)
Signature
def create(
    self,
    *,
    messages: List[Dict[str, Any]],
    model: Optional[str] = None,
    frequency_penalty: Optional[float] = None,
    presence_penalty: Optional[float] = None,
    logprobs: bool = False,
    top_logprobs: int = 0,
    logit_bias: Optional[Dict[int, float]] = None,
    max_tokens: Optional[int] = None,
    n: int = 1,
    seed: Optional[int] = None,
    stop: Optional[Union[str, List[str]]] = None,
    stream: bool = False,
    stream_options: Optional[Dict[str, Any]] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    tools: Optional[List[Dict[str, Any]]] = None,
    tool_choice: Optional[Union[Literal["none", "auto"], Dict]] = None,
    user: Optional[str] = None,
    response_format: Optional[Dict[str, Any]] = None,
    request_id: Optional[str] = None,
    extra_body: Optional[Dict[str, Any]] = None,
) -> Union[
    Iterator[openai_api_protocol.ChatCompletionStreamResponse],
    openai_api_protocol.ChatCompletionResponse,
]:
Import
from mlc_llm.serve import MLCEngine
# Access via engine instance:
engine = MLCEngine(model="path/to/model")
engine.chat.completions.create(...)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| messages | List[Dict[str, Any]] | Yes | A list of message dictionaries, each containing "role" (one of "system", "user", "assistant", "tool") and "content" (the message text or structured content). |
| model | Optional[str] | No | Model identifier. If None, uses the engine's loaded model. |
| frequency_penalty | Optional[float] | No | Penalizes tokens based on their frequency in the text so far. Range: [-2.0, 2.0]. |
| presence_penalty | Optional[float] | No | Penalizes tokens based on whether they have appeared in the text so far. Range: [-2.0, 2.0]. |
| logprobs | bool | No | Whether to return log probabilities of output tokens. Defaults to False. |
| top_logprobs | int | No | Number of most likely tokens to return log probabilities for at each position. Requires logprobs=True. Defaults to 0. |
| logit_bias | Optional[Dict[int, float]] | No | A mapping from token IDs to bias values (-100 to 100) applied to logits before sampling. |
| max_tokens | Optional[int] | No | Maximum number of tokens to generate. If None, uses the model's default. |
| n | int | No | Number of chat completion choices to generate for each input message. Defaults to 1. |
| seed | Optional[int] | No | Random seed for reproducible generation. |
| stop | Optional[Union[str, List[str]]] | No | One or more sequences where the model will stop generating further tokens. |
| stream | bool | No | If True, returns an iterator of partial message deltas. Defaults to False. |
| stream_options | Optional[Dict[str, Any]] | No | Additional options for streaming (e.g., {"include_usage": True} to get usage stats in stream). |
| temperature | Optional[float] | No | Sampling temperature. Higher values increase randomness. Range: [0, 2]. |
| top_p | Optional[float] | No | Nucleus sampling threshold. Only tokens comprising the top p probability mass are considered. |
| tools | Optional[List[Dict[str, Any]]] | No | A list of tool definitions the model may call, each describing a function with name, description, and parameters. |
| tool_choice | Optional[Union[Literal["none", "auto"], Dict]] | No | Controls whether the model calls a tool: "none" disables, "auto" lets the model decide, or a dict to force a specific tool. |
| user | Optional[str] | No | A unique identifier representing the end-user for abuse monitoring. |
| response_format | Optional[Dict[str, Any]] | No | Constrains the output format (e.g., {"type": "json_object"}). |
| request_id | Optional[str] | No | An optional request identifier. If not provided, a random UUID prefixed with "chatcmpl-" is generated. |
| extra_body | Optional[Dict[str, Any]] | No | Extra body options, such as {"debug_config": {...}} for debugging. |
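The ranges and dependencies listed in the table can be checked on the caller's side before a request is sent. The sketch below is an illustration only: `check_sampling_params` is a hypothetical helper, not part of MLC-LLM, and the engine may apply its own validation with different rules or error messages.

```python
from typing import Any, List


def check_sampling_params(**params: Any) -> List[str]:
    """Return human-readable problems found; an empty list means none were found."""
    problems: List[str] = []

    def check_range(name: str, low: float, high: float) -> None:
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} is outside [{low}, {high}]")

    # Ranges as documented in the Inputs table.
    check_range("frequency_penalty", -2.0, 2.0)
    check_range("presence_penalty", -2.0, 2.0)
    check_range("temperature", 0.0, 2.0)

    # top_logprobs is only meaningful when logprobs is enabled.
    if (params.get("top_logprobs") or 0) > 0 and not params.get("logprobs", False):
        problems.append("top_logprobs > 0 requires logprobs=True")

    # Each logit bias value should stay within [-100, 100].
    for token_id, bias in (params.get("logit_bias") or {}).items():
        if not -100 <= bias <= 100:
            problems.append(f"logit_bias[{token_id}]={bias} is outside [-100, 100]")

    return problems


print(check_sampling_params(temperature=0.7, top_p=0.9))
print(check_sampling_params(temperature=3.0, top_logprobs=2))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.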
Outputs
| Name | Type | Description |
|---|---|---|
| response | ChatCompletionResponse | When stream=False: a complete response object containing choices (each with a message), usage statistics, and metadata. |
| stream_response | Iterator[ChatCompletionStreamResponse] | When stream=True: an iterator yielding partial response chunks, each containing a delta with incremental content. |
Usage Examples
Basic Usage
from mlc_llm.serve import MLCEngine
engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")
# Non-streaming chat completion
response = engine.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to compute Fibonacci numbers."},
    ],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
engine.terminate()
Streaming Usage
from mlc_llm.serve import MLCEngine
engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")
# Streaming chat completion
for chunk in engine.chat.completions.create(
    messages=[
        {"role": "user", "content": "Explain the theory of relativity."},
    ],
    stream=True,
    max_tokens=256,
    temperature=0.5,
    top_p=0.9,
):
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
engine.terminate()
Multi-Turn Conversation
from mlc_llm.serve import MLCEngine
engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")
conversation = [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is a derivative?"},
]
# First turn
response = engine.chat.completions.create(messages=conversation, max_tokens=200)
assistant_msg = response.choices[0].message.content
conversation.append({"role": "assistant", "content": assistant_msg})
# Second turn
conversation.append({"role": "user", "content": "Can you give an example?"})
response = engine.chat.completions.create(messages=conversation, max_tokens=200)
print(response.choices[0].message.content)
engine.terminate()
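The append-then-call pattern in the multi-turn example can be wrapped in a small helper so the history bookkeeping lives in one place. This is a sketch under stated assumptions: the `Conversation` class and `echo_backend` function are hypothetical, and the backend is injected as a callable so the code runs without a loaded model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keep the message history and append each user/assistant turn in order."""

    def __init__(self, complete: Callable[[List[Message]], str], system_prompt: str = "") -> None:
        self._complete = complete
        self.messages: List[Message] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text: str) -> str:
        """Send the full history plus the new user message; record and return the reply."""
        self.messages.append({"role": "user", "content": user_text})
        reply = self._complete(list(self.messages))
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def echo_backend(messages: List[Message]) -> str:
    """Stand-in backend that echoes the last user message instead of calling a model."""
    return f"(reply to: {messages[-1]['content']})"


chat = Conversation(echo_backend, system_prompt="You are a math tutor.")
chat.ask("What is a derivative?")
chat.ask("Can you give an example?")
print([m["role"] for m in chat.messages])
```

With a real engine, the injected callable could be a function that calls engine.chat.completions.create(messages=msgs, max_tokens=200) and returns response.choices[0].message.content.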