Implementation:Mlc ai Mlc llm ChatCompletion Create

From Leeroopedia
Revision as of 15:48, 16 February 2026 by Admin (Auto-imported from implementations/Mlc_ai_Mlc_llm_ChatCompletion_Create.md)


Knowledge Sources
Domains: Deep_Learning, LLM_Inference
Last Updated: 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by MLC-LLM, that exposes an OpenAI-compatible chat completion interface for multi-turn conversations with system, user, and assistant message roles.

Description

ChatCompletion.create is the synchronous chat completion method exposed through MLCEngine.chat.completions.create(). It accepts a list of conversation messages and generation parameters, then delegates to the engine's internal _chat_completion method. From there, the engine constructs a ChatCompletionRequest protocol object, applies the model's conversation template to format the prompt, tokenizes the input, and invokes the underlying generation engine. When stream is False, the call returns a complete ChatCompletionResponse; when stream is True, it returns an Iterator of ChatCompletionStreamResponse chunks.
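
The stream-dependent return type can be pictured with a small stand-in. The sketch below is not MLC-LLM code: sketch_create and fake_generate are hypothetical names, the "template" is a plain string join, and "generation" simply echoes words. It only illustrates how one entry point can hand back either a complete result or a lazy iterator of chunks.

```python
from typing import Any, Dict, Iterator, List, Union


def fake_generate(prompt: str) -> List[str]:
    """Hypothetical stand-in for the generation engine: echoes the prompt's words."""
    return prompt.split()


def sketch_create(
    messages: List[Dict[str, Any]], stream: bool = False
) -> Union[Dict[str, str], Iterator[Dict[str, str]]]:
    # Stand-in for applying a conversation template: join "role: content" lines.
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    pieces = fake_generate(prompt)

    if stream:
        # An inner generator keeps sketch_create itself a normal function, so the
        # non-streaming branch can still return a plain value.
        def chunks() -> Iterator[Dict[str, str]]:
            for piece in pieces:
                yield {"delta": piece}

        return chunks()

    return {"content": " ".join(pieces)}


messages = [{"role": "user", "content": "hello there"}]
full = sketch_create(messages)
streamed = list(sketch_create(messages, stream=True))
```

The inner generator is the design point: a function whose own body contains yield always returns a generator, so the streaming path is wrapped separately.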

The interface mirrors the OpenAI Chat Completion API specification, including support for function/tool calling, logprob reporting, logit bias injection, and structured response formats.
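
As a rough illustration of those OpenAI-style options, the sketch below assembles keyword arguments for a request that offers one tool and asks for log probabilities. The tool itself (get_weather), its schema, and the token ID in logit_bias are made up for the example; the tool layout follows the OpenAI function-tool convention and is assumed to be what the engine accepts.

```python
from typing import Any, Dict

# Hypothetical tool: the name, description, and schema are illustrative only.
weather_tool: Dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

request_kwargs: Dict[str, Any] = {
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",          # let the model decide whether to call the tool
    "logprobs": True,
    "top_logprobs": 2,              # only meaningful when logprobs=True
    "logit_bias": {1234: -100.0},   # token ID 1234 is a made-up example
}

# With a loaded engine, these would be passed as:
#   response = engine.chat.completions.create(**request_kwargs)
```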

Usage

Use ChatCompletion.create when performing synchronous chat-based inference with MLCEngine. This is the primary interface for generating assistant responses from a sequence of conversation messages. Set stream=True for incremental output delivery or leave it as False (default) for a single complete response.

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/serve/engine.py (lines 369-442)

Signature

def create(
    self,
    *,
    messages: List[Dict[str, Any]],
    model: Optional[str] = None,
    frequency_penalty: Optional[float] = None,
    presence_penalty: Optional[float] = None,
    logprobs: bool = False,
    top_logprobs: int = 0,
    logit_bias: Optional[Dict[int, float]] = None,
    max_tokens: Optional[int] = None,
    n: int = 1,
    seed: Optional[int] = None,
    stop: Optional[Union[str, List[str]]] = None,
    stream: bool = False,
    stream_options: Optional[Dict[str, Any]] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    tools: Optional[List[Dict[str, Any]]] = None,
    tool_choice: Optional[Union[Literal["none", "auto"], Dict]] = None,
    user: Optional[str] = None,
    response_format: Optional[Dict[str, Any]] = None,
    request_id: Optional[str] = None,
    extra_body: Optional[Dict[str, Any]] = None,
) -> Union[
    Iterator[openai_api_protocol.ChatCompletionStreamResponse],
    openai_api_protocol.ChatCompletionResponse,
]:

Import

from mlc_llm.serve import MLCEngine

# Access via engine instance:
engine = MLCEngine(model="path/to/model")
engine.chat.completions.create(...)
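
The usage examples on this page call engine.terminate() when they are done. One way to make sure that call also happens when an exception is raised is a small context manager. This is a sketch, not part of MLC-LLM: terminating and FakeEngine are hypothetical names, and FakeEngine exists only so the snippet runs without a model.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def terminating(engine: Any) -> Iterator[Any]:
    """Yield the engine and always call engine.terminate() afterwards."""
    try:
        yield engine
    finally:
        engine.terminate()


class FakeEngine:
    """Stand-in so the sketch runs without a model; not part of MLC-LLM."""

    def __init__(self) -> None:
        self.terminated = False

    def terminate(self) -> None:
        self.terminated = True


fake = FakeEngine()
with terminating(fake) as eng:
    pass  # with a real engine: eng.chat.completions.create(messages=[...])
```

With a real engine, the same pattern would read: with terminating(MLCEngine(model="path/to/model")) as engine.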

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| messages | List[Dict[str, Any]] | Yes | A list of message dictionaries, each containing "role" (one of "system", "user", "assistant", "tool") and "content" (the message text or structured content). |
| model | Optional[str] | No | Model identifier. If None, uses the engine's loaded model. |
| frequency_penalty | Optional[float] | No | Penalizes tokens based on their frequency in the text so far. Range: [-2.0, 2.0]. |
| presence_penalty | Optional[float] | No | Penalizes tokens based on whether they have appeared in the text so far. Range: [-2.0, 2.0]. |
| logprobs | bool | No | Whether to return log probabilities of output tokens. Defaults to False. |
| top_logprobs | int | No | Number of most likely tokens to return log probabilities for at each position. Requires logprobs=True. Defaults to 0. |
| logit_bias | Optional[Dict[int, float]] | No | A mapping from token IDs to bias values (-100 to 100) applied to logits before sampling. |
| max_tokens | Optional[int] | No | Maximum number of tokens to generate. If None, uses the model's default. |
| n | int | No | Number of chat completion choices to generate for each input message. Defaults to 1. |
| seed | Optional[int] | No | Random seed for reproducible generation. |
| stop | Optional[Union[str, List[str]]] | No | One or more sequences at which the model stops generating further tokens. |
| stream | bool | No | If True, returns an iterator of partial message deltas. Defaults to False. |
| stream_options | Optional[Dict[str, Any]] | No | Additional options for streaming (e.g., {"include_usage": True} to get usage stats in the stream). |
| temperature | Optional[float] | No | Sampling temperature. Higher values increase randomness. Range: [0, 2]. |
| top_p | Optional[float] | No | Nucleus sampling threshold. Only tokens comprising the top p probability mass are considered. |
| tools | Optional[List[Dict[str, Any]]] | No | A list of tool definitions the model may call, each describing a function with name, description, and parameters. |
| tool_choice | Optional[Union[Literal["none", "auto"], Dict]] | No | Controls whether the model calls a tool: "none" disables tool calls, "auto" lets the model decide, and a dict forces a specific tool. |
| user | Optional[str] | No | A unique identifier representing the end-user for abuse monitoring. |
| response_format | Optional[Dict[str, Any]] | No | Constrains the output format (e.g., {"type": "json_object"}). |
| request_id | Optional[str] | No | An optional request identifier. If not provided, a random UUID prefixed with "chatcmpl-" is generated. |
| extra_body | Optional[Dict[str, Any]] | No | Extra body options, such as {"debug_config": {...}} for debugging. |
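
The documented ranges above can be checked on the caller's side before a request is sent. The helper below is a hypothetical pre-flight check (check_params is not an MLC-LLM function), and whether the engine enforces the same limits itself is not stated here. It covers only the constraints listed in the table.

```python
from typing import Any, Dict, List


def check_params(params: Dict[str, Any]) -> List[str]:
    """Return human-readable problems for the documented parameter ranges."""
    problems: List[str] = []

    # Both penalties share the documented range [-2.0, 2.0].
    for name in ("frequency_penalty", "presence_penalty"):
        value = params.get(name)
        if value is not None and not -2.0 <= value <= 2.0:
            problems.append(f"{name} must be in [-2.0, 2.0]")

    temperature = params.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        problems.append("temperature must be in [0, 2]")

    # top_logprobs is documented as requiring logprobs=True.
    if params.get("top_logprobs", 0) > 0 and not params.get("logprobs", False):
        problems.append("top_logprobs requires logprobs=True")

    # Each logit bias value is documented as lying between -100 and 100.
    for token_id, bias in (params.get("logit_bias") or {}).items():
        if not -100 <= bias <= 100:
            problems.append(f"logit_bias[{token_id}] must be in [-100, 100]")

    return problems


ok = check_params({"temperature": 0.7, "presence_penalty": 0.5})
bad = check_params({"temperature": 3.0, "top_logprobs": 2})
```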

Outputs

| Name | Type | Description |
| --- | --- | --- |
| response | ChatCompletionResponse | When stream=False: a complete response object containing choices (each with a message), usage statistics, and metadata. |
| stream_response | Iterator[ChatCompletionStreamResponse] | When stream=True: an iterator yielding partial response chunks, each containing a delta with incremental content. |
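
Because the return type depends on stream, calling code sometimes wants a single helper that yields the final text either way. The sketch below assumes the attribute layout used in this page's examples (choices[0].message.content and choices[0].delta.content) and a single choice (n=1). completion_text is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real responses. The skip over chunks with no choices is a defensive assumption, not documented behavior.

```python
from types import SimpleNamespace
from typing import Any, List


def completion_text(result: Any) -> str:
    """Return the generated text from a full response or from a chunk iterator."""
    if hasattr(result, "choices"):
        # Non-streaming: one response object carrying the full message.
        return result.choices[0].message.content

    parts: List[str] = []
    for chunk in result:
        if not chunk.choices:  # defensive: tolerate chunks that carry no choices
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)


def _chunk(text: Any) -> SimpleNamespace:
    # Stand-in for a ChatCompletionStreamResponse chunk with one choice.
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Hi there"))]
)
fake_stream = iter([_chunk("Hi"), _chunk(" there"), _chunk(None)])

text_from_response = completion_text(fake_response)
text_from_stream = completion_text(fake_stream)
```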

Usage Examples

Basic Usage

from mlc_llm.serve import MLCEngine

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

# Non-streaming chat completion
response = engine.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to compute Fibonacci numbers."},
    ],
    max_tokens=512,
    temperature=0.7,
)

print(response.choices[0].message.content)
engine.terminate()

Streaming Usage

from mlc_llm.serve import MLCEngine

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

# Streaming chat completion
for chunk in engine.chat.completions.create(
    messages=[
        {"role": "user", "content": "Explain the theory of relativity."},
    ],
    stream=True,
    max_tokens=256,
    temperature=0.5,
    top_p=0.9,
):
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()

engine.terminate()

Multi-Turn Conversation

from mlc_llm.serve import MLCEngine

engine = MLCEngine(model="dist/Llama-2-7b-chat-hf-q4f16_1-MLC")

conversation = [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is a derivative?"},
]

# First turn
response = engine.chat.completions.create(messages=conversation, max_tokens=200)
assistant_msg = response.choices[0].message.content
conversation.append({"role": "assistant", "content": assistant_msg})

# Second turn
conversation.append({"role": "user", "content": "Can you give an example?"})
response = engine.chat.completions.create(messages=conversation, max_tokens=200)
print(response.choices[0].message.content)

engine.terminate()
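
The append-then-call pattern above can be wrapped in a small helper so that history bookkeeping lives in one place. This is a sketch under stated assumptions: ChatSession and fake_complete are hypothetical names, and the completion function is injected so the snippet runs without a model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ChatSession:
    """Keeps the running message list and appends each turn in order."""

    def __init__(self, complete: Callable[[List[Message]], str], system: str = "") -> None:
        self._complete = complete
        self.messages: List[Message] = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, user_text: str) -> str:
        # Record the user turn, get a reply, then record the assistant turn.
        self.messages.append({"role": "user", "content": user_text})
        reply = self._complete(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_complete(messages: List[Message]) -> str:
    # Stand-in for a model call: echoes the latest user message.
    return f"reply to: {messages[-1]['content']}"


session = ChatSession(fake_complete, system="You are a math tutor.")
first = session.ask("What is a derivative?")
second = session.ask("Can you give an example?")
```

With a real engine, the injected function could be a lambda that calls engine.chat.completions.create(messages=msgs, max_tokens=200) and returns choices[0].message.content.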

Related Pages

Implements Principle
