Implementation: mlc-ai/mlc-llm `request_chat_completion`
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Serving, API_Design |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
The concrete MLC-LLM entrypoint that handles OpenAI-compatible chat completion requests.
Description
The `request_chat_completion` function is a FastAPI route handler registered at `POST /v1/chat/completions`. It implements the full OpenAI Chat Completions API contract, supporting both streaming and non-streaming responses, function calling (tool use), logprobs, and usage statistics.
The function follows this execution flow:
- Server Context Lookup: Retrieves the `ServerContext` singleton and checks whether the requested model is being served. If debug mode is disabled, strips `debug_config` from the request.
- Request ID Generation: Assigns a unique request ID. If the `user` field is set, it is used as the request ID (supporting distributed serving coordination). Otherwise, a UUID-based ID with a `"chatcmpl-"` prefix is generated.
- Streaming Path: If `request.stream` is `True`, the function calls `async_engine._handle_chat_completion()` to obtain an async generator. It eagerly fetches the first response to catch any immediate errors within the request scope (rather than in the `StreamingResponse` scope). It then wraps the generator in an SSE-formatted `StreamingResponse`, yielding each chunk as `data: {json}\n\n` and terminating with `data: [DONE]\n\n`.
- Non-Streaming Path: If `request.stream` is `False`, the function iterates over all stream outputs, accumulating `output_texts`, `finish_reasons`, `logprob_results`, and `usage` statistics. It checks for client disconnection on each iteration to abort gracefully. After collecting all outputs, it processes function call outputs (if applicable) and wraps everything in a `ChatCompletionResponse`.
- Function Call Post-Processing: For non-streaming responses, the accumulated output text is parsed for function calls using `engine_base.process_function_call_output()`, which uses Python AST parsing to extract structured tool calls.
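The request ID scheme and the SSE wire framing described above can be sketched in a few lines (the helper names here are illustrative, not the actual MLC-LLM internals):

```python
import json
import uuid
from typing import Optional


def make_request_id(user: Optional[str]) -> str:
    # If the "user" field is set, it becomes the request ID; otherwise a
    # UUID-based ID with the "chatcmpl-" prefix is generated.
    return user if user else f"chatcmpl-{uuid.uuid4().hex}"


def sse_frame(chunk: dict) -> str:
    # Each streamed chunk is serialized as JSON and framed as one SSE event.
    return f"data: {json.dumps(chunk)}\n\n"


# The stream terminates with a literal [DONE] sentinel event.
SSE_DONE = "data: [DONE]\n\n"
```

For example, `sse_frame({"id": "chatcmpl-abc"})` produces the event `data: {"id": "chatcmpl-abc"}` followed by a blank line, which is exactly what SSE clients expect.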
Usage
This function is not called directly by users. It is automatically invoked by FastAPI when a POST request is received at /v1/chat/completions. Clients interact with it via HTTP requests using the OpenAI SDK or any HTTP client library.
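Because the endpoint speaks plain HTTP, the request body can also be built by hand with the standard library; a minimal sketch (the model path is illustrative):

```python
import json

# Build the JSON body for POST /v1/chat/completions by hand.
payload = {
    "model": "dist/models/Llama-2-7b-chat-hf-q4f16_1",  # illustrative model path
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}
body = json.dumps(payload).encode("utf-8")

# Send with any HTTP client against a running server, e.g.:
#   urllib.request.Request("http://127.0.0.1:8000/v1/chat/completions",
#                          data=body,
#                          headers={"Content-Type": "application/json"})
```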
Code Reference
Source Location
- Repository: MLC-LLM
- File: `python/mlc_llm/serve/entrypoints/openai_entrypoints.py` (Lines 141-247)
Signature
```python
@app.post("/v1/chat/completions")
async def request_chat_completion(
    request: ChatCompletionRequest,
    raw_request: fastapi.Request,
) -> Union[fastapi.responses.StreamingResponse, ChatCompletionResponse]:
    """OpenAI-compatible chat completion API.

    API reference: https://platform.openai.com/docs/api-reference/chat
    """
```
Import
```python
# This is a FastAPI route handler, registered via the router:
from mlc_llm.serve.entrypoints.openai_entrypoints import app

# The router is included in the FastAPI application in serve.py:
# fastapi_app.include_router(openai_entrypoints.app)
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| `request` | `ChatCompletionRequest` | Yes | The OpenAI-compatible chat completion request body, automatically parsed by FastAPI from the JSON request body. Key fields include: `messages` (list of chat messages), `model` (model identifier), `stream` (bool), `temperature`, `top_p`, `max_tokens`, `n` (number of completions), `stop`, `tools`, `tool_choice`, `logprobs`, `top_logprobs`, `response_format`, `stream_options`, `seed`, `frequency_penalty`, `presence_penalty`. |
| `raw_request` | `fastapi.Request` | Yes | The raw FastAPI request object. Used to check for client disconnection during non-streaming responses via `raw_request.is_disconnected()`. |
Outputs
| Name | Type | Description |
|---|---|---|
| response (streaming) | `fastapi.responses.StreamingResponse` | When `stream=True`: an SSE stream of `ChatCompletionStreamResponse` objects serialized as JSON, terminated by `data: [DONE]`. Content type is `text/event-stream`. |
| response (non-streaming) | `ChatCompletionResponse` | When `stream=False`: a single JSON response containing `choices` (with `message`, `finish_reason`, optional `logprobs`, optional `tool_calls`), `usage` (`prompt_tokens`, `completion_tokens`, `total_tokens`), `model`, and `id`. |
| error response | JSON error | Returned with HTTP 400 status when the requested model is not served or the client disconnects during processing. |
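The non-streaming path's disconnect-aware accumulation loop can be sketched roughly as follows; the engine generator and the disconnect check are stubbed here, and field names are illustrative rather than MLC-LLM's actual internals:

```python
import asyncio


async def collect_outputs(stream, is_disconnected):
    """Accumulate streamed deltas into one completion, aborting if the client
    disconnects (mirrors the raw_request.is_disconnected() check per iteration)."""
    output_text = ""
    finish_reason = None
    async for delta in stream:
        if await is_disconnected():
            # Client went away: abort and let the caller return a 400 error.
            return None
        output_text += delta.get("text", "")
        finish_reason = delta.get("finish_reason", finish_reason)
    return {"text": output_text, "finish_reason": finish_reason}


async def _demo():
    async def fake_stream():
        # Stand-in for the engine's async generator of stream outputs.
        for piece in ({"text": "Hello"}, {"text": ", world", "finish_reason": "stop"}):
            yield piece

    async def never_disconnects():
        return False

    return await collect_outputs(fake_stream(), never_disconnects)


result = asyncio.run(_demo())
# result == {"text": "Hello, world", "finish_reason": "stop"}
```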
Usage Examples
Basic Usage with OpenAI SDK
```python
from openai import OpenAI

# Point the OpenAI client at the local MLC-LLM server
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="not-needed",
)

# Non-streaming chat completion
response = client.chat.completions.create(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```
Streaming Response
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="not-needed",
)

# Streaming chat completion
stream = client.chat.completions.create(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    messages=[
        {"role": "user", "content": "Write a short poem about coding."},
    ],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\nUsage: {chunk.usage}")
```
Function Calling (Tool Use)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="not-needed",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name.",
                    },
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    messages=[
        {"role": "user", "content": "What is the weather in Paris?"},
    ],
    tools=tools,
    tool_choice="auto",
)

if response.choices[0].finish_reason == "tool_calls":
    for tool_call in response.choices[0].message.tool_calls:
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")
```
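For intuition, the AST-based extraction that `engine_base.process_function_call_output()` performs on the model's raw output can be approximated like this (a simplified sketch, not the actual implementation):

```python
import ast


def parse_tool_call(text: str):
    """Parse model output shaped like get_weather(location="Paris") into a
    structured call. Simplified sketch of AST-based tool-call extraction."""
    try:
        tree = ast.parse(text.strip(), mode="eval")
    except SyntaxError:
        return None
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return None
    # literal_eval on each keyword value keeps parsing safe (no execution).
    args = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return {"name": call.func.id, "arguments": args}


print(parse_tool_call('get_weather(location="Paris")'))
# {'name': 'get_weather', 'arguments': {'location': 'Paris'}}
```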
Direct HTTP Request with curl
```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dist/models/Llama-2-7b-chat-hf-q4f16_1",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 100,
    "stream": false
  }'
```