
Implementation:Mlc ai Mlc llm Request chat completion

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Serving, API_Design
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tool, provided by MLC-LLM, for handling OpenAI-compatible chat completion requests.

Description

The request_chat_completion function is a FastAPI route handler registered at POST /v1/chat/completions. It implements the full OpenAI Chat Completions API contract, supporting both streaming and non-streaming responses, function calling (tool use), logprobs, and usage statistics.

The function follows this execution flow:

  1. Server Context Lookup: Retrieves the ServerContext singleton and checks whether the requested model is being served. If debug mode is disabled, strips debug_config from the request.
  2. Request ID Generation: Assigns a unique request ID. If the user field is set, it is used as the request ID (supporting distributed serving coordination). Otherwise, a UUID-based ID with "chatcmpl-" prefix is generated.
  3. Streaming Path: If request.stream is True, the function calls async_engine._handle_chat_completion() to obtain an async generator. It eagerly fetches the first response to catch any immediate errors within the request scope (rather than in the StreamingResponse scope). It then wraps the generator in an SSE-formatted StreamingResponse, yielding each chunk as data: {json}\n\n and terminating with data: [DONE]\n\n.
  4. Non-Streaming Path: If request.stream is False, the function iterates over all stream outputs, accumulating output_texts, finish_reasons, logprob_results, and usage statistics. It checks for client disconnection on each iteration to abort gracefully. After collecting all outputs, it processes function call outputs (if applicable) and wraps everything in a ChatCompletionResponse.
  5. Function Call Post-Processing: For non-streaming responses, the accumulated output text is parsed for function calls using engine_base.process_function_call_output(), which uses Python AST parsing to extract structured tool calls.
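The numbered flow above can be condensed into a small, self-contained sketch. This is pure Python with no FastAPI and the helper names are illustrative; only the "chatcmpl-" prefix, the SSE framing, and the text accumulation come from the description above:

```python
import uuid

def make_request_id(user):
    # Step 2: reuse the user field as the request ID when present
    # (distributed serving coordination); otherwise mint a UUID-based
    # ID with the "chatcmpl-" prefix.
    return user if user is not None else f"chatcmpl-{uuid.uuid4().hex}"

def sse_events(json_chunks):
    # Step 3: frame each serialized chunk as an SSE event and
    # terminate the stream with the [DONE] sentinel.
    for chunk in json_chunks:
        yield f"data: {chunk}\n\n"
    yield "data: [DONE]\n\n"

def accumulate(stream_output_texts):
    # Step 4: the non-streaming path concatenates the per-iteration
    # text deltas into the final message content.
    return "".join(stream_output_texts)
```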

Usage

This function is not called directly by users. It is automatically invoked by FastAPI when a POST request is received at /v1/chat/completions. Clients interact with it via HTTP requests using the OpenAI SDK or any HTTP client library.

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/serve/entrypoints/openai_entrypoints.py (Lines 141-247)

Signature

@app.post("/v1/chat/completions")
async def request_chat_completion(
    request: ChatCompletionRequest,
    raw_request: fastapi.Request,
) -> Union[fastapi.responses.StreamingResponse, ChatCompletionResponse]:
    """OpenAI-compatible chat completion API.
    API reference: https://platform.openai.com/docs/api-reference/chat
    """

Import

# This is a FastAPI route handler, registered via the router:
from mlc_llm.serve.entrypoints.openai_entrypoints import app

# The router is included in the FastAPI application in serve.py:
# fastapi_app.include_router(openai_entrypoints.app)

I/O Contract

Inputs

Name Type Required Description
request ChatCompletionRequest Yes The OpenAI-compatible chat completion request body, automatically parsed by FastAPI from the JSON request body. Key fields include: messages (list of chat messages), model (model identifier), stream (bool), temperature, top_p, max_tokens, n (number of completions), stop, tools, tool_choice, logprobs, top_logprobs, response_format, stream_options, seed, frequency_penalty, presence_penalty.
raw_request fastapi.Request Yes The raw FastAPI request object. Used to check for client disconnection during non-streaming responses via raw_request.is_disconnected().
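The per-iteration disconnection check can be sketched as follows. FakeRequest is a stand-in for fastapi.Request (only the is_disconnected() coroutine shape is assumed); collect mirrors the non-streaming accumulation loop described above:

```python
import asyncio

class FakeRequest:
    """Stand-in for fastapi.Request exposing is_disconnected()."""
    def __init__(self, disconnect_after):
        self.calls = 0
        self.disconnect_after = disconnect_after

    async def is_disconnected(self):
        self.calls += 1
        return self.calls > self.disconnect_after

async def collect(raw_request, stream_outputs):
    # Poll for client disconnection on each iteration, as the
    # non-streaming path does, and abort instead of finishing
    # generation for a client that has gone away.
    texts = []
    for out in stream_outputs:
        if await raw_request.is_disconnected():
            return None  # caller maps this to an HTTP 400 error response
        texts.append(out)
    return "".join(texts)
```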

Outputs

Name Type Description
response (streaming) fastapi.responses.StreamingResponse When stream=True: an SSE stream of ChatCompletionStreamResponse objects serialized as JSON, terminated by data: [DONE]. Content type is text/event-stream.
response (non-streaming) ChatCompletionResponse When stream=False: a single JSON response containing choices (with message, finish_reason, optional logprobs, optional tool_calls), usage (prompt_tokens, completion_tokens, total_tokens), model, and id.
error response JSON error Returned with HTTP 400 status when the requested model is not served or the client disconnects during processing.
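On the wire, a streaming response is a series of data: lines separated by blank lines, ending with the [DONE] sentinel. The snippet below parses a hypothetical captured SSE body (chunk fields are abbreviated; real chunks carry the full ChatCompletionStreamResponse schema):

```python
import json

# Hypothetical captured SSE body with abbreviated chunk payloads.
raw_body = (
    'data: {"id": "chatcmpl-abc", "choices": [{"delta": {"content": "Hi"}}]}\n\n'
    'data: {"id": "chatcmpl-abc", "choices": [{"delta": {"content": "!"}}]}\n\n'
    "data: [DONE]\n\n"
)

def parse_sse(body):
    # Each event is a "data: ..." line followed by a blank line; the
    # stream ends at the [DONE] sentinel rather than a JSON payload.
    for event in body.split("\n\n"):
        if not event.startswith("data: "):
            continue
        payload = event[len("data: "):]
        if payload == "[DONE]":
            break
        yield json.loads(payload)

text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse(raw_body))
```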

Usage Examples

Basic Usage with OpenAI SDK

from openai import OpenAI

# Point the OpenAI client at the local MLC-LLM server
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="not-needed",
)

# Non-streaming chat completion
response = client.chat.completions.create(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)

Streaming Response

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="not-needed",
)

# Streaming chat completion
stream = client.chat.completions.create(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    messages=[
        {"role": "user", "content": "Write a short poem about coding."},
    ],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\nUsage: {chunk.usage}")

Function Calling (Tool Use)

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="not-needed",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name.",
                    },
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    messages=[
        {"role": "user", "content": "What is the weather in Paris?"},
    ],
    tools=tools,
    tool_choice="auto",
)

if response.choices[0].finish_reason == "tool_calls":
    for tool_call in response.choices[0].message.tool_calls:
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")

Direct HTTP Request with curl

curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "dist/models/Llama-2-7b-chat-hf-q4f16_1",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"}
        ],
        "temperature": 0.7,
        "max_tokens": 100,
        "stream": false
    }'
