
Implementation:Mlc ai Mlc llm Request chat completion

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Serving, API_Design
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tool, provided by MLC-LLM, for handling OpenAI-compatible chat completion requests.

Description

The request_chat_completion function is a FastAPI route handler registered at POST /v1/chat/completions. It implements the full OpenAI Chat Completions API contract, supporting both streaming and non-streaming responses, function calling (tool use), logprobs, and usage statistics.

The function follows this execution flow:

  1. Server Context Lookup: Retrieves the ServerContext singleton and checks whether the requested model is being served. If debug mode is disabled, strips debug_config from the request.
  2. Request ID Generation: Assigns a unique request ID. If the user field is set, it is used as the request ID (supporting distributed serving coordination). Otherwise, a UUID-based ID with "chatcmpl-" prefix is generated.
  3. Streaming Path: If request.stream is True, the function calls async_engine._handle_chat_completion() to obtain an async generator. It eagerly fetches the first response to catch any immediate errors within the request scope (rather than in the StreamingResponse scope). It then wraps the generator in an SSE-formatted StreamingResponse, yielding each chunk as data: {json}\n\n and terminating with data: [DONE]\n\n.
  4. Non-Streaming Path: If request.stream is False, the function iterates over all stream outputs, accumulating output_texts, finish_reasons, logprob_results, and usage statistics. It checks for client disconnection on each iteration to abort gracefully. After collecting all outputs, it processes function call outputs (if applicable) and wraps everything in a ChatCompletionResponse.
  5. Function Call Post-Processing: For non-streaming responses, the accumulated output text is parsed for function calls using engine_base.process_function_call_output(), which uses Python AST parsing to extract structured tool calls.
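The numbered flow above can be condensed into a small, self-contained sketch. This is pure Python with no FastAPI and the helper names are illustrative; only the "chatcmpl-" prefix, the SSE framing, and the text accumulation come from the description above:

```python
import uuid

def make_request_id(user):
    # Step 2: reuse the user field as the request ID when present
    # (distributed serving coordination); otherwise mint a UUID-based
    # ID with the "chatcmpl-" prefix.
    return user if user is not None else f"chatcmpl-{uuid.uuid4().hex}"

def sse_events(json_chunks):
    # Step 3: frame each serialized chunk as an SSE event and
    # terminate the stream with the [DONE] sentinel.
    for chunk in json_chunks:
        yield f"data: {chunk}\n\n"
    yield "data: [DONE]\n\n"

def accumulate(stream_output_texts):
    # Step 4: the non-streaming path concatenates the per-iteration
    # text deltas into the final message content.
    return "".join(stream_output_texts)
```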

Usage

This function is not called directly by users. It is automatically invoked by FastAPI when a POST request is received at /v1/chat/completions. Clients interact with it via HTTP requests using the OpenAI SDK or any HTTP client library.

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/serve/entrypoints/openai_entrypoints.py (Lines 141-247)

Signature

@app.post("/v1/chat/completions")
async def request_chat_completion(
    request: ChatCompletionRequest,
    raw_request: fastapi.Request,
) -> Union[fastapi.responses.StreamingResponse, ChatCompletionResponse]:
    """OpenAI-compatible chat completion API.
    API reference: https://platform.openai.com/docs/api-reference/chat
    """

Import

# This is a FastAPI route handler, registered via the router:
from mlc_llm.serve.entrypoints.openai_entrypoints import app

# The router is included in the FastAPI application in serve.py:
# fastapi_app.include_router(openai_entrypoints.app)

I/O Contract

Inputs

Name Type Required Description
request ChatCompletionRequest Yes The OpenAI-compatible chat completion request body, automatically parsed by FastAPI from the JSON request body. Key fields include: messages (list of chat messages), model (model identifier), stream (bool), temperature, top_p, max_tokens, n (number of completions), stop, tools, tool_choice, logprobs, top_logprobs, response_format, stream_options, seed, frequency_penalty, presence_penalty.
raw_request fastapi.Request Yes The raw FastAPI request object. Used to check for client disconnection during non-streaming responses via raw_request.is_disconnected().
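The per-iteration disconnection check can be sketched as follows. FakeRequest is a stand-in for fastapi.Request (only the is_disconnected() coroutine shape is assumed); collect mirrors the non-streaming accumulation loop described above:

```python
import asyncio

class FakeRequest:
    """Stand-in for fastapi.Request exposing is_disconnected()."""
    def __init__(self, disconnect_after):
        self.calls = 0
        self.disconnect_after = disconnect_after

    async def is_disconnected(self):
        self.calls += 1
        return self.calls > self.disconnect_after

async def collect(raw_request, stream_outputs):
    # Poll for client disconnection on each iteration, as the
    # non-streaming path does, and abort instead of finishing
    # generation for a client that has gone away.
    texts = []
    for out in stream_outputs:
        if await raw_request.is_disconnected():
            return None  # caller maps this to an HTTP 400 error response
        texts.append(out)
    return "".join(texts)
```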

Outputs

Name Type Description
response (streaming) fastapi.responses.StreamingResponse When stream=True: an SSE stream of ChatCompletionStreamResponse objects serialized as JSON, terminated by data: [DONE]. Content type is text/event-stream.
response (non-streaming) ChatCompletionResponse When stream=False: a single JSON response containing choices (with message, finish_reason, optional logprobs, optional tool_calls), usage (prompt_tokens, completion_tokens, total_tokens), model, and id.
error response JSON error Returned with HTTP 400 status when the requested model is not served or the client disconnects during processing.
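On the wire, a streaming response is a series of data: lines separated by blank lines, ending with the [DONE] sentinel. The snippet below parses a hypothetical captured SSE body (chunk fields are abbreviated; real chunks carry the full ChatCompletionStreamResponse schema):

```python
import json

# Hypothetical captured SSE body with abbreviated chunk payloads.
raw_body = (
    'data: {"id": "chatcmpl-abc", "choices": [{"delta": {"content": "Hi"}}]}\n\n'
    'data: {"id": "chatcmpl-abc", "choices": [{"delta": {"content": "!"}}]}\n\n'
    "data: [DONE]\n\n"
)

def parse_sse(body):
    # Each event is a "data: ..." line followed by a blank line; the
    # stream ends at the [DONE] sentinel rather than a JSON payload.
    for event in body.split("\n\n"):
        if not event.startswith("data: "):
            continue
        payload = event[len("data: "):]
        if payload == "[DONE]":
            break
        yield json.loads(payload)

text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse(raw_body))
```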

Usage Examples

Basic Usage with OpenAI SDK

from openai import OpenAI

# Point the OpenAI client at the local MLC-LLM server
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="not-needed",
)

# Non-streaming chat completion
response = client.chat.completions.create(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)

Streaming Response

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="not-needed",
)

# Streaming chat completion
stream = client.chat.completions.create(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    messages=[
        {"role": "user", "content": "Write a short poem about coding."},
    ],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\nUsage: {chunk.usage}")

Function Calling (Tool Use)

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="not-needed",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name.",
                    },
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    messages=[
        {"role": "user", "content": "What is the weather in Paris?"},
    ],
    tools=tools,
    tool_choice="auto",
)

if response.choices[0].finish_reason == "tool_calls":
    for tool_call in response.choices[0].message.tool_calls:
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")

Direct HTTP Request with curl

curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "dist/models/Llama-2-7b-chat-hf-q4f16_1",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"}
        ],
        "temperature": 0.7,
        "max_tokens": 100,
        "stream": false
    }'
