Implementation:Mlc ai Mlc llm Process chat completion request

Knowledge Sources	MLC-LLM
Domains	Deep_Learning, LLM_Inference
Last Updated	2026-02-09 00:00 GMT

Overview

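Concrete tool for preprocessing chat completion requests (validation, conversation templating, tokenization, and generation config construction) so that the engine can serve multiple inference requests concurrently with continuous batching, provided by MLC-LLM.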

Description

process_chat_completion_request is a module-level function in engine_base.py that serves as the preprocessing pipeline for chat completion requests. It takes a raw ChatCompletionRequest (conforming to the OpenAI API protocol) and transforms it into engine-ready data: tokenized prompts, a generation configuration, a function-calling flag, and the total prompt length.

The function performs the following steps in order:

1. Records the request event: Logs the request receipt in the engine state for tracing.
2. Validates unsupported fields: Checks that the request does not use parameters not yet supported by the engine.
3. Validates message structure: Ensures the message list has valid role orderings and content types.
4. Processes function calling: Detects tool/function call usage and updates the conversation template accordingly.
5. Populates the conversation template: Iterates over messages, setting the system message and appending user/assistant turns. Appends a final ("assistant", None) entry to prompt the model for its next response.
6. Tokenizes the prompt: Applies the conversation template to produce a formatted prompt, then tokenizes it using the provided tokenizer function.
7. Prepends system prefix tokens: If the conversation template defines system prefix token IDs, prepends them to the first prompt segment.
8. Validates prompt length: Checks that the total prompt length does not exceed max_input_sequence_length.
9. Constructs generation config: Builds the GenerationConfig from the request parameters, incorporating model-specific stop token IDs and stop strings from the conversation template.
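
The template population, system prefix, and prompt length steps above can be illustrated with a minimal, dependency-free sketch. The helper names (build_turns, prepend_system_prefix, check_prompt_length), the plain-dict message format, and the use of ValueError are all hypothetical and not part of MLC-LLM; the actual function works on ChatCompletionRequest and Conversation objects and may differ in detail. The sketch also assumes every prompt segment is a list of token IDs.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Turn = Tuple[str, Optional[str]]


def build_turns(messages: List[Dict[str, str]]) -> Tuple[str, List[Turn]]:
    """Split messages into a system message and a list of (role, content) turns."""
    system_message = ""
    turns: List[Turn] = []
    for message in messages:
        if message["role"] == "system":
            system_message = message["content"]
        else:
            turns.append((message["role"], message["content"]))
    # The trailing entry with no content asks the model to produce the next reply.
    turns.append(("assistant", None))
    return system_message, turns


def prepend_system_prefix(
    prompts: List[List[int]], system_prefix_token_ids: Optional[Sequence[int]]
) -> List[List[int]]:
    """Place the prefix token IDs in front of the first prompt segment."""
    if not system_prefix_token_ids or not prompts:
        return prompts
    return [list(system_prefix_token_ids) + prompts[0]] + prompts[1:]


def check_prompt_length(prompts: List[List[int]], max_input_sequence_length: int) -> int:
    """Return the total prompt length, rejecting prompts that are too long."""
    total = sum(len(segment) for segment in prompts)
    if total > max_input_sequence_length:
        # Assumption: the exception type here is illustrative only.
        raise ValueError(
            f"Prompt length {total} exceeds the limit {max_input_sequence_length}."
        )
    return total
```

Returning new lists instead of mutating the inputs keeps each helper easy to test in isolation; the real pipeline is free to update its data in place.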

This function is used by both MLCEngine (synchronous) and AsyncMLCEngine (asynchronous) as the shared request preprocessing stage.
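
As a rough illustration of why a module-level preprocessing function suits both engines, the sketch below shares one synchronous preprocessing function between a blocking wrapper and a coroutine-based wrapper. The names preprocess, SyncEngineSketch, and AsyncEngineSketch are hypothetical, and the character-code "tokenizer" is a placeholder; none of this reflects the internals of MLCEngine or AsyncMLCEngine.

```python
import asyncio
from typing import List, Tuple


def preprocess(text: str) -> Tuple[List[int], int]:
    """Toy shared stage: turn text into token IDs and report the prompt length."""
    token_ids = [ord(ch) for ch in text]  # placeholder tokenization
    return token_ids, len(token_ids)


class SyncEngineSketch:
    """Hypothetical blocking engine that calls the shared stage directly."""

    def chat(self, text: str) -> int:
        _, prompt_length = preprocess(text)
        return prompt_length


class AsyncEngineSketch:
    """Hypothetical async engine that reuses the same shared stage."""

    async def chat(self, text: str) -> int:
        _, prompt_length = preprocess(text)
        # Yield control once, standing in for awaiting generation results.
        await asyncio.sleep(0)
        return prompt_length
```

Because the shared stage holds no engine-specific state, both wrappers produce identical preprocessing results for the same input.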

Usage

This function is called internally by the engine's chat completion handlers. It is the bridge between the OpenAI-compatible API layer and the engine's internal token-level generation interface. Understanding it is essential for debugging request processing issues, extending the API protocol, or customizing conversation template behavior.

Code Reference

Source Location

Repository: MLC-LLM
File: python/mlc_llm/serve/engine_base.py (lines 682-775)

Signature

def process_chat_completion_request(
    request: openai_api_protocol.ChatCompletionRequest,
    request_id: str,
    engine_state: EngineState,
    model_config: Dict[str, Any],
    f_tokenize: Callable[[str], List[int]],
    max_input_sequence_length: int,
    conv_template: Conversation,
) -> Tuple[List[Union[List[int], data.Data]], GenerationConfig, bool, int]:

Import

from mlc_llm.serve.engine_base import process_chat_completion_request

I/O Contract

Inputs

Name	Type	Required	Description
request	`openai_api_protocol.ChatCompletionRequest`	Yes	The chat completion request object containing messages, sampling parameters, and other settings conforming to the OpenAI API protocol.
request_id	`str`	Yes	The unique identifier for this request, used for event tracing and logging.
engine_state	`EngineState`	Yes	The engine state object that records request events and manages tracing.
model_config	`Dict[str, Any]`	Yes	The model configuration dictionary, used when applying the conversation template to produce the formatted prompt.
f_tokenize	`Callable[[str], List[int]]`	Yes	The tokenizer encode function that converts a text string into a list of token IDs.
max_input_sequence_length	`int`	Yes	The maximum allowed total prompt length in tokens. Requests exceeding this length are rejected.
conv_template	`Conversation`	Yes	The conversation template for the model, defining prompt formatting rules, system prefix tokens, stop token IDs, and stop strings.
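
For experimentation or unit tests, any callable that matches the f_tokenize type can stand in for a real tokenizer. The following is a minimal sketch of such a stand-in; make_toy_tokenizer is a hypothetical helper, and its whitespace-based vocabulary bears no relation to the tokenizer of any actual model.

```python
from typing import Callable, Dict, List


def make_toy_tokenizer() -> Callable[[str], List[int]]:
    """Build a whitespace tokenizer that assigns IDs in order of first appearance."""
    vocab: Dict[str, int] = {}

    def f_tokenize(text: str) -> List[int]:
        token_ids: List[int] = []
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next unused ID
            token_ids.append(vocab[word])
        return token_ids

    return f_tokenize
```

The returned closure has the Callable[[str], List[int]] shape listed in the table above, so it could be passed wherever a tokenizer encode function is expected in a test harness.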

Outputs

Name	Type	Description
prompts	`List[Union[List[int], data.Data]]`	The processed and tokenized prompts. Each element is either a list of token IDs or a `data.Data` instance (for multimodal content).
generation_cfg	`GenerationConfig`	The generation configuration constructed from the request parameters, augmented with model-specific stop conditions.
use_function_calling	`bool`	A boolean flag indicating whether the request uses function/tool calling.
prompt_length	`int`	The total length of the tokenized prompt in tokens.
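
Because prompts mixes token ID lists with non-text items, callers typically need to distinguish the two kinds of segment. The sketch below shows one way to do that; PlaceholderData is a hypothetical stand-in rather than the real data.Data class, and it assumes that non-text items report their length through len().

```python
from typing import List, Tuple, Union


class PlaceholderData:
    """Hypothetical stand-in for a non-text prompt item; not the real data.Data."""

    def __init__(self, length: int) -> None:
        self.length = length

    def __len__(self) -> int:
        return self.length


Segment = Union[List[int], PlaceholderData]


def describe_prompts(prompts: List[Segment]) -> Tuple[int, int, int]:
    """Return (token segment count, non-token segment count, total length)."""
    token_segments = [p for p in prompts if isinstance(p, list)]
    other_segments = [p for p in prompts if not isinstance(p, list)]
    total_length = sum(len(p) for p in prompts)
    return len(token_segments), len(other_segments), total_length
```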

Usage Examples

Basic Usage

from mlc_llm.serve.engine_base import (
    EngineState,
    process_chat_completion_request,
)
from mlc_llm.protocol.openai_api_protocol import (
    ChatCompletionRequest,
    ChatCompletionMessage,
)
from mlc_llm.protocol.conversation_protocol import Conversation

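# Assume engine_state (EngineState), model_config (dict), tokenizer
# (an object whose encode method maps str to a list of token IDs), and
# conv_template (Conversation) are already initialized by the engine.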

request = ChatCompletionRequest(
    messages=[
        ChatCompletionMessage(role="system", content="You are a helpful assistant."),
        ChatCompletionMessage(role="user", content="Hello, how are you?"),
    ],
    temperature=0.7,
    max_tokens=256,
)

prompts, generation_cfg, use_function_calling, prompt_length = (
    process_chat_completion_request(
        request=request,
        request_id="chatcmpl-abc123",
        engine_state=engine_state,
        model_config=model_config,
        f_tokenize=tokenizer.encode,
        max_input_sequence_length=4096,
        conv_template=conv_template,
    )
)

print(f"Prompt length: {prompt_length} tokens")
print(f"Function calling: {use_function_calling}")
print(f"Generation config temperature: {generation_cfg.temperature}")

Usage with Tool Calling

from mlc_llm.protocol.openai_api_protocol import (
    ChatCompletionRequest,
    ChatCompletionMessage,
    ChatTool,
)

request = ChatCompletionRequest(
    messages=[
        ChatCompletionMessage(role="user", content="What is the weather in Paris?"),
    ],
    tools=[
        ChatTool.model_validate({
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string", "description": "City name"},
                    },
                    "required": ["location"],
                },
            },
        })
    ],
    tool_choice="auto",
    max_tokens=256,
)

prompts, generation_cfg, use_function_calling, prompt_length = (
    process_chat_completion_request(
        request=request,
        request_id="chatcmpl-tool-001",
        engine_state=engine_state,
        model_config=model_config,
        f_tokenize=tokenizer.encode,
        max_input_sequence_length=4096,
        conv_template=conv_template,
    )
)

# use_function_calling will be True when tools are provided
print(f"Function calling enabled: {use_function_calling}")

Related Pages

Implements Principle

Principle:Mlc_ai_Mlc_llm_Concurrent_Request_Handling
