Implementation:Mlc ai Mlc llm Process chat completion request
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, LLM_Inference |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for preprocessing chat completion requests before they enter the engine's continuous-batching loop, where multiple inference requests are processed concurrently; provided by MLC-LLM.
Description
process_chat_completion_request is a module-level function in engine_base.py that serves as the preprocessing pipeline for chat completion requests. It takes a raw ChatCompletionRequest (conforming to the OpenAI API protocol) and transforms it into engine-ready data: tokenized prompts, a generation configuration, a function-calling flag, and the total prompt length.
The function performs the following steps in order:
- Records the request event: Logs the request receipt in the engine state for tracing.
- Validates unsupported fields: Checks that the request does not use parameters not yet supported by the engine.
- Validates message structure: Ensures the message list has valid role orderings and content types.
- Processes function calling: Detects tool/function call usage and updates the conversation template accordingly.
- Populates the conversation template: Iterates over messages, setting the system message and appending user/assistant turns. Appends a final ("assistant", None) entry to prompt the model for its next response.
- Tokenizes the prompt: Applies the conversation template to produce a formatted prompt, then tokenizes it using the provided tokenizer function.
- Prepends system prefix tokens: If the conversation template defines system prefix token IDs, prepends them to the first prompt segment.
- Validates prompt length: Checks that the total prompt length does not exceed max_input_sequence_length.
- Constructs generation config: Builds the GenerationConfig from the request parameters, incorporating model-specific stop token IDs and stop strings from the conversation template.
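The template population, tokenization, and system prefix steps above can be sketched with stand-in types. This is a minimal illustration under stated assumptions, not MLC-LLM's implementation: ToyConversation, toy_tokenize, and build_prompt_tokens are hypothetical names, and the prompt format is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ToyConversation:
    """Hypothetical stand-in for a conversation template."""

    system_message: str = ""
    system_prefix_token_ids: Optional[List[int]] = None
    messages: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def as_prompt(self) -> str:
        # Invented format: one "role: content" line per turn. An open turn
        # (content is None) is rendered as the bare role tag.
        lines = [f"system: {self.system_message}"] if self.system_message else []
        for role, content in self.messages:
            lines.append(f"{role}:" if content is None else f"{role}: {content}")
        return "\n".join(lines)


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical tokenizer: one token ID per whitespace-separated word.
    return [len(word) for word in text.split()]


def build_prompt_tokens(
    conv: ToyConversation,
    chat: List[Tuple[str, str]],
    f_tokenize: Callable[[str], List[int]],
) -> List[int]:
    # Route the system message to the template; append all other turns.
    for role, content in chat:
        if role == "system":
            conv.system_message = content
        else:
            conv.messages.append((role, content))
    # Open assistant turn so the model generates the next reply.
    conv.messages.append(("assistant", None))
    tokens = f_tokenize(conv.as_prompt())
    # Prepend the template's system prefix tokens, if it defines any.
    if conv.system_prefix_token_ids is not None:
        tokens = list(conv.system_prefix_token_ids) + tokens
    return tokens


conv = ToyConversation(system_prefix_token_ids=[1])
tokens = build_prompt_tokens(
    conv,
    [("system", "Be brief."), ("user", "Hello there")],
    toy_tokenize,
)
```

The trailing ("assistant", None) entry is what leaves the prompt ending on an open assistant turn, so generation continues from the point where the reply should begin.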
This function is used by both MLCEngine (synchronous) and AsyncMLCEngine (asynchronous) as the shared request preprocessing stage.
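Sharing one preprocessing stage between a synchronous and an asynchronous front end works because the stage itself is an ordinary synchronous function. The sketch below illustrates that arrangement only; SyncEngine, AsyncEngine, and preprocess are hypothetical names and do not reflect the actual MLCEngine or AsyncMLCEngine code.

```python
import asyncio
from typing import List


def preprocess(text: str) -> List[int]:
    # Hypothetical shared preprocessing: one token ID per word.
    return [len(word) for word in text.split()]


class SyncEngine:
    """Hypothetical synchronous front end."""

    def chat(self, text: str) -> int:
        return len(preprocess(text))


class AsyncEngine:
    """Hypothetical asynchronous front end."""

    async def chat(self, text: str) -> int:
        # Same synchronous helper; only the surrounding engine I/O is awaited.
        tokens = preprocess(text)
        await asyncio.sleep(0)  # stands in for awaiting the engine
        return len(tokens)


sync_result = SyncEngine().chat("hello there world")
async_result = asyncio.run(AsyncEngine().chat("hello there world"))
```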
Usage
This function is called internally by the engine's chat completion handlers. It is the bridge between the OpenAI-compatible API layer and the engine's internal token-level generation interface. Understanding it is essential for debugging request processing issues, extending the API protocol, or customizing conversation template behavior.
Code Reference
Source Location
- Repository: MLC-LLM
- File: python/mlc_llm/serve/engine_base.py (lines 682-775)
Signature
def process_chat_completion_request(
    request: openai_api_protocol.ChatCompletionRequest,
    request_id: str,
    engine_state: EngineState,
    model_config: Dict[str, Any],
    f_tokenize: Callable[[str], List[int]],
    max_input_sequence_length: int,
    conv_template: Conversation,
) -> Tuple[List[Union[List[int], data.Data]], GenerationConfig, bool, int]:
Import
from mlc_llm.serve.engine_base import process_chat_completion_request
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| request | openai_api_protocol.ChatCompletionRequest | Yes | The chat completion request object containing messages, sampling parameters, and other settings conforming to the OpenAI API protocol. |
| request_id | str | Yes | The unique identifier for this request, used for event tracing and logging. |
| engine_state | EngineState | Yes | The engine state object that records request events and manages tracing. |
| model_config | Dict[str, Any] | Yes | The model configuration dictionary, used when applying the conversation template to produce the formatted prompt. |
| f_tokenize | Callable[[str], List[int]] | Yes | The tokenizer encode function that converts a text string into a list of token IDs. |
| max_input_sequence_length | int | Yes | The maximum allowed total prompt length in tokens. Requests exceeding this length are rejected. |
| conv_template | Conversation | Yes | The conversation template for the model, defining prompt formatting rules, system prefix tokens, stop token IDs, and stop strings. |
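To make the f_tokenize and max_input_sequence_length contracts concrete, the sketch below pairs a toy tokenizer of the required shape with a length check. It is an assumption-laden illustration: byte_tokenize and check_prompt_length are hypothetical helpers, and ValueError is chosen only for the example, since this page does not state which exception MLC-LLM raises.

```python
from typing import Callable, List


def byte_tokenize(text: str) -> List[int]:
    # Hypothetical tokenizer: one token ID per UTF-8 byte.
    return list(text.encode("utf-8"))


def check_prompt_length(token_ids: List[int], max_input_sequence_length: int) -> int:
    # Hypothetical check: reject prompts longer than the limit,
    # otherwise return the prompt length in tokens.
    if len(token_ids) > max_input_sequence_length:
        raise ValueError(
            f"Prompt has {len(token_ids)} tokens, "
            f"limit is {max_input_sequence_length}."
        )
    return len(token_ids)


# Any callable mapping str -> List[int] satisfies the f_tokenize contract.
f_tokenize: Callable[[str], List[int]] = byte_tokenize

ids = f_tokenize("Hello")
length = check_prompt_length(ids, max_input_sequence_length=8)

try:
    check_prompt_length(f_tokenize("a much longer prompt"), max_input_sequence_length=8)
    rejected = False
except ValueError:
    rejected = True
```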
Outputs
| Name | Type | Description |
|---|---|---|
| prompts | List[Union[List[int], data.Data]] | The processed and tokenized prompts. Each element is either a list of token IDs or a data.Data instance (for multimodal content). |
| generation_cfg | GenerationConfig | The generation configuration constructed from the request parameters, augmented with model-specific stop conditions. |
| use_function_calling | bool | A boolean flag indicating whether the request uses function/tool calling. |
| prompt_length | int | The total length of the tokenized prompt in tokens. |
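One way to picture how generation_cfg is augmented with model-specific stop conditions is a merge of the request's stop strings with those of the conversation template. The sketch below assumes a simple order-preserving union; ToyGenerationConfig and build_generation_config are hypothetical names, and the real GenerationConfig may carry different fields and merge rules.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyGenerationConfig:
    """Hypothetical stand-in for a generation configuration."""

    temperature: float = 1.0
    max_tokens: int = 128
    stop_strs: List[str] = field(default_factory=list)
    stop_token_ids: List[int] = field(default_factory=list)


def build_generation_config(
    temperature: float,
    max_tokens: int,
    request_stop: Optional[List[str]],
    template_stop_strs: List[str],
    template_stop_token_ids: List[int],
) -> ToyGenerationConfig:
    # Start from the stop strings supplied in the request, if any.
    stop_strs = list(request_stop or [])
    # Add the template's stop strings, skipping duplicates (assumed behavior).
    for stop in template_stop_strs:
        if stop not in stop_strs:
            stop_strs.append(stop)
    return ToyGenerationConfig(
        temperature=temperature,
        max_tokens=max_tokens,
        stop_strs=stop_strs,
        stop_token_ids=list(template_stop_token_ids),
    )


cfg = build_generation_config(
    temperature=0.7,
    max_tokens=256,
    request_stop=["###"],
    template_stop_strs=["</s>", "###"],
    template_stop_token_ids=[2],
)
```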
Usage Examples
Basic Usage
from mlc_llm.serve.engine_base import (
    EngineState,
    process_chat_completion_request,
)
from mlc_llm.protocol.openai_api_protocol import (
    ChatCompletionRequest,
    ChatCompletionMessage,
)
from mlc_llm.protocol.conversation_protocol import Conversation

# Assume engine_state, model_config, tokenizer, and conv_template
# are already initialized by the engine.
request = ChatCompletionRequest(
    messages=[
        ChatCompletionMessage(role="system", content="You are a helpful assistant."),
        ChatCompletionMessage(role="user", content="Hello, how are you?"),
    ],
    temperature=0.7,
    max_tokens=256,
)

prompts, generation_cfg, use_function_calling, prompt_length = (
    process_chat_completion_request(
        request=request,
        request_id="chatcmpl-abc123",
        engine_state=engine_state,
        model_config=model_config,
        f_tokenize=tokenizer.encode,
        max_input_sequence_length=4096,
        conv_template=conv_template,
    )
)

print(f"Prompt length: {prompt_length} tokens")
print(f"Function calling: {use_function_calling}")
print(f"Generation config temperature: {generation_cfg.temperature}")
Usage with Tool Calling
from mlc_llm.protocol.openai_api_protocol import (
    ChatCompletionRequest,
    ChatCompletionMessage,
    ChatTool,
)

# Assume engine_state, model_config, tokenizer, and conv_template
# are already initialized by the engine, as in the basic example.
request = ChatCompletionRequest(
    messages=[
        ChatCompletionMessage(role="user", content="What is the weather in Paris?"),
    ],
    tools=[
        ChatTool.model_validate({
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string", "description": "City name"},
                    },
                    "required": ["location"],
                },
            },
        })
    ],
    tool_choice="auto",
    max_tokens=256,
)

prompts, generation_cfg, use_function_calling, prompt_length = (
    process_chat_completion_request(
        request=request,
        request_id="chatcmpl-tool-001",
        engine_state=engine_state,
        model_config=model_config,
        f_tokenize=tokenizer.encode,
        max_input_sequence_length=4096,
        conv_template=conv_template,
    )
)

# use_function_calling will be True when tools are provided
print(f"Function calling enabled: {use_function_calling}")