Implementation:Mlc ai Mlc llm Process chat completion request

From Leeroopedia
Revision as of 15:51, 16 February 2026 by Admin (Auto-imported from implementations/Mlc_ai_Mlc_llm_Process_chat_completion_request.md)


Knowledge Sources
  • Domains: Deep_Learning, LLM_Inference
  • Last Updated: 2026-02-09 00:00 GMT

Overview

Concrete tool, provided by MLC-LLM, for preprocessing chat completion requests before they enter the engine, which serves multiple inference requests concurrently using continuous batching.

Description

process_chat_completion_request is a module-level function in engine_base.py that serves as the preprocessing pipeline for chat completion requests. It takes a raw ChatCompletionRequest (conforming to the OpenAI API protocol) and transforms it into engine-ready data: tokenized prompts, a generation configuration, a function-calling flag, and the total prompt length.

The function performs the following steps in order:

  1. Records the request event: Logs the request receipt in the engine state for tracing.
  2. Validates unsupported fields: Checks that the request does not set any parameter the engine does not yet support.
  3. Validates message structure: Ensures the message list has valid role orderings and content types.
  4. Processes function calling: Detects tool/function call usage and updates the conversation template accordingly.
  5. Populates the conversation template: Iterates over messages, setting the system message and appending user/assistant turns. Appends a final ("assistant", None) entry to prompt the model for its next response.
  6. Tokenizes the prompt: Applies the conversation template to produce a formatted prompt, then tokenizes it using the provided tokenizer function.
  7. Prepends system prefix tokens: If the conversation template defines system prefix token IDs, prepends them to the first prompt segment.
  8. Validates prompt length: Checks that the total prompt length does not exceed max_input_sequence_length.
  9. Constructs generation config: Builds the GenerationConfig from the request parameters, incorporating model-specific stop token IDs and stop strings from the conversation template.
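
Steps 5 through 9 above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the MLC-LLM implementation: ToyTemplate, toy_tokenize, and preprocess are hypothetical stand-ins, the prompt format is invented, the ValueError used for rejection is an assumption, and steps 1 through 4 (event recording and validation) are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class ToyTemplate:
    """Hypothetical stand-in for a conversation template (not the MLC-LLM class)."""

    system_message: str = ""
    messages: List[Tuple[str, Optional[str]]] = field(default_factory=list)
    system_prefix_token_ids: Optional[List[int]] = None
    stop_token_ids: List[int] = field(default_factory=list)

    def as_prompt(self) -> str:
        # Invented format: one "role: content" line per turn. The open
        # assistant turn (content None) is rendered as a bare "assistant:".
        lines = [f"system: {self.system_message}"] if self.system_message else []
        for role, content in self.messages:
            lines.append(f"{role}:" if content is None else f"{role}: {content}")
        return "\n".join(lines)


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical tokenizer: one "token ID" per whitespace-separated word.
    return [len(word) for word in text.split()]


def preprocess(
    messages: Sequence[Tuple[str, str]],
    template: ToyTemplate,
    f_tokenize: Callable[[str], List[int]],
    max_input_sequence_length: int,
    request_stop_token_ids: Sequence[int] = (),
) -> Tuple[List[int], List[int], int]:
    # Step 5: set the system message, append the turns, open an assistant turn.
    for role, content in messages:
        if role == "system":
            template.system_message = content
        else:
            template.messages.append((role, content))
    template.messages.append(("assistant", None))

    # Step 6: format the prompt and tokenize it.
    token_ids = f_tokenize(template.as_prompt())

    # Step 7: prepend system prefix tokens when the template defines them.
    if template.system_prefix_token_ids is not None:
        token_ids = template.system_prefix_token_ids + token_ids

    # Step 8: reject prompts over the limit (the exception type is an assumption).
    if len(token_ids) > max_input_sequence_length:
        raise ValueError(
            f"prompt has {len(token_ids)} tokens, limit is {max_input_sequence_length}"
        )

    # Step 9: merge request-level and template-level stop token IDs.
    stop_token_ids = sorted(set(request_stop_token_ids) | set(template.stop_token_ids))
    return token_ids, stop_token_ids, len(token_ids)


template = ToyTemplate(system_prefix_token_ids=[1], stop_token_ids=[2])
token_ids, stop_token_ids, prompt_length = preprocess(
    [("system", "Be brief."), ("user", "Hello there")],
    template,
    toy_tokenize,
    max_input_sequence_length=16,
)
print(prompt_length, stop_token_ids)
```

Note that the sketch mutates the template it is given, which mirrors step 5 as described; a caller reusing one template across requests would presumably want to pass a fresh copy each time.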

This function is used by both MLCEngine (synchronous) and AsyncMLCEngine (asynchronous) as the shared request preprocessing stage.
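
The sharing pattern described here, one synchronous preprocessing function called from both a blocking and an async entry point, might look like the following. ToySyncEngine, ToyAsyncEngine, and this preprocess are hypothetical names for illustration only; they do not reflect the structure of MLCEngine or AsyncMLCEngine.

```python
import asyncio
from typing import List, Tuple


def preprocess(text: str) -> Tuple[List[int], int]:
    """Hypothetical shared stage: returns toy token IDs and the prompt length."""
    token_ids = [ord(ch) for ch in text]
    return token_ids, len(token_ids)


class ToySyncEngine:
    def chat(self, text: str) -> int:
        _, prompt_length = preprocess(text)
        return prompt_length  # a real engine would generate tokens here


class ToyAsyncEngine:
    async def chat(self, text: str) -> int:
        _, prompt_length = preprocess(text)  # same synchronous call
        await asyncio.sleep(0)  # stand-in for awaiting generation output
        return prompt_length


sync_length = ToySyncEngine().chat("hi")
async_length = asyncio.run(ToyAsyncEngine().chat("hi"))
```

Keeping the preprocessing stage free of I/O and awaits is what lets both entry points call it unchanged.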

Usage

This function is called internally by the engine's chat completion handlers. It is the bridge between the OpenAI-compatible API layer and the engine's internal token-level generation interface. Understanding it is essential for debugging request processing issues, extending the API protocol, or customizing conversation template behavior.

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/serve/engine_base.py (lines 682-775)

Signature

def process_chat_completion_request(
    request: openai_api_protocol.ChatCompletionRequest,
    request_id: str,
    engine_state: EngineState,
    model_config: Dict[str, Any],
    f_tokenize: Callable[[str], List[int]],
    max_input_sequence_length: int,
    conv_template: Conversation,
) -> Tuple[List[Union[List[int], data.Data]], GenerationConfig, bool, int]:

Import

from mlc_llm.serve.engine_base import process_chat_completion_request

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| request | openai_api_protocol.ChatCompletionRequest | Yes | The chat completion request object containing messages, sampling parameters, and other settings conforming to the OpenAI API protocol. |
| request_id | str | Yes | The unique identifier for this request, used for event tracing and logging. |
| engine_state | EngineState | Yes | The engine state object that records request events and manages tracing. |
| model_config | Dict[str, Any] | Yes | The model configuration dictionary, used when applying the conversation template to produce the formatted prompt. |
| f_tokenize | Callable[[str], List[int]] | Yes | The tokenizer encode function that converts a text string into a list of token IDs. |
| max_input_sequence_length | int | Yes | The maximum allowed total prompt length in tokens. Requests exceeding this length are rejected. |
| conv_template | Conversation | Yes | The conversation template for the model, defining prompt formatting rules, system prefix tokens, stop token IDs, and stop strings. |
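
The f_tokenize parameter only needs to be a callable from a string to a list of token IDs, so a bound encode method satisfies it. The sketch below uses a hypothetical word-level ToyTokenizer to show the contract and how the result relates to max_input_sequence_length; a real engine would pass a subword tokenizer's encode function instead.

```python
from typing import Callable, Dict, List


class ToyTokenizer:
    """Hypothetical word-level tokenizer, used only to illustrate the contract."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        # Assign IDs in order of first appearance; repeated words reuse their ID.
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]


tokenizer = ToyTokenizer()

# A bound method matches the Callable[[str], List[int]] type of f_tokenize.
f_tokenize: Callable[[str], List[int]] = tokenizer.encode

token_ids = f_tokenize("to be or not to be")

# The length limit is expressed in tokens, not characters.
max_input_sequence_length = 8
within_limit = len(token_ids) <= max_input_sequence_length
```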

Outputs

| Name | Type | Description |
|---|---|---|
| prompts | List[Union[List[int], data.Data]] | The processed and tokenized prompts. Each element is either a list of token IDs or a data.Data instance (for multimodal content). |
| generation_cfg | GenerationConfig | The generation configuration constructed from the request parameters, augmented with model-specific stop conditions. |
| use_function_calling | bool | A boolean flag indicating whether the request uses function/tool calling. |
| prompt_length | int | The total length of the tokenized prompt in tokens. |
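
Because prompts can mix token-ID lists with non-text segments, code that consumes it has to branch on the element type. The sketch below is illustrative only: ToyImageData is a hypothetical stand-in for a data.Data segment, and the assumption that such a segment contributes a fixed number of positions to the prompt length is not confirmed by this page.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ToyImageData:
    """Hypothetical stand-in for a non-text prompt segment."""

    embed_size: int  # positions the segment is assumed to occupy in the prompt


Segment = Union[List[int], ToyImageData]


def total_prompt_length(prompts: List[Segment]) -> int:
    total = 0
    for segment in prompts:
        if isinstance(segment, list):
            total += len(segment)  # text segment: one position per token ID
        else:
            total += segment.embed_size  # non-text segment: assumed fixed size
    return total


prompts: List[Segment] = [[1, 5, 9], ToyImageData(embed_size=576), [7, 2]]
prompt_length = total_prompt_length(prompts)
```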

Usage Examples

Basic Usage

from mlc_llm.serve.engine_base import process_chat_completion_request
from mlc_llm.protocol.openai_api_protocol import (
    ChatCompletionRequest,
    ChatCompletionMessage,
)

# Assume engine_state (EngineState), model_config (dict), tokenizer, and
# conv_template (Conversation) are already initialized by the engine.

request = ChatCompletionRequest(
    messages=[
        ChatCompletionMessage(role="system", content="You are a helpful assistant."),
        ChatCompletionMessage(role="user", content="Hello, how are you?"),
    ],
    temperature=0.7,
    max_tokens=256,
)

prompts, generation_cfg, use_function_calling, prompt_length = (
    process_chat_completion_request(
        request=request,
        request_id="chatcmpl-abc123",
        engine_state=engine_state,
        model_config=model_config,
        f_tokenize=tokenizer.encode,
        max_input_sequence_length=4096,
        conv_template=conv_template,
    )
)

print(f"Prompt length: {prompt_length} tokens")
print(f"Function calling: {use_function_calling}")
print(f"Generation config temperature: {generation_cfg.temperature}")

Usage with Tool Calling

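from mlc_llm.serve.engine_base import process_chat_completion_request
from mlc_llm.protocol.openai_api_protocol import (
    ChatCompletionRequest,
    ChatCompletionMessage,
    ChatTool,
)

# As in Basic Usage, engine_state, model_config, tokenizer, and conv_template
# are assumed to be already initialized by the engine.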

request = ChatCompletionRequest(
    messages=[
        ChatCompletionMessage(role="user", content="What is the weather in Paris?"),
    ],
    tools=[
        ChatTool.model_validate({
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string", "description": "City name"},
                    },
                    "required": ["location"],
                },
            },
        })
    ],
    tool_choice="auto",
    max_tokens=256,
)

prompts, generation_cfg, use_function_calling, prompt_length = (
    process_chat_completion_request(
        request=request,
        request_id="chatcmpl-tool-001",
        engine_state=engine_state,
        model_config=model_config,
        f_tokenize=tokenizer.encode,
        max_input_sequence_length=4096,
        conv_template=conv_template,
    )
)

# use_function_calling will be True when tools are provided
print(f"Function calling enabled: {use_function_calling}")

Related Pages

Implements Principle
