Principle:Mlc ai Mlc llm Concurrent Request Handling
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, LLM_Inference |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concurrent request handling is the technique of processing multiple inference requests simultaneously using continuous batching and request preprocessing pipelines, maximizing GPU utilization while maintaining per-request isolation.
Description
LLM inference engines must efficiently handle multiple requests that arrive at different times and require varying amounts of computation. Unlike traditional batch processing where all inputs are padded to the same length and processed together, concurrent request handling in modern LLM serving uses continuous batching (also called iteration-level scheduling), where new requests can join an in-flight batch at any decoding step and completed requests leave without stalling others.
Before a request enters the generation engine, it must pass through a preprocessing pipeline that:
- Validates the request: Checks that the message structure is well-formed, that unsupported parameters are not set, and that the request satisfies the engine's constraints.
- Applies the conversation template: Converts the structured message list (system/user/assistant roles) into the raw prompt format expected by the specific model being served.
- Handles function calling: Detects and processes tool/function calling directives, updating the conversation template accordingly.
- Tokenizes the prompt: Converts the formatted prompt string into token IDs. For multimodal inputs, this may also include embedding image or other non-text data.
- Validates prompt length: Ensures the total prompt length does not exceed the engine's configured
max_input_sequence_length. - Constructs generation config: Extracts sampling parameters (temperature, top_p, stop tokens, etc.) from the request and merges them with model-specific stop token IDs and stop strings from the conversation template.
This preprocessing stage is crucial because it is the boundary between the high-level API protocol (OpenAI-compatible request format) and the low-level engine internals (token ID sequences and generation configs). Each request is processed independently, allowing requests with different parameters, prompt lengths, and generation configurations to coexist in the same batch.
The engine state tracks active requests, manages event tracing, and provides callbacks for streaming output back to individual callers. This per-request isolation ensures that one request's failure or cancellation does not affect others.
Usage
Use concurrent request handling when:
- Building an inference server that must serve multiple users simultaneously.
- Maximizing GPU throughput by keeping the GPU busy with multiple requests rather than processing one at a time.
- Supporting heterogeneous workloads where different requests have different prompt lengths, generation parameters, and output length requirements.
- Implementing fair scheduling across requests with varying priorities or latency requirements.
Theoretical Basis
Traditional batched inference pads all sequences to the same length, wasting compute on padding tokens:
# Traditional static batching
Batch = [
[tok1, tok2, tok3, PAD, PAD, PAD ], # request A (3 tokens)
[tok1, tok2, tok3, tok4, tok5, PAD ], # request B (5 tokens)
[tok1, tok2, tok3, tok4, tok5, tok6], # request C (6 tokens)
]
# PAD tokens waste GPU compute
Continuous batching eliminates this waste by scheduling at the iteration (token) level:
# Continuous batching: requests enter and leave dynamically
Step 1: [A_prefill, B_prefill, C_prefill] # all three start
Step 2: [A_decode, B_decode, C_decode ] # all three decode
Step 3: [A_done, B_decode, C_decode, D_prefill] # A finishes, D joins
Step 4: [ B_decode, C_decode, D_decode ] # B, C, D continue
The request preprocessing pipeline converts an API-level request into engine-ready data:
function ProcessChatCompletionRequest(request, request_id, engine_state,
model_config, f_tokenize,
max_input_seq_len, conv_template):
# Step 1: Validate request fields
check_unsupported_fields(request)
request.check_message_validity()
# Step 2: Apply conversation template
request.check_function_call_usage(conv_template)
for message in request.messages:
if message.role == "system":
conv_template.system_message = message.content
else:
conv_template.messages.append((message.role, message.content))
conv_template.messages.append(("assistant", None))
# Step 3: Tokenize
prompts = process_prompts(conv_template.as_prompt(model_config), f_tokenize)
# Step 4: Prepend system prefix tokens if applicable
if conv_template.system_prefix_token_ids is not None:
prompts[0] = system_prefix_token_ids + prompts[0]
# Step 5: Validate prompt length
prompt_length = check_and_get_prompts_length(prompts, max_input_seq_len)
# Step 6: Build generation config with model-specific stop conditions
generation_cfg = get_generation_config(
request,
extra_stop_token_ids=conv_template.stop_token_ids,
extra_stop_str=conv_template.stop_str,
)
return prompts, generation_cfg, use_function_calling, prompt_length
This decomposition ensures that the request validation and prompt formatting logic is cleanly separated from the generation engine, enabling the same preprocessing pipeline to serve both synchronous and asynchronous engine variants.