# Heuristic: MLC-LLM Engine Mode Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deployment |
| Last Updated | 2026-02-09 19:00 GMT |
## Overview
Decision guide for selecting the engine mode (`local`, `interactive`, `server`), which controls the automatic inference of max batch size, KV cache capacity, and prefill chunk size.
## Description
MLC-LLM provides three engine modes that control how the engine auto-configures its resource allocation. Each mode makes different trade-offs between memory usage, concurrency, and context length. The mode determines default values for `max_num_sequence` (batch size), `max_total_sequence_length` (KV cache capacity), and `prefill_chunk_size` when these are not explicitly set by the user. Users can always override any of these values manually.
## Usage
Apply this heuristic when launching the engine (via `mlc_llm serve`, the Python API, or JIT compilation) and deciding which mode to use. The choice depends on the deployment scenario: single-user desktop, interactive chatbot, or production multi-user server.
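For instance, the serve CLI exposes the mode as a flag. A sketch of both the default and the override paths; the model URI is illustrative, and the exact override key names should be checked against `mlc_llm serve --help`:

```shell
# Production server: let the engine auto-infer batch size and KV cache
# capacity to fill available GPU memory.
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --mode server

# Same mode, but manually capping the KV cache and the memory budget
# (override keys are assumptions here; verify against the CLI help).
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --mode server \
  --overrides "max_total_seq_length=16384;gpu_memory_utilization=0.85"
```

Explicit overrides always take precedence over the mode's inferred defaults, so the mode only fills in values the user leaves unset.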
## The Insight (Rule of Thumb)
- Mode `local` (default):
  - Max batch size: 4
  - Max total sequence length: `min(model_max, context_window, 8192)`
  - Use case: local desktop deployment with low concurrency
  - Trade-off: uses less GPU memory; good for consumer GPUs
- Mode `interactive`:
  - Max batch size: 1
  - Max total sequence length: `min(model_max, context_window)`
  - Use case: single-user chatbot, maximum context per request
  - Trade-off: all memory dedicated to a single sequence
- Mode `server`:
  - Max batch size: auto-inferred to maximize GPU utilization
  - Max total sequence length: auto-inferred to fill available memory
  - Use case: production serving with many concurrent requests
  - Trade-off: uses as much GPU memory as possible (within `gpu_memory_utilization`)
- RNN models (RWKV):
  - Max batch size: 4 (`local`/`server`) or 1 (`interactive`)
  - Max history size: auto-inferred from available memory
  - Prefill chunk size: `min(model_max, 4096)`
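The defaults above can be condensed into a small lookup. A sketch of that decision table; the function name and the `None` sentinels for "auto-inferred" are mine, not MLC-LLM's:

```python
def infer_defaults(mode, model_max, context_window, is_rnn=False):
    """Return (max_batch_size, max_total_seq_len) defaults for an engine mode.

    None means the engine auto-infers the value from available GPU memory.
    """
    if is_rnn:
        # RNN models (e.g. RWKV) have no KV cache length; history size is
        # inferred from memory, so only the batch size is fixed here.
        return (1 if mode == "interactive" else 4), None
    if mode == "local":
        # Cap the cache at 8192 tokens to stay friendly to consumer GPUs.
        return 4, min(model_max, context_window, 8192)
    if mode == "interactive":
        # One sequence gets the full context window.
        return 1, min(model_max, context_window)
    if mode == "server":
        # Both values auto-inferred to maximize GPU utilization.
        return None, None
    raise ValueError(f"unknown mode: {mode}")

print(infer_defaults("local", model_max=32768, context_window=32768))
# → (4, 8192)
```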
## Reasoning
The three modes represent common deployment patterns with fundamentally different resource allocation needs:
In local mode, the engine caps the KV cache at 8192 tokens and batch size at 4, which is sufficient for most consumer GPU setups (8-16GB VRAM) while leaving memory for other applications.
In interactive mode, a single sequence gets all available KV cache capacity, enabling maximum context window usage for a conversational use case.
In server mode, the engine attempts to allocate the maximum possible batch and cache capacity based on `gpu_memory_utilization`, targeting high throughput with many concurrent users.
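To see why the 8192-token cap in local mode suits consumer GPUs, consider a rough KV-cache size estimate for a Llama-2-7B-class model (32 layers, 32 attention heads, head dim 128, fp16). The formula and figures are illustrative back-of-the-envelope arithmetic, not MLC-LLM's actual allocator:

```python
def kv_cache_bytes(layers, heads, head_dim, tokens, dtype_bytes=2):
    # 2 accounts for both the K and V tensors per layer; real engines add
    # paging/metadata overhead on top of this raw tensor footprint.
    return 2 * layers * heads * head_dim * tokens * dtype_bytes

# Local mode: up to 4 sequences share one 8192-token cache pool.
local_pool = kv_cache_bytes(layers=32, heads=32, head_dim=128, tokens=8192)
print(f"{local_pool / 2**30:.1f} GiB")  # → 4.0 GiB of fp16 KV cache
```

At roughly 4 GiB, the cache fits alongside ~4 GiB of q4 weights on an 8-16 GB card, which is exactly the consumer-GPU budget the local mode targets.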
```python
# From config.py:26-45
# Mode "local" refers to local server deployment with low concurrency.
# Max batch size: 4, max total seq and prefill chunk: context window.
# Mode "interactive" refers to interactive use, at most 1 concurrent request.
# Max batch size: 1, max total seq and prefill chunk: context window.
# Mode "server" refers to large server with many concurrent requests.
# Automatically infer largest possible batch and total seq length.
```
```cpp
// From config.cc:759-772
if (mode == EngineMode::kLocal) {
  inferred_config.max_total_sequence_length = std::min(
      {model_max_total_sequence_length,
       inferred_config.max_single_sequence_length.value(),
       static_cast<int64_t>(8192)});
} else if (mode == EngineMode::kInteractive) {
  inferred_config.max_total_sequence_length = std::min(
      {model_max_total_sequence_length,
       inferred_config.max_single_sequence_length.value()});
} else {  // server
  inferred_config.max_total_sequence_length =
      max_num_sequence * inferred_config.max_single_sequence_length.value();
}
```