
Heuristic: mlc-ai/mlc-llm Engine Mode Selection

From Leeroopedia



Knowledge Sources
Domains Optimization, Deployment
Last Updated 2026-02-09 19:00 GMT

Overview

A decision guide for selecting the engine mode (`local`, `interactive`, or `server`), which controls automatic inference of the maximum batch size, KV cache capacity, and prefill chunk size.

Description

MLC-LLM provides three engine modes that control how the engine auto-configures its resource allocation. Each mode makes different trade-offs between memory usage, concurrency, and context length. The mode determines default values for `max_num_sequence` (batch size), `max_total_sequence_length` (KV cache capacity), and `prefill_chunk_size` when these are not explicitly set by the user. Users can always override any of these values manually.

Usage

Apply this heuristic when launching the engine (via `mlc_llm serve`, the Python API, or JIT compilation) to decide which mode to use. The right choice depends on the deployment scenario: single-user desktop, interactive chatbot, or production multi-user server.
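As a concrete starting point, the mode can be passed at launch time. The sketch below is illustrative: the model identifier is a placeholder, and flag names should be verified against the installed CLI's `--help` output.

```shell
# Single-user chatbot: dedicate the full KV cache to one sequence.
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --mode interactive

# Production serving: let the engine fill available GPU memory.
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --mode server
```

If no `--mode` is given, the engine falls back to `local`, which is the conservative default for consumer GPUs.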

The Insight (Rule of Thumb)

  • Mode "local" (default):
    • Max batch size: 4
    • Max total sequence length: min(model_max, context_window, 8192)
    • Use case: Local desktop deployment with low concurrency
    • Trade-off: Uses less GPU memory; good for consumer GPUs
  • Mode "interactive":
    • Max batch size: 1
    • Max total sequence length: min(model_max, context_window)
    • Use case: Single-user chatbot, maximum context per request
    • Trade-off: All memory dedicated to a single sequence
  • Mode "server":
    • Max batch size: auto-inferred to maximize GPU utilization
    • Max total sequence length: auto-inferred to fill available memory
    • Use case: Production serving with many concurrent requests
    • Trade-off: Uses as much GPU memory as possible (within `gpu_memory_utilization`)
  • RNN models (RWKV):
    • Max batch size: 4 (local/server) or 1 (interactive)
    • Max history size: auto-inferred from available memory
    • Prefill chunk size: min(model_max, 4096)
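The transformer-model rows of the table above can be condensed into a small lookup helper. This is illustrative pseudologic mirroring the heuristic, not the MLC-LLM API; the function and argument names are invented for this example, and `None` stands in for "auto-inferred from GPU memory":

```python
def inferred_defaults(mode, model_max, context_window):
    """Return (max_num_sequence, max_total_sequence_length) that the
    engine would pick when the user sets neither value explicitly."""
    if mode == "local":
        # Low concurrency: small batch, KV cache capped at 8192 tokens.
        return 4, min(model_max, context_window, 8192)
    if mode == "interactive":
        # One sequence gets the full usable context window.
        return 1, min(model_max, context_window)
    if mode == "server":
        # Both values are inferred at runtime from available GPU memory.
        return None, None
    raise ValueError(f"unknown mode: {mode!r}")

print(inferred_defaults("local", 32768, 32768))        # (4, 8192)
print(inferred_defaults("interactive", 32768, 32768))  # (1, 32768)
```

Any value returned here is only a default; explicitly set values always win.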

Reasoning

The three modes represent common deployment patterns with fundamentally different resource allocation needs:

In local mode, the engine caps the KV cache at 8192 tokens and the batch size at 4, which is sufficient for most consumer GPU setups (8-16 GB of VRAM) while leaving memory free for other applications.

In interactive mode, a single sequence gets all available KV cache capacity, enabling maximum context window usage for a conversational use case.

In server mode, the engine attempts to allocate the maximum possible batch and cache capacity based on `gpu_memory_utilization`, targeting high throughput with many concurrent users.
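To make the server-mode trade-off concrete, here is a back-of-the-envelope KV-cache budget. All numbers are illustrative assumptions (per-token KV size varies with model architecture and quantization), and the real inference in `config.cc` also reserves memory for activations and temporary workspace:

```python
def servable_kv_tokens(vram_gb, weights_gb, gpu_memory_utilization,
                       kv_bytes_per_token):
    """Rough count of KV-cache tokens server mode could allocate."""
    budget_bytes = vram_gb * 1e9 * gpu_memory_utilization - weights_gb * 1e9
    return int(budget_bytes // kv_bytes_per_token)

# Example: 24 GB GPU, 8 GB of weights, 0.85 memory utilization,
# ~128 KB of KV cache per token.
tokens = servable_kv_tokens(24, 8, 0.85, 131072)
print(tokens)  # on the order of ~95k tokens of shared KV capacity
```

The resulting capacity is shared across all concurrent sequences, which is why server mode can serve many requests but gives each one a smaller slice than interactive mode would.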

# From config.py:26-45
# Mode "local" refers to local server deployment with low concurrency.
# Max batch size: 4, max total seq and prefill chunk: context window.
# Mode "interactive" refers to interactive use, at most 1 concurrent request.
# Max batch size: 1, max total seq and prefill chunk: context window.
# Mode "server" refers to large server with many concurrent requests.
# Automatically infer largest possible batch and total seq length.

// From config.cc:759-772
if (mode == EngineMode::kLocal) {
    inferred_config.max_total_sequence_length = std::min(
        {model_max_total_sequence_length,
         inferred_config.max_single_sequence_length.value(),
         static_cast<int64_t>(8192)});
} else if (mode == EngineMode::kInteractive) {
    inferred_config.max_total_sequence_length = std::min(
        {model_max_total_sequence_length,
         inferred_config.max_single_sequence_length.value()});
} else {  // server
    inferred_config.max_total_sequence_length =
        max_num_sequence * inferred_config.max_single_sequence_length.value();
}
