
Implementation: mlc-ai/mlc-llm EngineConfig

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Serving, Systems_Engineering
Last Updated 2026-02-09 00:00 GMT

Overview

EngineConfig is MLC-LLM's concrete tool for configuring inference engine parameters to achieve optimal serving performance.

Description

EngineConfig is a Python dataclass that encapsulates all tunable parameters for the MLC-LLM serving engine. It governs model selection, memory management, batching strategy, KV cache sizing, speculative decoding, prefix caching, and prefill modes. The class provides a structured way to pass configuration from the user-facing API (CLI, Python constructor) through to the underlying C++ engine runtime.

Key responsibilities:

  • Mode-Based Defaults: When fields like max_num_sequence or max_total_sequence_length are left as None, the engine infers appropriate values based on the selected mode ("local", "interactive", or "server").
  • JSON Serialization: The asjson() method serializes the config to JSON for passing across the Python/C++ boundary via TVM FFI. The from_json() static method reconstructs the config from a JSON string.
  • Validation: Upstream code validates that the config does not conflict with constructor arguments (e.g., model, model_lib, mode cannot be set to conflicting values in both the constructor and the config).
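The serialization round trip can be sketched with a standalone dataclass that mirrors a small subset of the fields (an illustrative sketch, not the real class from mlc_llm.serve.config; the asjson()/from_json() names follow the methods described above, but the field list and bodies here are simplified):

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class MiniEngineConfig:
    """Illustrative subset of EngineConfig fields."""
    mode: Optional[str] = None
    max_num_sequence: Optional[int] = None
    kv_cache_page_size: int = 16
    additional_models: List[str] = field(default_factory=list)

    def asjson(self) -> str:
        # Serialize to a JSON string for the Python/C++ boundary.
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(json_str: str) -> "MiniEngineConfig":
        # Reconstruct the config from its JSON representation.
        return MiniEngineConfig(**json.loads(json_str))


cfg = MiniEngineConfig(mode="server", max_num_sequence=64)
restored = MiniEngineConfig.from_json(cfg.asjson())
assert restored == cfg
```

Because the config crosses the FFI boundary as plain JSON, every field must stay JSON-serializable, which is why the dataclass sticks to strings, numbers, booleans, and lists.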

Usage

Use EngineConfig when you need fine-grained control over serving behavior beyond what the preset modes provide. Pass it as the engine_config argument to AsyncMLCEngine, MLCEngine, or the serve() function. For most use cases, setting the mode alone is sufficient; explicit parameter overrides are needed only for advanced tuning.
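The interplay between mode presets and explicit overrides can be illustrated as follows. The per-mode numbers below are hypothetical placeholders, not MLC-LLM's actual defaults; they only stand in for the idea that fields left as None are filled from the selected mode while explicit values win:

```python
from typing import Optional

# Hypothetical per-mode presets -- placeholder numbers, not MLC-LLM's real values.
MODE_PRESETS = {
    "local": {"max_num_sequence": 4, "prefill_chunk_size": 512},
    "interactive": {"max_num_sequence": 1, "prefill_chunk_size": 512},
    "server": {"max_num_sequence": 128, "prefill_chunk_size": 8192},
}


def resolve_defaults(
    mode: str,
    max_num_sequence: Optional[int] = None,
    prefill_chunk_size: Optional[int] = None,
) -> dict:
    """Fill None fields from the mode preset; keep explicit overrides as-is."""
    preset = MODE_PRESETS[mode]
    return {
        "max_num_sequence": (
            max_num_sequence if max_num_sequence is not None
            else preset["max_num_sequence"]
        ),
        "prefill_chunk_size": (
            prefill_chunk_size if prefill_chunk_size is not None
            else preset["prefill_chunk_size"]
        ),
    }


# Explicit batch size wins; the chunk size falls back to the mode preset.
resolved = resolve_defaults("server", max_num_sequence=32)
```

This is why setting the mode alone usually suffices: the engine resolves the remaining fields, and you override only the knobs you actually care about.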

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/serve/config.py (Lines 8-169)

Signature

@dataclass
class EngineConfig:
    model: Optional[str] = None
    model_lib: Optional[str] = None
    additional_models: List[Union[str, Tuple[str, str]]] = field(default_factory=list)
    mode: Optional[Literal["local", "interactive", "server"]] = None
    tensor_parallel_shards: Optional[int] = None
    pipeline_parallel_stages: Optional[int] = None
    opt: Optional[str] = None
    gpu_memory_utilization: Optional[float] = None
    kv_cache_page_size: int = 16
    max_num_sequence: Optional[int] = None
    max_total_sequence_length: Optional[int] = None
    max_single_sequence_length: Optional[int] = None
    prefill_chunk_size: Optional[int] = None
    sliding_window_size: Optional[int] = None
    attention_sink_size: Optional[int] = None
    max_history_size: Optional[int] = None
    kv_state_kind: Optional[Literal["kv_cache", "rnn_state"]] = None
    speculative_mode: Literal["disable", "small_draft", "eagle", "medusa"] = "disable"
    spec_draft_length: int = 0
    spec_tree_width: int = 1
    prefix_cache_mode: Literal["disable", "radix"] = "radix"
    prefix_cache_max_num_recycling_seqs: Optional[int] = None
    prefill_mode: Literal["chunked", "hybrid"] = "hybrid"
    verbose: bool = True

Import

from mlc_llm.serve.config import EngineConfig

I/O Contract

Inputs

Name Type Required Description
model Optional[str] No Path to the model directory. Typically set by the engine constructor rather than directly.
model_lib Optional[str] No Path to the compiled model library. If None, JIT compilation is triggered.
additional_models List[Union[str, Tuple[str, str]]] No Paths to additional model directories (and optional library paths) for multi-model serving. Defaults to empty list.
mode Optional[Literal["local", "interactive", "server"]] No Engine mode preset. Controls default values for batch size, sequence length limits, and prefill chunk size.
tensor_parallel_shards Optional[int] No Number of tensor parallelism shards for multi-GPU inference.
pipeline_parallel_stages Optional[int] No Number of pipeline parallelism stages for distributing model layers.
opt Optional[str] No Optimization flags for JIT compilation (e.g., "O0", "O2", "O3", or explicit knobs like "cublas_gemm=1;cudagraph=0").
gpu_memory_utilization Optional[float] No Fraction of GPU memory to use (0 to 1). Defaults to 0.85 when unspecified.
kv_cache_page_size int No Number of consecutive tokens per page in paged KV cache. Currently must be 16, which is also the default.
max_num_sequence Optional[int] No Maximum number of concurrent sequences (batch size). Auto-inferred from mode if None.
max_total_sequence_length Optional[int] No Maximum total token count across all active sequences in KV cache. Auto-inferred from mode if None.
max_single_sequence_length Optional[int] No Maximum length of a single sequence.
prefill_chunk_size Optional[int] No Maximum total sequence length processed in a single prefill step. Auto-inferred from mode if None.
sliding_window_size Optional[int] No Sliding window size for sliding window attention (SWA).
attention_sink_size Optional[int] No Number of attention sink tokens retained when sliding window is enabled.
max_history_size Optional[int] No Maximum RNN state history size for rollback.
kv_state_kind Optional[Literal["kv_cache", "rnn_state"]] No Kind of state cache: traditional KV cache or RNN state.
speculative_mode Literal["disable", "small_draft", "eagle", "medusa"] No Speculative decoding strategy. Defaults to "disable".
spec_draft_length int No Number of speculative draft tokens per step. 0 enables adaptive mode. Defaults to 0.
spec_tree_width int No Width of the speculative decoding tree. Defaults to 1.
prefix_cache_mode Literal["disable", "radix"] No Prefix cache strategy. Defaults to "radix" (paged radix tree).
prefix_cache_max_num_recycling_seqs Optional[int] No Maximum recycling sequences in prefix cache. 0 disables, -1 means infinite capacity.
prefill_mode Literal["chunked", "hybrid"] No Prefill strategy: basic chunked prefill or hybrid prefill (split-fuse). Defaults to "hybrid".
verbose bool No Whether to print engine logging information. Defaults to True.

Outputs

Name Type Description
EngineConfig instance EngineConfig A configured dataclass instance. Use asjson() to serialize to JSON string for C++ engine initialization.

Usage Examples

Basic Usage with Mode Preset

from mlc_llm.serve.config import EngineConfig

# Use server mode defaults -- auto-configure for maximum throughput
config = EngineConfig(mode="server")

Explicit Memory and Batching Configuration

from mlc_llm.serve.config import EngineConfig

config = EngineConfig(
    gpu_memory_utilization=0.90,
    max_num_sequence=32,
    max_total_sequence_length=32768,
    prefill_chunk_size=4096,
    prefix_cache_mode="radix",
    prefill_mode="hybrid",
)

Speculative Decoding Configuration

from mlc_llm.serve.config import EngineConfig

config = EngineConfig(
    speculative_mode="eagle",
    spec_draft_length=4,
    spec_tree_width=2,
)

Passing to Engine Constructor

from mlc_llm.serve.engine import AsyncMLCEngine
from mlc_llm.serve.config import EngineConfig

engine = AsyncMLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="server",
    engine_config=EngineConfig(
        gpu_memory_utilization=0.85,
        max_num_sequence=64,
        speculative_mode="disable",
        prefix_cache_mode="radix",
    ),
)
