Implementation: mlc-ai/mlc-llm EngineConfig
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Serving, Systems_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete configuration class provided by MLC-LLM for tuning inference engine parameters to achieve optimal serving performance.
Description
EngineConfig is a Python dataclass that encapsulates all tunable parameters for the MLC-LLM serving engine. It governs model selection, memory management, batching strategy, KV cache sizing, speculative decoding, prefix caching, and prefill modes. The class provides a structured way to pass configuration from the user-facing API (CLI, Python constructor) through to the underlying C++ engine runtime.
Key responsibilities:
- Mode-Based Defaults: When fields like `max_num_sequence` or `max_total_sequence_length` are left as `None`, the engine infers appropriate values based on the selected `mode` ("local", "interactive", or "server").
- JSON Serialization: The `asjson()` method serializes the config to JSON for passing across the Python/C++ boundary via TVM FFI. The `from_json()` static method reconstructs the config from a JSON string.
- Validation: Upstream code validates that the config does not conflict with constructor arguments (e.g., `model`, `model_lib`, and `mode` cannot be set to conflicting values in both the constructor and the config).
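The serialization round-trip described above can be sketched with a small stand-in dataclass. This is a minimal illustration of the `asjson()`/`from_json()` pattern, not the real `EngineConfig` (which lives in `python/mlc_llm/serve/config.py` and carries many more fields):

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Minimal stand-in illustrating the asjson()/from_json() pattern;
# the real EngineConfig has many more fields.
@dataclass
class MiniEngineConfig:
    mode: Optional[str] = None
    max_num_sequence: Optional[int] = None

    def asjson(self) -> str:
        # Serialize all fields to a JSON string (this is what crosses
        # the Python/C++ FFI boundary in the real engine).
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(json_str: str) -> "MiniEngineConfig":
        # Reconstruct the config from the JSON produced by asjson().
        return MiniEngineConfig(**json.loads(json_str))

cfg = MiniEngineConfig(mode="server", max_num_sequence=32)
restored = MiniEngineConfig.from_json(cfg.asjson())
assert restored == cfg  # lossless round-trip
```

Because the config is a flat dataclass of JSON-friendly values, the round-trip is lossless, which is what makes a string the natural transport across the FFI boundary.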
Usage
Use EngineConfig when you need fine-grained control over serving behavior beyond what the preset modes provide. Pass it as the engine_config argument to AsyncMLCEngine, MLCEngine, or the serve() function. For most use cases, setting the mode alone is sufficient; explicit parameter overrides are needed only for advanced tuning.
Code Reference
Source Location
- Repository: MLC-LLM
- File: `python/mlc_llm/serve/config.py` (lines 8-169)
Signature
```python
@dataclass
class EngineConfig:
    model: Optional[str] = None
    model_lib: Optional[str] = None
    additional_models: List[Union[str, Tuple[str, str]]] = field(default_factory=list)
    mode: Optional[Literal["local", "interactive", "server"]] = None
    tensor_parallel_shards: Optional[int] = None
    pipeline_parallel_stages: Optional[int] = None
    opt: Optional[str] = None
    gpu_memory_utilization: Optional[float] = None
    kv_cache_page_size: int = 16
    max_num_sequence: Optional[int] = None
    max_total_sequence_length: Optional[int] = None
    max_single_sequence_length: Optional[int] = None
    prefill_chunk_size: Optional[int] = None
    sliding_window_size: Optional[int] = None
    attention_sink_size: Optional[int] = None
    max_history_size: Optional[int] = None
    kv_state_kind: Optional[Literal["kv_cache", "rnn_state"]] = None
    speculative_mode: Literal["disable", "small_draft", "eagle", "medusa"] = "disable"
    spec_draft_length: int = 0
    spec_tree_width: int = 1
    prefix_cache_mode: Literal["disable", "radix"] = "radix"
    prefix_cache_max_num_recycling_seqs: Optional[int] = None
    prefill_mode: Literal["chunked", "hybrid"] = "hybrid"
    verbose: bool = True
```
Import
```python
from mlc_llm.serve.config import EngineConfig
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | Optional[str] | No | Path to the model directory. Typically set by the engine constructor rather than directly. |
| model_lib | Optional[str] | No | Path to the compiled model library. If None, JIT compilation is triggered. |
| additional_models | List[Union[str, Tuple[str, str]]] | No | Paths to additional model directories (and optional library paths) for multi-model serving. Defaults to an empty list. |
| mode | Optional[Literal["local", "interactive", "server"]] | No | Engine mode preset. Controls default values for batch size, sequence length limits, and prefill chunk size. |
| tensor_parallel_shards | Optional[int] | No | Number of tensor parallelism shards for multi-GPU inference. |
| pipeline_parallel_stages | Optional[int] | No | Number of pipeline parallelism stages for distributing model layers. |
| opt | Optional[str] | No | Optimization flags for JIT compilation (e.g., "O0", "O2", "O3", or explicit knobs like "cublas_gemm=1;cudagraph=0"). |
| gpu_memory_utilization | Optional[float] | No | Fraction of GPU memory to use (0 to 1). Defaults to 0.85 when unspecified. |
| kv_cache_page_size | int | No | Number of consecutive tokens per page in the paged KV cache. Must be 16. Defaults to 16. |
| max_num_sequence | Optional[int] | No | Maximum number of concurrent sequences (batch size). Auto-inferred from mode if None. |
| max_total_sequence_length | Optional[int] | No | Maximum total token count across all active sequences in the KV cache. Auto-inferred from mode if None. |
| max_single_sequence_length | Optional[int] | No | Maximum length of a single sequence. |
| prefill_chunk_size | Optional[int] | No | Maximum total sequence length processed in a single prefill step. Auto-inferred from mode if None. |
| sliding_window_size | Optional[int] | No | Sliding window size for sliding window attention (SWA). |
| attention_sink_size | Optional[int] | No | Number of attention sink tokens retained when the sliding window is enabled. |
| max_history_size | Optional[int] | No | Maximum RNN state history size for rollback. |
| kv_state_kind | Optional[Literal["kv_cache", "rnn_state"]] | No | Kind of state cache: traditional KV cache or RNN state. |
| speculative_mode | Literal["disable", "small_draft", "eagle", "medusa"] | No | Speculative decoding strategy. Defaults to "disable". |
| spec_draft_length | int | No | Number of speculative draft tokens per step. 0 enables adaptive mode. Defaults to 0. |
| spec_tree_width | int | No | Width of the speculative decoding tree. Defaults to 1. |
| prefix_cache_mode | Literal["disable", "radix"] | No | Prefix cache strategy. Defaults to "radix" (paged radix tree). |
| prefix_cache_max_num_recycling_seqs | Optional[int] | No | Maximum number of recycling sequences in the prefix cache. 0 disables recycling; -1 means infinite capacity. |
| prefill_mode | Literal["chunked", "hybrid"] | No | Prefill strategy: basic chunked prefill or hybrid prefill (split-fuse). Defaults to "hybrid". |
| verbose | bool | No | Whether to print engine logging information. Defaults to True. |
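For intuition on how kv_cache_page_size interacts with sequence lengths: the paged KV cache allocates whole pages of consecutive tokens, so a sequence of length L occupies ceil(L / page_size) pages. A minimal sketch of that arithmetic (illustrative only, not MLC-LLM code):

```python
import math

def pages_needed(seq_len: int, page_size: int = 16) -> int:
    # The paged KV cache allocates whole pages of `page_size`
    # consecutive tokens, so a partially filled page still counts.
    return math.ceil(seq_len / page_size)

assert pages_needed(100) == 7   # 100 tokens -> 7 pages of 16
assert pages_needed(32) == 2    # exactly two full pages
```

This is why max_total_sequence_length, not per-sequence length alone, bounds KV cache memory: every active sequence's pages draw from the same pool.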
Outputs
| Name | Type | Description |
|---|---|---|
| EngineConfig instance | EngineConfig | A configured dataclass instance. Use asjson() to serialize to a JSON string for C++ engine initialization. |
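The constructor-versus-config conflict validation described in the overview can be sketched as follows. The helper name and exact error message are illustrative, not part of the MLC-LLM API; the real checks live in the engine constructors:

```python
from typing import Optional

def resolve_field(name: str, ctor_value: Optional[str], config_value: Optional[str]) -> Optional[str]:
    # Illustrative helper (not MLC-LLM's actual code): a field such as
    # model, model_lib, or mode may be supplied via the constructor or
    # via EngineConfig, but not set to conflicting values in both.
    if ctor_value is not None and config_value is not None and ctor_value != config_value:
        raise ValueError(
            f"'{name}' is set to conflicting values in the constructor "
            f"({ctor_value!r}) and in EngineConfig ({config_value!r})."
        )
    return ctor_value if ctor_value is not None else config_value

assert resolve_field("mode", "server", None) == "server"   # constructor wins
assert resolve_field("mode", None, "local") == "local"     # config fills the gap
assert resolve_field("mode", "server", "server") == "server"  # agreement is fine
```

The key design point is that agreement is allowed but disagreement fails fast, so a stale engine_config cannot silently override an explicit constructor argument.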
Usage Examples
Basic Usage with Mode Preset
```python
from mlc_llm.serve.config import EngineConfig

# Use server mode defaults -- auto-configure for maximum throughput
config = EngineConfig(mode="server")
```
Explicit Memory and Batching Configuration
```python
from mlc_llm.serve.config import EngineConfig

config = EngineConfig(
    gpu_memory_utilization=0.90,
    max_num_sequence=32,
    max_total_sequence_length=32768,
    prefill_chunk_size=4096,
    prefix_cache_mode="radix",
    prefill_mode="hybrid",
)
```
Speculative Decoding Configuration
```python
from mlc_llm.serve.config import EngineConfig

config = EngineConfig(
    speculative_mode="eagle",
    spec_draft_length=4,
    spec_tree_width=2,
)
```
Passing to Engine Constructor
```python
from mlc_llm.serve.engine import AsyncMLCEngine
from mlc_llm.serve.config import EngineConfig

engine = AsyncMLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="server",
    engine_config=EngineConfig(
        gpu_memory_utilization=0.85,
        max_num_sequence=64,
        speculative_mode="disable",
        prefix_cache_mode="radix",
    ),
)
```