Implementation: mlc-ai/mlc-llm EngineConfig
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Serving, Systems_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete configuration class provided by MLC-LLM for tuning inference engine parameters to achieve optimal serving performance.
Description
EngineConfig is a Python dataclass that encapsulates all tunable parameters for the MLC-LLM serving engine. It governs model selection, memory management, batching strategy, KV cache sizing, speculative decoding, prefix caching, and prefill modes. The class provides a structured way to pass configuration from the user-facing API (CLI, Python constructor) through to the underlying C++ engine runtime.
Key responsibilities:
- Mode-Based Defaults: When fields like `max_num_sequence` or `max_total_sequence_length` are left as `None`, the engine infers appropriate values based on the selected `mode` ("local", "interactive", or "server").
- JSON Serialization: The `asjson()` method serializes the config to JSON for passing across the Python/C++ boundary via TVM FFI. The `from_json()` static method reconstructs the config from a JSON string.
- Validation: Upstream code validates that the config does not conflict with constructor arguments (e.g., `model`, `model_lib`, and `mode` cannot be set to conflicting values in both the constructor and the config).
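The serialization round-trip described above can be sketched with a small stand-in dataclass. This is a minimal illustration of the `asjson()`/`from_json()` pattern, not the real `EngineConfig` (which lives in `python/mlc_llm/serve/config.py` and carries many more fields):

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Minimal stand-in illustrating the asjson()/from_json() pattern;
# the real EngineConfig has many more fields.
@dataclass
class MiniEngineConfig:
    mode: Optional[str] = None
    max_num_sequence: Optional[int] = None

    def asjson(self) -> str:
        # Serialize all fields to a JSON string (this is what crosses
        # the Python/C++ FFI boundary in the real engine).
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(json_str: str) -> "MiniEngineConfig":
        # Reconstruct the config from the JSON produced by asjson().
        return MiniEngineConfig(**json.loads(json_str))

cfg = MiniEngineConfig(mode="server", max_num_sequence=32)
restored = MiniEngineConfig.from_json(cfg.asjson())
assert restored == cfg  # lossless round-trip
```

Because the config is a flat dataclass of JSON-friendly values, the round-trip is lossless, which is what makes a string the natural transport across the FFI boundary.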
Usage
Use EngineConfig when you need fine-grained control over serving behavior beyond what the preset modes provide. Pass it as the engine_config argument to AsyncMLCEngine, MLCEngine, or the serve() function. For most use cases, setting the mode alone is sufficient; explicit parameter overrides are needed only for advanced tuning.
Code Reference
Source Location
- Repository: MLC-LLM
- File: `python/mlc_llm/serve/config.py` (lines 8-169)
Signature
```python
@dataclass
class EngineConfig:
    model: Optional[str] = None
    model_lib: Optional[str] = None
    additional_models: List[Union[str, Tuple[str, str]]] = field(default_factory=list)
    mode: Optional[Literal["local", "interactive", "server"]] = None
    tensor_parallel_shards: Optional[int] = None
    pipeline_parallel_stages: Optional[int] = None
    opt: Optional[str] = None
    gpu_memory_utilization: Optional[float] = None
    kv_cache_page_size: int = 16
    max_num_sequence: Optional[int] = None
    max_total_sequence_length: Optional[int] = None
    max_single_sequence_length: Optional[int] = None
    prefill_chunk_size: Optional[int] = None
    sliding_window_size: Optional[int] = None
    attention_sink_size: Optional[int] = None
    max_history_size: Optional[int] = None
    kv_state_kind: Optional[Literal["kv_cache", "rnn_state"]] = None
    speculative_mode: Literal["disable", "small_draft", "eagle", "medusa"] = "disable"
    spec_draft_length: int = 0
    spec_tree_width: int = 1
    prefix_cache_mode: Literal["disable", "radix"] = "radix"
    prefix_cache_max_num_recycling_seqs: Optional[int] = None
    prefill_mode: Literal["chunked", "hybrid"] = "hybrid"
    verbose: bool = True
```
Import
```python
from mlc_llm.serve.config import EngineConfig
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | Optional[str] | No | Path to the model directory. Typically set by the engine constructor rather than directly. |
| model_lib | Optional[str] | No | Path to the compiled model library. If None, JIT compilation is triggered. |
| additional_models | List[Union[str, Tuple[str, str]]] | No | Paths to additional model directories (and optional library paths) for multi-model serving. Defaults to an empty list. |
| mode | Optional[Literal["local", "interactive", "server"]] | No | Engine mode preset. Controls default values for batch size, sequence length limits, and prefill chunk size. |
| tensor_parallel_shards | Optional[int] | No | Number of tensor parallelism shards for multi-GPU inference. |
| pipeline_parallel_stages | Optional[int] | No | Number of pipeline parallelism stages for distributing model layers. |
| opt | Optional[str] | No | Optimization flags for JIT compilation (e.g., "O0", "O2", "O3", or explicit knobs like "cublas_gemm=1;cudagraph=0"). |
| gpu_memory_utilization | Optional[float] | No | Fraction of GPU memory to use (0 to 1). Defaults to 0.85 when unspecified. |
| kv_cache_page_size | int | No | Number of consecutive tokens per page in the paged KV cache. Must be 16. Defaults to 16. |
| max_num_sequence | Optional[int] | No | Maximum number of concurrent sequences (batch size). Auto-inferred from mode if None. |
| max_total_sequence_length | Optional[int] | No | Maximum total token count across all active sequences in the KV cache. Auto-inferred from mode if None. |
| max_single_sequence_length | Optional[int] | No | Maximum length of a single sequence. |
| prefill_chunk_size | Optional[int] | No | Maximum total sequence length processed in a single prefill step. Auto-inferred from mode if None. |
| sliding_window_size | Optional[int] | No | Sliding window size for sliding window attention (SWA). |
| attention_sink_size | Optional[int] | No | Number of attention sink tokens retained when the sliding window is enabled. |
| max_history_size | Optional[int] | No | Maximum RNN state history size for rollback. |
| kv_state_kind | Optional[Literal["kv_cache", "rnn_state"]] | No | Kind of state cache: traditional KV cache or RNN state. |
| speculative_mode | Literal["disable", "small_draft", "eagle", "medusa"] | No | Speculative decoding strategy. Defaults to "disable". |
| spec_draft_length | int | No | Number of speculative draft tokens per step. 0 enables adaptive mode. Defaults to 0. |
| spec_tree_width | int | No | Width of the speculative decoding tree. Defaults to 1. |
| prefix_cache_mode | Literal["disable", "radix"] | No | Prefix cache strategy. Defaults to "radix" (paged radix tree). |
| prefix_cache_max_num_recycling_seqs | Optional[int] | No | Maximum number of recycling sequences in the prefix cache. 0 disables recycling; -1 means infinite capacity. |
| prefill_mode | Literal["chunked", "hybrid"] | No | Prefill strategy: basic chunked prefill or hybrid prefill (split-fuse). Defaults to "hybrid". |
| verbose | bool | No | Whether to print engine logging information. Defaults to True. |
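For intuition on how kv_cache_page_size interacts with sequence lengths: the paged KV cache allocates whole pages of consecutive tokens, so a sequence of length L occupies ceil(L / page_size) pages. A minimal sketch of that arithmetic (illustrative only, not MLC-LLM code):

```python
import math

def pages_needed(seq_len: int, page_size: int = 16) -> int:
    # The paged KV cache allocates whole pages of `page_size`
    # consecutive tokens, so a partially filled page still counts.
    return math.ceil(seq_len / page_size)

assert pages_needed(100) == 7   # 100 tokens -> 7 pages of 16
assert pages_needed(32) == 2    # exactly two full pages
```

This is why max_total_sequence_length, not per-sequence length alone, bounds KV cache memory: every active sequence's pages draw from the same pool.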
Outputs
| Name | Type | Description |
|---|---|---|
| EngineConfig instance | EngineConfig | A configured dataclass instance. Use asjson() to serialize to a JSON string for C++ engine initialization. |
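The constructor-versus-config conflict validation described in the overview can be sketched as follows. The helper name and exact error message are illustrative, not part of the MLC-LLM API; the real checks live in the engine constructors:

```python
from typing import Optional

def resolve_field(name: str, ctor_value: Optional[str], config_value: Optional[str]) -> Optional[str]:
    # Illustrative helper (not MLC-LLM's actual code): a field such as
    # model, model_lib, or mode may be supplied via the constructor or
    # via EngineConfig, but not set to conflicting values in both.
    if ctor_value is not None and config_value is not None and ctor_value != config_value:
        raise ValueError(
            f"'{name}' is set to conflicting values in the constructor "
            f"({ctor_value!r}) and in EngineConfig ({config_value!r})."
        )
    return ctor_value if ctor_value is not None else config_value

assert resolve_field("mode", "server", None) == "server"   # constructor wins
assert resolve_field("mode", None, "local") == "local"     # config fills the gap
assert resolve_field("mode", "server", "server") == "server"  # agreement is fine
```

The key design point is that agreement is allowed but disagreement fails fast, so a stale engine_config cannot silently override an explicit constructor argument.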
Usage Examples
Basic Usage with Mode Preset
```python
from mlc_llm.serve.config import EngineConfig

# Use server mode defaults -- auto-configure for maximum throughput
config = EngineConfig(mode="server")
```
Explicit Memory and Batching Configuration
```python
from mlc_llm.serve.config import EngineConfig

config = EngineConfig(
    gpu_memory_utilization=0.90,
    max_num_sequence=32,
    max_total_sequence_length=32768,
    prefill_chunk_size=4096,
    prefix_cache_mode="radix",
    prefill_mode="hybrid",
)
```
Speculative Decoding Configuration
```python
from mlc_llm.serve.config import EngineConfig

config = EngineConfig(
    speculative_mode="eagle",
    spec_draft_length=4,
    spec_tree_width=2,
)
```
Passing to Engine Constructor
```python
from mlc_llm.serve.engine import AsyncMLCEngine
from mlc_llm.serve.config import EngineConfig

engine = AsyncMLCEngine(
    model="dist/models/Llama-2-7b-chat-hf-q4f16_1",
    device="cuda",
    mode="server",
    engine_config=EngineConfig(
        gpu_memory_utilization=0.85,
        max_num_sequence=64,
        speculative_mode="disable",
        prefix_cache_mode="radix",
    ),
)
```