Implementation: mlc-ai/mlc-llm compile
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Deployment, Compiler_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for compiling a neural network model into an optimized, platform-specific binary library using TVM's Relax compiler pipeline, provided by MLC-LLM.
Description
The compile function is the entrypoint for MLC-LLM's model library compilation. It takes a model configuration dictionary, quantization scheme, target hardware specification, and optimization flags, then produces a compiled binary library (.so, .tar, or other format) suitable for the target platform. The compilation pipeline:
- Parses the model configuration dictionary and applies any user-specified overrides (context window size, tensor parallelism, etc.) via ModelConfigOverride.
- Creates the quantized model by applying the specified quantization scheme to the model architecture.
- Exports the model to TVM's Relax IR using model.export_tvm(), which produces an IRModule, a named parameter list, and any external modules.
- Applies pre-processing annotations to parameters (shard strategies for tensor parallelism, pipeline stage assignments for pipeline parallelism).
- Computes variable bounds for symbolic shapes (sequence length, batch size, total sequence length) based on the model configuration.
- Registers metadata in the compiled library, including model type, quantization name, context window parameters, parallelism settings, KV state kind (kv_cache, rnn_state, or none), and per-parameter preprocessing instructions.
- Runs the TVM Relax optimization pipeline (relax.get_pipeline("mlc_llm")) with target-specific optimizations including FlashInfer, cuBLAS GEMM, FasterTransformer, CUTLASS, IPC all-reduce, and CUDA graph capture.
- Invokes the build function to produce the final binary output.
- Reports estimated memory usage from the compiled metadata.
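The variable-bound step above can be pictured as deriving upper limits for the symbolic shapes from the model configuration. The helper below is a minimal sketch with hypothetical names and mappings, not the actual logic in compile.py:

```python
from typing import Any, Dict

def compute_variable_bounds(model_config: Dict[str, Any]) -> Dict[str, int]:
    """Illustrative only: derive upper bounds for symbolic shape variables."""
    context_window = model_config["context_window_size"]
    prefill_chunk = model_config.get("prefill_chunk_size", context_window)
    max_batch = model_config.get("max_batch_size", 1)
    return {
        "seq_len": prefill_chunk,         # longest single prefill chunk
        "batch_size": max_batch,          # largest concurrent batch
        "total_seq_len": context_window,  # maximum tokens tracked per sequence
    }

bounds = compute_variable_bounds(
    {"context_window_size": 4096, "prefill_chunk_size": 1024, "max_batch_size": 8}
)
```

Bounding the symbolic shapes this way lets the compiler allocate static workspaces and capture CUDA graphs for the worst case.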
Usage
Use this function as the fourth step of the MLC-LLM compilation pipeline, after weight conversion. It produces the model library that is loaded by MLCEngine at inference time. It is also invoked internally by the JIT compilation path when a pre-compiled library is not found.
Code Reference
Source Location
- Repository: MLC-LLM
- File:
python/mlc_llm/interface/compile.py (lines 217-254)
Signature
def compile(
    config: Dict[str, Any],
    quantization: Quantization,
    model_type: Model,
    target: Target,
    opt: OptimizationFlags,
    build_func: Callable[[IRModule, CompileArgs, Pass], None],
    system_lib_prefix: str,
    output: Path,
    overrides: ModelConfigOverride,
    debug_dump: Optional[Path] = None,
):
    """Compile a model given its configuration and quantization format to a specific target."""
Import
from mlc_llm.interface.compile import compile
from mlc_llm.interface.compiler_flags import OptimizationFlags, ModelConfigOverride
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Dict[str, Any] | Yes | Model configuration dictionary, typically loaded from mlc-chat-config.json. May contain a nested "model_config" key or be a flat dictionary of model architecture parameters. |
| quantization | Quantization | Yes | The quantization scheme object specifying the quantization algorithm, bit width, and other parameters. Must match the quantization used during weight conversion. |
| model_type | Model | Yes | The MLC model descriptor providing the model class, quantization methods, and configuration parser. Obtained from the MLC model registry. |
| target | Target | Yes | The TVM compilation target specifying the hardware backend (e.g., Target("cuda"), Target("vulkan"), Target("metal"), Target("llvm")). |
| opt | OptimizationFlags | Yes | Optimization flags controlling which acceleration libraries and techniques to enable. Includes flags for FlashInfer, cuBLAS GEMM, FasterTransformer, CUDA graphs, CUTLASS, and the IPC all-reduce strategy. Preset levels O0-O3 are available. |
| build_func | Callable[[IRModule, CompileArgs, Pass], None] | Yes | The build function that takes the optimized IR module, compile arguments, and optimization pipeline pass, and produces the binary output file. Typically tvm.relax.build or a wrapper thereof. |
| system_lib_prefix | str | Yes | A prefix string for the system library name, used when building system libraries for static linking (relevant for mobile and WebAssembly targets). |
| output | Path | Yes | Path to the output file for the compiled model library (e.g., model.so, model.tar). |
| overrides | ModelConfigOverride | Yes | Runtime overrides for model configuration fields such as context_window_size, prefill_chunk_size, tensor_parallel_shards, and max_batch_size. Use a default-constructed ModelConfigOverride() for no overrides. |
| debug_dump | Optional[Path] | No (default: None) | Optional path to a directory for dumping intermediate IR representations for debugging. |
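The config input accepts either a flat dictionary or one that nests the architecture parameters under a "model_config" key. A minimal sketch of that normalization (the helper name extract_model_config is an assumption, not the real API):

```python
from typing import Any, Dict

def extract_model_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Illustrative only: accept a flat dict or one nesting 'model_config'."""
    return config.get("model_config", config)

nested = {"model_type": "llama", "model_config": {"hidden_size": 4096}}
flat = {"hidden_size": 4096}
```

Both shapes resolve to the same architecture parameters, so callers can pass mlc-chat-config.json contents directly.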
Outputs
| Name | Type | Description |
|---|---|---|
| return value | None | The function returns nothing. Side effects include writing the compiled model library to the output path and logging compilation progress, metadata registration, and memory usage estimates. |
Exceptions
| Exception | Condition |
|---|---|
| NotImplementedError | Raised when ft-quant quantization is requested with tensor parallelism, or when KN layout quantization (q3f16_0, q4f16_0) is requested with tensor parallelism. |
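The compatibility restriction above can be sketched as a simple guard. The function and variable names here are assumptions for illustration; the real check lives inside the compilation pipeline:

```python
def check_quantization_compat(quantization_name: str, tensor_parallel_shards: int) -> None:
    """Illustrative guard mirroring the documented restrictions."""
    kn_layout = {"q3f16_0", "q4f16_0"}  # KN-layout quantization schemes
    if tensor_parallel_shards > 1:
        if quantization_name == "ft-quant":
            raise NotImplementedError("ft-quant does not support tensor parallelism")
        if quantization_name in kn_layout:
            raise NotImplementedError(
                f"{quantization_name} (KN layout) does not support tensor parallelism"
            )

# q4f16_1 is fine with tensor parallelism; ft-quant is not
check_quantization_compat("q4f16_1", 4)
```

Validating the combination early fails the compile before any expensive IR transformation runs.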
Usage Examples
Basic Usage
import json
from pathlib import Path
from tvm.target import Target
from mlc_llm.interface.compile import compile
from mlc_llm.interface.compiler_flags import OptimizationFlags, ModelConfigOverride
from mlc_llm.model import Model
from mlc_llm.quantization import Quantization
# Load model configuration
with open("./Llama-2-7b-chat-q4f16_1-MLC/mlc-chat-config.json") as f:
    config = json.load(f)
# Compile for CUDA target with O2 optimization level
compile(
    config=config,
    quantization=Quantization.from_name("q4f16_1"),
    model_type=Model.from_name("llama"),
    target=Target("cuda"),
    opt=OptimizationFlags.from_str("O2"),
    build_func=tvm_build_func,  # provided by the MLC build system
    system_lib_prefix="",
    output=Path("./Llama-2-7b-chat-q4f16_1-cuda.so"),
    overrides=ModelConfigOverride(),  # no config overrides
)
Compilation with Tensor Parallelism
import json
from pathlib import Path
from tvm.target import Target
from mlc_llm.interface.compile import compile
from mlc_llm.interface.compiler_flags import OptimizationFlags, ModelConfigOverride
from mlc_llm.model import Model
from mlc_llm.quantization import Quantization

with open("./Llama-2-70b-chat-q4f16_1-MLC/mlc-chat-config.json") as f:
    config = json.load(f)

compile(
    config=config,
    quantization=Quantization.from_name("q4f16_1"),
    model_type=Model.from_name("llama"),
    target=Target("cuda"),
    opt=OptimizationFlags.from_str("O3"),
    build_func=tvm_build_func,  # provided by the MLC build system
    system_lib_prefix="",
    output=Path("./Llama-2-70b-chat-q4f16_1-cuda.so"),
    overrides=ModelConfigOverride(
        tensor_parallel_shards=4,  # shard weights across 4 GPUs
        max_batch_size=16,
    ),
)
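The override mechanism used in the examples above can be pictured as a field-wise merge: only explicitly-set override fields replace values from the loaded configuration. A minimal sketch with a hypothetical stand-in class, not the real ModelConfigOverride implementation:

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional

@dataclass
class ConfigOverride:
    """Hypothetical stand-in for ModelConfigOverride."""
    context_window_size: Optional[int] = None
    prefill_chunk_size: Optional[int] = None
    tensor_parallel_shards: Optional[int] = None
    max_batch_size: Optional[int] = None

def apply_overrides(model_config: Dict[str, Any], override: ConfigOverride) -> Dict[str, Any]:
    """Merge override fields into the config; None means 'keep the config value'."""
    merged = dict(model_config)
    for f in fields(override):
        value = getattr(override, f.name)
        if value is not None:
            merged[f.name] = value
    return merged

cfg = {"context_window_size": 4096, "tensor_parallel_shards": 1}
merged = apply_overrides(cfg, ConfigOverride(tensor_parallel_shards=4, max_batch_size=16))
```

Because unset fields default to None, a default-constructed override leaves the configuration untouched, which is why ModelConfigOverride() is the documented "no overrides" value.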