Implementation: mlc-ai/mlc-llm compile
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Deployment, Compiler_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for compiling a neural network model into an optimized, platform-specific binary library using TVM's Relax compiler pipeline, provided by MLC-LLM.
Description
The compile function is the entrypoint for MLC-LLM's model library compilation. It takes a model configuration dictionary, quantization scheme, target hardware specification, and optimization flags, then produces a compiled binary library (.so, .tar, or other format) suitable for the target platform. The compilation pipeline:
- Parses the model configuration dictionary and applies any user-specified overrides (context window size, tensor parallelism, etc.) via ModelConfigOverride.
- Creates the quantized model by applying the specified quantization scheme to the model architecture.
- Exports the model to TVM's Relax IR using model.export_tvm(), which produces an IRModule, a named parameter list, and any external modules.
- Applies pre-processing annotations to parameters (shard strategies for tensor parallelism, pipeline stage assignments for pipeline parallelism).
- Computes variable bounds for symbolic shapes (sequence length, batch size, total sequence length) based on the model configuration.
- Registers metadata in the compiled library, including model type, quantization name, context window parameters, parallelism settings, KV state kind (kv_cache, rnn_state, or none), and per-parameter preprocessing instructions.
- Runs the TVM Relax optimization pipeline (relax.get_pipeline("mlc_llm")) with target-specific optimizations including FlashInfer, cuBLAS GEMM, FasterTransformer, CUTLASS, IPC all-reduce, and CUDA graph capture.
- Invokes the build function to produce the final binary output.
- Reports estimated memory usage from the compiled metadata.
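The variable-bound step above can be pictured as deriving upper limits for the symbolic shapes from the model configuration. The helper below is a minimal sketch with hypothetical names and mappings, not the actual logic in compile.py:

```python
from typing import Any, Dict

def compute_variable_bounds(model_config: Dict[str, Any]) -> Dict[str, int]:
    """Illustrative only: derive upper bounds for symbolic shape variables."""
    context_window = model_config["context_window_size"]
    prefill_chunk = model_config.get("prefill_chunk_size", context_window)
    max_batch = model_config.get("max_batch_size", 1)
    return {
        "seq_len": prefill_chunk,         # longest single prefill chunk
        "batch_size": max_batch,          # largest concurrent batch
        "total_seq_len": context_window,  # maximum tokens tracked per sequence
    }

bounds = compute_variable_bounds(
    {"context_window_size": 4096, "prefill_chunk_size": 1024, "max_batch_size": 8}
)
```

Bounding the symbolic shapes this way lets the compiler allocate static workspaces and capture CUDA graphs for the worst case.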
Usage
Use this function as the fourth step of the MLC-LLM compilation pipeline, after weight conversion. It produces the model library that is loaded by MLCEngine at inference time. It is also invoked internally by the JIT compilation path when a pre-compiled library is not found.
Code Reference
Source Location
- Repository: MLC-LLM
- File:
python/mlc_llm/interface/compile.py (lines 217-254)
Signature
def compile(
    config: Dict[str, Any],
    quantization: Quantization,
    model_type: Model,
    target: Target,
    opt: OptimizationFlags,
    build_func: Callable[[IRModule, CompileArgs, Pass], None],
    system_lib_prefix: str,
    output: Path,
    overrides: ModelConfigOverride,
    debug_dump: Optional[Path] = None,
):
    """Compile a model given its configuration and quantization format to a specific target."""
Import
from mlc_llm.interface.compile import compile
from mlc_llm.interface.compiler_flags import OptimizationFlags, ModelConfigOverride
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Dict[str, Any] | Yes | Model configuration dictionary, typically loaded from mlc-chat-config.json. May contain a nested "model_config" key or be a flat dictionary of model architecture parameters. |
| quantization | Quantization | Yes | The quantization scheme object specifying the quantization algorithm, bit width, and other parameters. Must match the quantization used during weight conversion. |
| model_type | Model | Yes | The MLC model descriptor providing the model class, quantization methods, and configuration parser. Obtained from the MLC model registry. |
| target | Target | Yes | The TVM compilation target specifying the hardware backend (e.g., Target("cuda"), Target("vulkan"), Target("metal"), Target("llvm")). |
| opt | OptimizationFlags | Yes | Optimization flags controlling which acceleration libraries and techniques to enable. Includes flags for FlashInfer, cuBLAS GEMM, FasterTransformer, CUDA graphs, CUTLASS, and the IPC all-reduce strategy. Preset levels O0-O3 are available. |
| build_func | Callable[[IRModule, CompileArgs, Pass], None] | Yes | The build function that takes the optimized IR module, compile arguments, and optimization pipeline pass, and produces the binary output file. Typically tvm.relax.build or a wrapper thereof. |
| system_lib_prefix | str | Yes | A prefix string for the system library name, used when building system libraries for static linking (relevant for mobile and WebAssembly targets). |
| output | Path | Yes | Path to the output file for the compiled model library (e.g., model.so, model.tar). |
| overrides | ModelConfigOverride | Yes | Runtime overrides for model configuration fields such as context_window_size, prefill_chunk_size, tensor_parallel_shards, and max_batch_size. Use a default-constructed ModelConfigOverride() for no overrides. |
| debug_dump | Optional[Path] | No (default: None) | Optional path to a directory for dumping intermediate IR representations for debugging. |
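The config input accepts either a flat dictionary or one that nests the architecture parameters under a "model_config" key. A minimal sketch of that normalization (the helper name extract_model_config is an assumption, not the real API):

```python
from typing import Any, Dict

def extract_model_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Illustrative only: accept a flat dict or one nesting 'model_config'."""
    return config.get("model_config", config)

nested = {"model_type": "llama", "model_config": {"hidden_size": 4096}}
flat = {"hidden_size": 4096}
```

Both shapes resolve to the same architecture parameters, so callers can pass mlc-chat-config.json contents directly.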
Outputs
| Name | Type | Description |
|---|---|---|
| return value | None | The function returns nothing. Side effects include writing the compiled model library to the output path and logging compilation progress, metadata registration, and memory usage estimates. |
Exceptions
| Exception | Condition |
|---|---|
| NotImplementedError | Raised when ft-quant quantization is requested with tensor parallelism, or when KN layout quantization (q3f16_0, q4f16_0) is requested with tensor parallelism. |
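The compatibility restriction above can be sketched as a simple guard. The function and variable names here are assumptions for illustration; the real check lives inside the compilation pipeline:

```python
def check_quantization_compat(quantization_name: str, tensor_parallel_shards: int) -> None:
    """Illustrative guard mirroring the documented restrictions."""
    kn_layout = {"q3f16_0", "q4f16_0"}  # KN-layout quantization schemes
    if tensor_parallel_shards > 1:
        if quantization_name == "ft-quant":
            raise NotImplementedError("ft-quant does not support tensor parallelism")
        if quantization_name in kn_layout:
            raise NotImplementedError(
                f"{quantization_name} (KN layout) does not support tensor parallelism"
            )

# q4f16_1 is fine with tensor parallelism; ft-quant is not
check_quantization_compat("q4f16_1", 4)
```

Validating the combination early fails the compile before any expensive IR transformation runs.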
Usage Examples
Basic Usage
import json
from pathlib import Path
from tvm.target import Target
from mlc_llm.interface.compile import compile
from mlc_llm.interface.compiler_flags import OptimizationFlags, ModelConfigOverride
from mlc_llm.model import Model
from mlc_llm.quantization import Quantization
# Load model configuration
with open("./Llama-2-7b-chat-q4f16_1-MLC/mlc-chat-config.json") as f:
    config = json.load(f)
# Compile for CUDA target with O2 optimization level
compile(
    config=config,
    quantization=Quantization.from_name("q4f16_1"),
    model_type=Model.from_name("llama"),
    target=Target("cuda"),
    opt=OptimizationFlags.from_str("O2"),
    build_func=tvm_build_func,  # provided by the MLC build system
    system_lib_prefix="",
    output=Path("./Llama-2-7b-chat-q4f16_1-cuda.so"),
    overrides=ModelConfigOverride(),  # no config overrides
)
Compilation with Tensor Parallelism
import json
from pathlib import Path
from tvm.target import Target
from mlc_llm.interface.compile import compile
from mlc_llm.interface.compiler_flags import OptimizationFlags, ModelConfigOverride
from mlc_llm.model import Model
from mlc_llm.quantization import Quantization

with open("./Llama-2-70b-chat-q4f16_1-MLC/mlc-chat-config.json") as f:
    config = json.load(f)

compile(
    config=config,
    quantization=Quantization.from_name("q4f16_1"),
    model_type=Model.from_name("llama"),
    target=Target("cuda"),
    opt=OptimizationFlags.from_str("O3"),
    build_func=tvm_build_func,  # provided by the MLC build system
    system_lib_prefix="",
    output=Path("./Llama-2-70b-chat-q4f16_1-cuda.so"),
    overrides=ModelConfigOverride(
        tensor_parallel_shards=4,  # shard weights across 4 GPUs
        max_batch_size=16,
    ),
)
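The override mechanism used in the examples above can be pictured as a field-wise merge: only explicitly-set override fields replace values from the loaded configuration. A minimal sketch with a hypothetical stand-in class, not the real ModelConfigOverride implementation:

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional

@dataclass
class ConfigOverride:
    """Hypothetical stand-in for ModelConfigOverride."""
    context_window_size: Optional[int] = None
    prefill_chunk_size: Optional[int] = None
    tensor_parallel_shards: Optional[int] = None
    max_batch_size: Optional[int] = None

def apply_overrides(model_config: Dict[str, Any], override: ConfigOverride) -> Dict[str, Any]:
    """Merge override fields into the config; None means 'keep the config value'."""
    merged = dict(model_config)
    for f in fields(override):
        value = getattr(override, f.name)
        if value is not None:
            merged[f.name] = value
    return merged

cfg = {"context_window_size": 4096, "tensor_parallel_shards": 1}
merged = apply_overrides(cfg, ConfigOverride(tensor_parallel_shards=4, max_batch_size=16))
```

Because unset fields default to None, a default-constructed override leaves the configuration untouched, which is why ModelConfigOverride() is the documented "no overrides" value.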