
Implementation:Mlc ai Mlc llm Compile

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Deployment, Compiler_Engineering
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by MLC-LLM, for compiling a neural network model into an optimized, platform-specific binary library using TVM's Relax compiler pipeline.

Description

The compile function is the entrypoint for MLC-LLM's model library compilation. It takes a model configuration dictionary, quantization scheme, target hardware specification, and optimization flags, then produces a compiled binary library (.so, .tar, or other format) suitable for the target platform. The compilation pipeline:

  1. Parses the model configuration dictionary and applies any user-specified overrides (context window size, tensor parallelism, etc.) via ModelConfigOverride.
  2. Creates the quantized model by applying the specified quantization scheme to the model architecture.
  3. Exports the model to TVM's Relax IR using model.export_tvm(), which produces an IRModule, named parameter list, and any external modules.
  4. Applies pre-processing annotations to parameters (shard strategies for tensor parallelism, pipeline stage assignments for pipeline parallelism).
  5. Computes variable bounds for symbolic shapes (sequence length, batch size, total sequence length) based on model configuration.
  6. Registers metadata in the compiled library, including model type, quantization name, context window parameters, parallelism settings, KV state kind (kv_cache, rnn_state, or none), and per-parameter preprocessing instructions.
  7. Runs the TVM Relax optimization pipeline (relax.get_pipeline("mlc_llm")) with target-specific optimizations including FlashInfer, cuBLAS GEMM, FasterTransformer, CUTLASS, IPC all-reduce, and CUDA graph capture.
  8. Invokes the build function to produce the final binary output.
  9. Reports estimated memory usage from the compiled metadata.
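The nine steps above can be sketched as a toy pipeline driver. This is an illustrative sketch only, not MLC-LLM's internal code: the real implementation operates on TVM IRModules, while here the intermediate stages are stubbed out and only the override/bounds logic (steps 1 and 5) is made concrete.

```python
from typing import Any, Dict, Optional, Tuple


def run_compilation_pipeline(
    config: Dict[str, Any], overrides: Dict[str, Optional[Any]]
) -> Tuple[list, Dict[str, int]]:
    """Toy driver returning a trace of stage names plus the symbolic-shape bounds."""
    trace = []

    # Step 1: overlay user-specified overrides (None means "keep the config value").
    model_config = {**config, **{k: v for k, v in overrides.items() if v is not None}}
    trace.append("apply_overrides")

    # Steps 2-4: quantize, export to Relax IR, annotate parameters (stubbed).
    trace += ["quantize", "export_tvm", "annotate_params"]

    # Step 5: compute variable bounds for symbolic shapes from the final config.
    bounds = {
        "seq_len": model_config["context_window_size"],
        "batch_size": model_config["max_batch_size"],
    }
    trace.append("compute_bounds")

    # Steps 6-9: register metadata, optimize, build, report (stubbed).
    trace += ["register_metadata", "optimize", "build"]
    return trace, bounds
```

Note that overrides are applied before the bounds are computed, so an overridden context window size is what ends up baked into the compiled library's shape bounds.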

Usage

Use this function as the fourth step of the MLC-LLM compilation pipeline, after weight conversion. It produces the model library that is loaded by MLCEngine at inference time. It is also invoked internally by the JIT compilation path when a pre-compiled library is not found.
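The JIT fallback mentioned above can be illustrated with a minimal sketch. `load_or_compile` and its `compile_model` callback are hypothetical names for illustration; the actual JIT logic lives elsewhere in MLC-LLM.

```python
from pathlib import Path
from typing import Callable


def load_or_compile(lib_path: Path, compile_model: Callable[..., None]) -> Path:
    """Return the path to a model library, compiling it on demand if missing."""
    if lib_path.exists():
        return lib_path  # pre-compiled library found: reuse it
    compile_model(output=lib_path)  # JIT path: compile now
    return lib_path
```

The second lookup for the same path is a cache hit and skips compilation entirely, which is why `MLCEngine` startup is fast once a library has been built for a given model, quantization, and target.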

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/interface/compile.py (lines 217-254)

Signature

def compile(
    config: Dict[str, Any],
    quantization: Quantization,
    model_type: Model,
    target: Target,
    opt: OptimizationFlags,
    build_func: Callable[[IRModule, CompileArgs, Pass], None],
    system_lib_prefix: str,
    output: Path,
    overrides: ModelConfigOverride,
    debug_dump: Optional[Path] = None,
):
    """Compile a model given its configuration and quantization format to a specific target."""

Import

from mlc_llm.interface.compile import compile
from mlc_llm.interface.compiler_flags import OptimizationFlags, ModelConfigOverride

I/O Contract

Inputs

Name Type Required Description
config Dict[str, Any] Yes Model configuration dictionary, typically loaded from mlc-chat-config.json. May contain a nested "model_config" key or be a flat dictionary of model architecture parameters.
quantization Quantization Yes The quantization scheme object specifying the quantization algorithm, bit width, and other parameters. Must match the quantization used during weight conversion.
model_type Model Yes The MLC model descriptor providing the model class, quantization methods, and configuration parser. Obtained from the MLC model registry.
target Target Yes The TVM compilation target specifying the hardware backend (e.g., Target("cuda"), Target("vulkan"), Target("metal"), Target("llvm")).
opt OptimizationFlags Yes Optimization flags controlling which acceleration libraries and techniques to enable. Includes flags for FlashInfer, cuBLAS GEMM, FasterTransformer, CUDA graphs, CUTLASS, and IPC all-reduce strategy. Preset levels O0-O3 are available.
build_func Callable[[IRModule, CompileArgs, Pass], None] Yes The build function that takes the optimized IR module, compile arguments, and optimization pipeline pass, and produces the binary output file. Typically tvm.relax.build or a wrapper thereof.
system_lib_prefix str Yes A prefix string for the system library name, used when building system libraries for static linking (relevant for mobile and WebAssembly targets).
output Path Yes Path to the output file for the compiled model library (e.g., model.so, model.tar).
overrides ModelConfigOverride Yes Runtime overrides for model configuration fields such as context_window_size, prefill_chunk_size, tensor_parallel_shards, and max_batch_size. Use default-constructed ModelConfigOverride() for no overrides.
debug_dump Optional[Path] No (default: None) Optional path to a directory for dumping intermediate IR representations for debugging purposes.
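The `config` row above notes that the dictionary may be flat or may nest the architecture parameters under a "model_config" key, and `overrides` only replaces fields that are explicitly set. A minimal sketch of both behaviors, using hypothetical helper names rather than MLC-LLM's internal functions:

```python
from typing import Any, Dict, Optional


def extract_model_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Accept either a flat config dict or one nesting params under "model_config"."""
    return config.get("model_config", config)


def apply_overrides(
    model_config: Dict[str, Any], **overrides: Optional[Any]
) -> Dict[str, Any]:
    """Overlay only non-None override values, mirroring ModelConfigOverride semantics."""
    return {**model_config, **{k: v for k, v in overrides.items() if v is not None}}
```

A default-constructed override object leaves every field as `None`, which is why passing `ModelConfigOverride()` changes nothing.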

Outputs

Name Type Description
return value None The function returns nothing. Side effects include writing the compiled model library to the output path and logging compilation progress, metadata registration, and memory usage estimates.

Exceptions

Exception Condition
NotImplementedError Raised when ft-quant quantization is requested with tensor parallelism, or when KN layout quantization (q3f16_0, q4f16_0) is requested with tensor parallelism.
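The compatibility rule in the table above amounts to a guard that rejects certain quantization schemes whenever more than one tensor-parallel shard is requested. A sketch of that check, with illustrative names (not MLC-LLM's internal API):

```python
# Schemes incompatible with tensor parallelism, per the exception table above:
# ft-quant, and the KN-layout quantizations q3f16_0 / q4f16_0.
UNSUPPORTED_WITH_TP = {"ft-quant", "q3f16_0", "q4f16_0"}


def check_quantization_compat(quant_name: str, tensor_parallel_shards: int) -> None:
    """Raise NotImplementedError for quantization schemes that cannot be sharded."""
    if tensor_parallel_shards > 1 and quant_name in UNSUPPORTED_WITH_TP:
        raise NotImplementedError(
            f"Quantization {quant_name!r} does not support tensor parallelism"
        )
```

With a single shard the check always passes, so these schemes remain usable on one device.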

Usage Examples

Basic Usage

import json
from pathlib import Path

from tvm.target import Target

from mlc_llm.interface.compile import compile
from mlc_llm.interface.compiler_flags import OptimizationFlags, ModelConfigOverride
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

# Load the model configuration produced by `mlc_llm gen_config`
with open("./Llama-2-7b-chat-q4f16_1-MLC/mlc-chat-config.json") as f:
    config = json.load(f)

# Compile for a CUDA target at the O2 optimization level
compile(
    config=config,
    quantization=QUANTIZATION["q4f16_1"],  # must match the scheme used for weight conversion
    model_type=MODELS["llama"],  # looked up in the MLC model registry
    target=Target("cuda"),
    opt=OptimizationFlags.from_str("O2"),
    build_func=tvm_build_func,  # build callable provided by the MLC build system
    system_lib_prefix="",
    output=Path("./Llama-2-7b-chat-q4f16_1-cuda.so"),
    overrides=ModelConfigOverride(),  # no overrides
)

Compilation with Tensor Parallelism

import json
from pathlib import Path

from tvm.target import Target

from mlc_llm.interface.compile import compile
from mlc_llm.interface.compiler_flags import OptimizationFlags, ModelConfigOverride
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

with open("./Llama-2-70b-chat-q4f16_1-MLC/mlc-chat-config.json") as f:
    config = json.load(f)

compile(
    config=config,
    quantization=QUANTIZATION["q4f16_1"],
    model_type=MODELS["llama"],
    target=Target("cuda"),
    opt=OptimizationFlags.from_str("O3"),
    build_func=tvm_build_func,  # build callable provided by the MLC build system
    system_lib_prefix="",
    output=Path("./Llama-2-70b-chat-q4f16_1-cuda.so"),
    overrides=ModelConfigOverride(
        tensor_parallel_shards=4,  # shard the model across 4 GPUs
        max_batch_size=16,
    ),
)

Related Pages

Implements Principle

Environment and Heuristic Links
