Implementation: mlc-ai/mlc-llm convert_weight
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Deployment, Model_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool, provided by MLC-LLM, for converting model weights from their training format into an optimized, quantized MLC inference format.
Description
The convert_weight function is MLC-LLM's weight conversion and quantization entrypoint. It reads source model weights in various formats (HuggingFace safetensors, PyTorch bin, GGUF, AWQ), applies the specified quantization scheme, validates each converted parameter against the expected model shape and dtype, and writes the quantized weights to a TVM tensor cache directory. The conversion pipeline:
- Loads the model configuration from the provided config path and creates the quantized model definition.
- Exports the model to TVM IR to determine the expected parameter names, shapes, and dtypes.
- Optionally applies pre-sharding for tensor-parallel deployments, controlled by the MLC_INTERNAL_PRESHARD_NUM environment variable.
- Uses a format-specific loader (registered in mlc_llm.loader.LOADER) to stream source weights, applying quantization transformations on the fly.
- Validates each parameter's shape and dtype against the expected values and raises errors on mismatch (sketched after this list).
- Writes parameters to disk using tvmjs.dump_tensor_cache with f32-to-bf16 encoding.
- Reports statistics including total parameter size after quantization (in GB), total parameter count, and bits per parameter.
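The validation step can be pictured with the following minimal sketch. It is illustrative only: the names loader and expected_params, and the exact error messages, are assumptions rather than convert_weight's actual internals.
def validate_streamed_params(loader, expected_params):
    """Sketch of the duplicate/shape/dtype checks described above."""
    seen = set()
    for name, param in loader:  # parameters arrive already quantized
        if name in seen:
            raise ValueError(f"Duplicate parameter: {name}")
        seen.add(name)
        expected = expected_params[name]  # from the exported TVM IR
        if tuple(param.shape) != tuple(expected.shape):
            raise ValueError(f"Shape mismatch for {name}")
        if param.dtype != expected.dtype:
            raise ValueError(f"Dtype mismatch for {name}")
    missing = expected_params.keys() - seen
    if missing:
        raise ValueError(f"Missing parameters: {sorted(missing)}")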
Usage
Use this function as the third step of the MLC-LLM compilation pipeline, after generating the deployment configuration with gen_config. It is required whenever you need to produce quantized weight files for a new model or a new quantization setting.
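In practice, the model and quantization arguments come from MLC-LLM's registry dictionaries, as sketched below; the name and kind fields printed at the end are how the registry entries are labeled, so treat the exact attributes as an assumption if your version differs.
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

# Look up the descriptor objects that convert_weight expects.
model = MODELS["llama"]
quantization = QUANTIZATION["q4f16_1"]
print(quantization.name, quantization.kind)  # e.g., "q4f16_1", "group-quant"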
Code Reference
Source Location
- Repository: MLC-LLM
- File: python/mlc_llm/interface/convert_weight.py (lines 169-181)
Signature
def convert_weight(
config: Path,
quantization: Quantization,
model: Model,
device: Device,
source: Path,
source_format: str,
output: Path,
):
"""MLC LLM's weight conversation and quantization flow."""
Import
from mlc_llm.interface.convert_weight import convert_weight
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Path | Yes | Path to the model's config.json file. Used to load the model architecture definition that determines expected parameter shapes and dtypes. |
| quantization | Quantization | Yes | The quantization scheme object (e.g., q4f16_1, q3f16_0, q0f16). Determines the quantization algorithm, bit width, and group size applied to eligible parameters. |
| model | Model | Yes | The MLC model descriptor object that provides the model class, quantization methods, and source format mappings. Obtained from the MLC model registry. |
| device | Device | Yes | The TVM device used for quantization computation (e.g., cpu(), cuda(0)). Quantization kernels run on this device before results are copied back to CPU for storage. |
| source | Path | Yes | Path to the directory containing source model weights in the format specified by source_format. |
| source_format | str | Yes | The format of the source weights. Supported values include "huggingface-torch", "huggingface-safetensor", "gguf", and "awq". |
| output | Path | Yes | Path to the output directory where the quantized weight files (TVM tensor cache) will be written. The directory will contain binary shard files and a tensor-cache.json manifest. |
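For the device input, a common pattern is to probe for a GPU and fall back to the CPU; Device.exist is TVM's standard presence check.
from tvm.runtime import cpu, cuda

# Quantization kernels run on this device (see the table above).
device = cuda(0) if cuda(0).exist else cpu()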
Outputs
| Name | Type | Description |
|---|---|---|
| return value | None | The function returns nothing. Side effects include writing the quantized weight tensor cache to the output directory and logging conversion statistics (parameter size in GB, total parameter count, bits per parameter). |
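Because the function returns None, success is typically confirmed by inspecting the output directory. A small sketch, following the artifact naming described above:
from pathlib import Path

output = Path("./Llama-2-7b-chat-q4f16_1-MLC/")
for artifact in sorted(output.iterdir()):
    # Expect binary shard files plus the tensor-cache JSON manifest.
    print(artifact.name, artifact.stat().st_size)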
Exceptions
| Exception | Condition |
|---|---|
| ValueError | Raised when a converted parameter's shape or dtype does not match the model's expected values, when a duplicate parameter is encountered, or when a required parameter is missing from the source weights. |
| NotImplementedError | Raised when ft-quant quantization is requested with tensor parallelism (tensor_parallel_shards > 1), which is not yet supported. |
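A small wrapper that maps these failure modes to messages can make batch conversions easier to debug; convert_or_report is a hypothetical helper, not part of MLC-LLM.
from mlc_llm.interface.convert_weight import convert_weight

def convert_or_report(**kwargs):
    """Run convert_weight, reporting the failure modes from the table above."""
    try:
        convert_weight(**kwargs)
    except ValueError as err:
        # Shape/dtype mismatch, duplicate parameter, or missing parameter.
        print(f"Conversion failed: {err}")
    except NotImplementedError as err:
        # e.g., ft-quant combined with tensor_parallel_shards > 1.
        print(f"Unsupported configuration: {err}")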
Usage Examples
Basic Usage
from pathlib import Path
from tvm.runtime import cpu as cpu_device
from mlc_llm.interface.convert_weight import convert_weight
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

# Convert HuggingFace model weights to MLC format with 4-bit (q4f16_1)
# quantization. MODELS and QUANTIZATION are the registry dictionaries
# that map names to descriptor objects.
convert_weight(
    config=Path("./Llama-2-7b-chat-hf/config.json"),
    quantization=QUANTIZATION["q4f16_1"],
    model=MODELS["llama"],
    device=cpu_device(),
    source=Path("./Llama-2-7b-chat-hf/"),
    source_format="huggingface-safetensor",
    output=Path("./Llama-2-7b-chat-q4f16_1-MLC/"),
)

# Output logs:
# Parameter size after quantization: 3.573 GB
# Total parameters: 6,738,415,616
# Bits per parameter: 4.240
GPU-Accelerated Quantization
from pathlib import Path
from tvm.runtime import cuda as cuda_device
from mlc_llm.interface.convert_weight import convert_weight
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

# Use a GPU for faster quantization computation; quantized results are
# copied back to the CPU before being written to disk.
convert_weight(
    config=Path("./Llama-2-7b-chat-hf/config.json"),
    quantization=QUANTIZATION["q4f16_1"],
    model=MODELS["llama"],
    device=cuda_device(0),
    source=Path("./Llama-2-7b-chat-hf/"),
    source_format="huggingface-safetensor",
    output=Path("./Llama-2-7b-chat-q4f16_1-MLC/"),
)
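Pre-Sharded Conversion for Tensor Parallelism
As noted in the Description, pre-sharding is controlled by the MLC_INTERNAL_PRESHARD_NUM environment variable. A hedged sketch: the shard count of 2 is illustrative, and the variable must be set before the conversion runs.
import os

# Ask the pipeline to pre-shard weights for 2-way tensor parallelism.
os.environ["MLC_INTERNAL_PRESHARD_NUM"] = "2"
# ...then call convert_weight exactly as in the Basic Usage example.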