Implementation: mlc-ai/mlc-llm convert_weight
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Deployment, Model_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool, provided by MLC-LLM, for converting model weights from their training format into an optimized, quantized MLC inference format.
Description
The convert_weight function is MLC-LLM's weight conversion and quantization entrypoint. It reads source model weights in various formats (HuggingFace safetensors, PyTorch bin, GGUF, AWQ), applies the specified quantization scheme, validates each converted parameter against the expected model shape and dtype, and writes the quantized weights to a TVM tensor cache directory. The conversion pipeline:
- Loads the model configuration from the provided config path and creates the quantized model definition.
- Exports the model to TVM IR to determine the expected parameter names, shapes, and dtypes.
- Optionally applies pre-sharding for tensor-parallel deployments, controlled by the MLC_INTERNAL_PRESHARD_NUM environment variable.
- Uses a format-specific loader (registered in mlc_llm.loader.LOADER) to stream source weights, applying quantization transformations on the fly.
- Validates each parameter's shape and dtype against the expected values and raises errors on mismatch (sketched after this list).
- Writes parameters to disk using tvmjs.dump_tensor_cache with f32-to-bf16 encoding.
- Reports statistics including total parameter size after quantization (in GB), total parameter count, and bits per parameter.
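The validation step can be pictured with the following minimal sketch. It is illustrative only: the names loader and expected_params, and the exact error messages, are assumptions rather than convert_weight's actual internals.
def validate_streamed_params(loader, expected_params):
    """Sketch of the duplicate/shape/dtype checks described above."""
    seen = set()
    for name, param in loader:  # parameters arrive already quantized
        if name in seen:
            raise ValueError(f"Duplicate parameter: {name}")
        seen.add(name)
        expected = expected_params[name]  # from the exported TVM IR
        if tuple(param.shape) != tuple(expected.shape):
            raise ValueError(f"Shape mismatch for {name}")
        if param.dtype != expected.dtype:
            raise ValueError(f"Dtype mismatch for {name}")
    missing = expected_params.keys() - seen
    if missing:
        raise ValueError(f"Missing parameters: {sorted(missing)}")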
Usage
Use this function as the third step of the MLC-LLM compilation pipeline, after generating the deployment configuration with gen_config. It is required whenever you need to produce quantized weight files for a new model or a new quantization setting.
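In practice, the model and quantization arguments come from MLC-LLM's registry dictionaries, as sketched below; the name and kind fields printed at the end are how the registry entries are labeled, so treat the exact attributes as an assumption if your version differs.
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

# Look up the descriptor objects that convert_weight expects.
model = MODELS["llama"]
quantization = QUANTIZATION["q4f16_1"]
print(quantization.name, quantization.kind)  # e.g., "q4f16_1", "group-quant"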
Code Reference
Source Location
- Repository: MLC-LLM
- File: python/mlc_llm/interface/convert_weight.py (lines 169-181)
Signature
def convert_weight(
config: Path,
quantization: Quantization,
model: Model,
device: Device,
source: Path,
source_format: str,
output: Path,
):
"""MLC LLM's weight conversation and quantization flow."""
Import
from mlc_llm.interface.convert_weight import convert_weight
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Path | Yes | Path to the model's config.json file. Used to load the model architecture definition that determines expected parameter shapes and dtypes. |
| quantization | Quantization | Yes | The quantization scheme object (e.g., q4f16_1, q3f16_0, q0f16). Determines the quantization algorithm, bit width, and group size applied to eligible parameters. |
| model | Model | Yes | The MLC model descriptor object that provides the model class, quantization methods, and source format mappings. Obtained from the MLC model registry. |
| device | Device | Yes | The TVM device used for quantization computation (e.g., cpu(), cuda(0)). Quantization kernels run on this device before results are copied back to CPU for storage. |
| source | Path | Yes | Path to the directory containing source model weights in the format specified by source_format. |
| source_format | str | Yes | The format of the source weights. Supported values include "huggingface-torch", "huggingface-safetensor", "gguf", and "awq". |
| output | Path | Yes | Path to the output directory where the quantized weight files (TVM tensor cache) will be written. The directory will contain binary shard files and a tensor-cache.json manifest. |
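For the device input, a common pattern is to probe for a GPU and fall back to the CPU; Device.exist is TVM's standard presence check.
from tvm.runtime import cpu, cuda

# Quantization kernels run on this device (see the table above).
device = cuda(0) if cuda(0).exist else cpu()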
Outputs
| Name | Type | Description |
|---|---|---|
| return value | None | The function returns nothing. Side effects include writing the quantized weight tensor cache to the output directory and logging conversion statistics (parameter size in GB, total parameter count, bits per parameter). |
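Because the function returns None, success is typically confirmed by inspecting the output directory. A small sketch, following the artifact naming described above:
from pathlib import Path

output = Path("./Llama-2-7b-chat-q4f16_1-MLC/")
for artifact in sorted(output.iterdir()):
    # Expect binary shard files plus the tensor-cache JSON manifest.
    print(artifact.name, artifact.stat().st_size)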
Exceptions
| Exception | Condition |
|---|---|
| ValueError | Raised when a converted parameter's shape or dtype does not match the model's expected values, when a duplicate parameter is encountered, or when a required parameter is missing from the source weights. |
| NotImplementedError | Raised when ft-quant quantization is requested with tensor parallelism (tensor_parallel_shards > 1), which is not yet supported. |
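A small wrapper that maps these failure modes to messages can make batch conversions easier to debug; convert_or_report is a hypothetical helper, not part of MLC-LLM.
from mlc_llm.interface.convert_weight import convert_weight

def convert_or_report(**kwargs):
    """Run convert_weight, reporting the failure modes from the table above."""
    try:
        convert_weight(**kwargs)
    except ValueError as err:
        # Shape/dtype mismatch, duplicate parameter, or missing parameter.
        print(f"Conversion failed: {err}")
    except NotImplementedError as err:
        # e.g., ft-quant combined with tensor_parallel_shards > 1.
        print(f"Unsupported configuration: {err}")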
Usage Examples
Basic Usage
from pathlib import Path
from tvm.runtime import cpu as cpu_device
from mlc_llm.interface.convert_weight import convert_weight
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

# Convert HuggingFace model weights to MLC format with 4-bit (q4f16_1)
# quantization. MODELS and QUANTIZATION are the registry dictionaries
# that map names to descriptor objects.
convert_weight(
    config=Path("./Llama-2-7b-chat-hf/config.json"),
    quantization=QUANTIZATION["q4f16_1"],
    model=MODELS["llama"],
    device=cpu_device(),
    source=Path("./Llama-2-7b-chat-hf/"),
    source_format="huggingface-safetensor",
    output=Path("./Llama-2-7b-chat-q4f16_1-MLC/"),
)

# Output logs:
# Parameter size after quantization: 3.573 GB
# Total parameters: 6,738,415,616
# Bits per parameter: 4.240
GPU-Accelerated Quantization
from pathlib import Path
from tvm.runtime import cuda as cuda_device
from mlc_llm.interface.convert_weight import convert_weight
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

# Use a GPU for faster quantization computation; quantized results are
# copied back to the CPU before being written to disk.
convert_weight(
    config=Path("./Llama-2-7b-chat-hf/config.json"),
    quantization=QUANTIZATION["q4f16_1"],
    model=MODELS["llama"],
    device=cuda_device(0),
    source=Path("./Llama-2-7b-chat-hf/"),
    source_format="huggingface-safetensor",
    output=Path("./Llama-2-7b-chat-q4f16_1-MLC/"),
)
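Pre-Sharded Conversion for Tensor Parallelism
As noted in the Description, pre-sharding is controlled by the MLC_INTERNAL_PRESHARD_NUM environment variable. A hedged sketch: the shard count of 2 is illustrative, and the variable must be set before the conversion runs.
import os

# Ask the pipeline to pre-shard weights for 2-way tensor parallelism.
os.environ["MLC_INTERNAL_PRESHARD_NUM"] = "2"
# ...then call convert_weight exactly as in the Basic Usage example.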