
Implementation:Mlc ai Mlc llm Convert weight

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Deployment, Model_Optimization
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tool for converting model weights from training format to optimized MLC inference format with quantization, provided by MLC-LLM.

Description

The convert_weight function is MLC-LLM's weight conversion and quantization entrypoint. It reads source model weights in various formats (HuggingFace safetensors, PyTorch bin, GGUF, AWQ), applies the specified quantization scheme, validates each converted parameter against the expected model shape and dtype, and writes the quantized weights to a TVM tensor cache directory. The conversion pipeline:

  1. Loads the model configuration from the provided config path and creates the quantized model definition.
  2. Exports the model to TVM IR to determine the expected parameter names, shapes, and dtypes.
  3. Optionally applies pre-sharding for tensor-parallel deployments (controlled by the MLC_INTERNAL_PRESHARD_NUM environment variable).
  4. Uses a format-specific loader (registered in mlc_llm.loader.LOADER) to stream source weights, applying quantization transformations on the fly.
  5. Validates each parameter's shape and dtype against the expected values and raises errors on mismatch.
  6. Writes parameters to disk using tvmjs.dump_tensor_cache with f32-to-bf16 encoding.
  7. Reports statistics including total parameter size after quantization (in GB), total parameter count, and bits per parameter.
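Step 5's per-parameter validation can be sketched in plain Python. This is an illustrative reconstruction, not the actual MLC-LLM code: the parameter names, the `expected` mapping, and the `validate_param` helper are hypothetical, but the checks mirror the documented behavior (unexpected, duplicate, shape-mismatched, and dtype-mismatched parameters each raise `ValueError`).

```python
import numpy as np

# Hypothetical expected-parameter table, as step 2 would derive it from
# the exported TVM IR: name -> (shape, dtype). Names are illustrative.
expected = {
    "model.embed_tokens.weight": ((32000, 4096), "float16"),
    "model.norm.weight": ((4096,), "float16"),
}

def validate_param(name, array, expected, seen):
    """Reject unexpected, duplicate, or mis-shaped/mis-typed parameters."""
    if name not in expected:
        raise ValueError(f"Unexpected parameter: {name}")
    if name in seen:
        raise ValueError(f"Duplicate parameter: {name}")
    shape, dtype = expected[name]
    if tuple(array.shape) != shape:
        raise ValueError(f"{name}: shape {tuple(array.shape)} != {shape}")
    if str(array.dtype) != dtype:
        raise ValueError(f"{name}: dtype {array.dtype} != {dtype}")
    seen.add(name)

seen = set()
validate_param("model.norm.weight", np.zeros(4096, dtype="float16"),
               expected, seen)
# After the loader finishes streaming, any names never seen are missing
# parameters, which the real pipeline also reports as an error.
missing = expected.keys() - seen
```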

Usage

Use this function as the third step of the MLC-LLM compilation pipeline, after generating the deployment configuration with gen_config. It is required whenever you need to produce quantized weight files for a new model or a new quantization setting.
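Step 6 of the pipeline stores float32 tensors with f32-to-bf16 encoding, which halves their on-disk size: bfloat16 keeps float32's sign bit and 8 exponent bits but only the top 7 mantissa bits. A minimal NumPy sketch of that encoding (shown here with simple truncation; production encoders often round-to-nearest-even instead):

```python
import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Encode float32 as bfloat16 bit patterns by keeping the top 16 bits."""
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(b: np.ndarray) -> np.ndarray:
    """Decode bfloat16 bit patterns back to float32 (low 16 bits zeroed)."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, 3.14159, -2.5], dtype=np.float32)
roundtrip = bf16_bits_to_f32(f32_to_bf16_bits(x))
# Values exactly representable in 7 mantissa bits (1.0, -2.5) survive the
# round trip unchanged; others lose precision past ~2-3 decimal digits.
```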

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/interface/convert_weight.py (lines 169-181)

Signature

def convert_weight(
    config: Path,
    quantization: Quantization,
    model: Model,
    device: Device,
    source: Path,
    source_format: str,
    output: Path,
):
    """MLC LLM's weight conversation and quantization flow."""

Import

from mlc_llm.interface.convert_weight import convert_weight

I/O Contract

Inputs

Name Type Required Description
config Path Yes Path to the model's config.json file. Used to load the model architecture definition that determines expected parameter shapes and dtypes.
quantization Quantization Yes The quantization scheme object (e.g., q4f16_1, q3f16_0, q0f16). Determines the quantization algorithm, bit width, and group size applied to eligible parameters.
model Model Yes The MLC model descriptor object that provides the model class, quantization methods, and source format mappings. Obtained from the MLC model registry.
device Device Yes The TVM device to use during quantization computation (e.g., cpu(), cuda(0)). Quantization kernels run on this device before results are copied back to CPU for storage.
source Path Yes Path to the directory containing source model weights in the format specified by source_format.
source_format str Yes The format of the source weights. Supported values include "huggingface-torch", "huggingface-safetensor", "gguf", and "awq".
output Path Yes Path to the output directory where quantized weight files (TVM tensor cache) will be written. The directory will contain binary shard files and a tensor-cache.json manifest.
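The quantization scheme names listed above follow a `q<weight-bits>f<float-bits>[_<variant>]` convention: q4f16_1 pairs 4-bit weights with float16 compute (variant 1), while q0f16 leaves weights unquantized in float16. A hypothetical decoder, purely for illustration (mlc_llm resolves these names through its own quantization registry, not by string parsing):

```python
import re

def parse_quant_name(name: str) -> dict:
    """Decode scheme names like 'q4f16_1' or 'q0f16' (illustrative only)."""
    m = re.fullmatch(r"q(\d+)f(\d+)(?:_(\d+))?", name)
    if m is None:
        raise ValueError(f"Unrecognized quantization name: {name}")
    wbits, fbits, variant = m.groups()
    return {
        "weight_bits": int(wbits),          # 0 means weights stay unquantized
        "float_dtype": f"float{fbits}",     # compute/activation precision
        "variant": int(variant) if variant is not None else None,
    }

scheme = parse_quant_name("q4f16_1")
# -> 4-bit weights, float16 compute, variant 1
```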

Outputs

Name Type Description
return value None The function returns nothing. Side effects include writing the quantized weight tensor cache to the output directory and logging conversion statistics (parameter size in GB, total parameters, bits per parameter).
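The logged statistics are related by simple arithmetic: bits per parameter is total quantized storage in bits divided by parameter count. A quick sketch using the figures from the Basic Usage log further down, assuming the reported "GB" is decimal (10^9 bytes), which is what makes the numbers line up:

```python
def bits_per_parameter(total_bytes: int, total_params: int) -> float:
    """Average bits of storage per model parameter after quantization."""
    return total_bytes * 8 / total_params

# 3.573 GB and 6,738,415,616 parameters, per the example log
bpp = bits_per_parameter(int(3.573e9), 6_738_415_616)
# bpp comes out just above 4.0: q4f16_1 stores 4-bit weights plus
# per-group float16 scales, which account for the overhead beyond 4 bits.
```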

Exceptions

Exception Condition
ValueError Raised when a converted parameter's shape or dtype does not match the model's expected values, when a duplicate parameter is encountered, or when a required parameter is missing from the source weights.
NotImplementedError Raised when ft-quant quantization is requested with tensor parallelism (tensor_parallel_shards > 1), which is not yet supported.

Usage Examples

Basic Usage

from pathlib import Path
from tvm.runtime import cpu as cpu_device
from mlc_llm.interface.convert_weight import convert_weight
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

# Convert HuggingFace model weights to MLC format with INT4 quantization
convert_weight(
    config=Path("./Llama-2-7b-chat-hf/config.json"),
    quantization=QUANTIZATION["q4f16_1"],
    model=MODELS["llama"],
    device=cpu_device(),
    source=Path("./Llama-2-7b-chat-hf/"),
    source_format="huggingface-safetensor",
    output=Path("./Llama-2-7b-chat-q4f16_1-MLC/"),
)
# Output logs:
#   Parameter size after quantization: 3.573 GB
#   Total parameters: 6,738,415,616
#   Bits per parameter: 4.240

GPU-Accelerated Quantization

from pathlib import Path
from tvm.runtime import cuda as cuda_device
from mlc_llm.interface.convert_weight import convert_weight
from mlc_llm.model import MODELS
from mlc_llm.quantization import QUANTIZATION

# Use GPU for faster quantization computation
convert_weight(
    config=Path("./Llama-2-7b-chat-hf/config.json"),
    quantization=QUANTIZATION["q4f16_1"],
    model=MODELS["llama"],
    device=cuda_device(0),
    source=Path("./Llama-2-7b-chat-hf/"),
    source_format="huggingface-safetensor",
    output=Path("./Llama-2-7b-chat-q4f16_1-MLC/"),
)

Related Pages

Implements Principle

Environment Links
