Implementation:Mlc ai Mlc llm Qwen3 Loader

Overview

The Qwen3 Loader module defines parameter mapping logic for converting Qwen3 model weights from HuggingFace format into MLC LLM's internal representation. It is located at python/mlc_llm/model/qwen3/qwen3_loader.py (135 lines).

A key feature of this loader is its support for BlockScaleQuantize, which enables loading FP8 block-quantized Qwen3 models along with their associated scale inverse tensors. The loader also concatenates the separate Q/K/V attention projections and the gate/up MLP projections into their fused MLC equivalents.

Source File

File: python/mlc_llm/model/qwen3/qwen3_loader.py
Lines: 135
Module: mlc_llm.model.qwen3.qwen3_loader

Dependencies

Import	Purpose
`functools`	Used for `functools.partial` to bind dtype to transform lambdas
`typing.Callable`, `typing.List`	Type annotations for the helper function
`numpy`	Used for `np.concatenate` to fuse split weight tensors
`mlc_llm.loader.ExternMapping`	Mapping class storing parameter name translations
`mlc_llm.loader.QuantizeMapping`	Mapping used during block-scale quantization model conversion
`mlc_llm.quantization.BlockScaleQuantize`	Block-scale quantization support for FP8 models
`mlc_llm.quantization.Quantization`	Base quantization configuration
`.qwen3_model.Qwen3Config`, `Qwen3LMHeadModel`	Qwen3 model configuration and model class

Function: huggingface

def huggingface(model_config: Qwen3Config, quantization: Quantization) -> ExternMapping:

Returns an ExternMapping that maps MLC LLM parameter names to HuggingFace PyTorch parameter names for Qwen3 models.
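
To make the shape of such a mapping concrete, here is a minimal sketch of a name-mapping container and how a loader might apply it. `ToyMapping` and `apply_mapping` are hypothetical stand-ins written for illustration; they are not the real `ExternMapping` class and only mirror the `add_mapping(name, source_names, func)` call pattern used in the excerpts on this page.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

import numpy as np


@dataclass
class ToyMapping:
    """Hypothetical stand-in for a parameter mapping (not the real ExternMapping)."""

    param_map: Dict[str, List[str]] = field(default_factory=dict)
    map_func: Dict[str, Callable[..., np.ndarray]] = field(default_factory=dict)

    def add_mapping(
        self, mlc_name: str, hf_names: List[str], func: Callable[..., np.ndarray]
    ) -> None:
        # Record which source tensors feed this target and how to combine them.
        self.param_map[mlc_name] = hf_names
        self.map_func[mlc_name] = func


def apply_mapping(
    mapping: ToyMapping, hf_weights: Dict[str, np.ndarray]
) -> Dict[str, np.ndarray]:
    """Build each target tensor by calling its transform on the source tensors."""
    return {
        mlc_name: mapping.map_func[mlc_name](*[hf_weights[n] for n in hf_names])
        for mlc_name, hf_names in mapping.param_map.items()
    }


toy = ToyMapping()
toy.add_mapping(
    "model.norm.weight",
    ["model.norm.weight"],
    lambda x: x.astype("float16"),  # identity mapping with a dtype cast
)
converted = apply_mapping(toy, {"model.norm.weight": np.ones(4, dtype=np.float32)})
```

The target name is the key, and the transform receives the source tensors in the order they were listed.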

BlockScaleQuantize Support

The function includes special handling for FP8 block-quantized models:

if isinstance(quantization, BlockScaleQuantize):
    model = quantization.quantize_model(model, QuantizeMapping({}, {}), "")
    if model_config.weight_block_size is None:
        raise ValueError(
            "The input Qwen3 model is not fp8 block quantized. "
            "Thus BlockScaleQuantize is not supported."
        )

The loader validates consistency in both directions (the excerpt above shows only the first check):

If BlockScaleQuantize is used, the model config must have weight_block_size set (indicating an FP8 block-quantized source model).
If the model is FP8 block-quantized (has weight_block_size), then BlockScaleQuantize must be used.
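
The two conditions can be sketched as one standalone check. This is an illustration under stated assumptions: `check_block_scale_consistency` and its boolean argument are hypothetical names, and the second error message is invented for the sketch rather than taken from the loader.

```python
from typing import Optional, Sequence


def check_block_scale_consistency(
    uses_block_scale_quantize: bool,
    weight_block_size: Optional[Sequence[int]],
) -> None:
    """Raise ValueError when the quantization mode and the checkpoint disagree."""
    is_fp8_block_quantized = weight_block_size is not None
    if uses_block_scale_quantize and not is_fp8_block_quantized:
        raise ValueError(
            "The input model is not fp8 block quantized, "
            "so BlockScaleQuantize is not supported."
        )
    if is_fp8_block_quantized and not uses_block_scale_quantize:
        # Hypothetical message; the source page does not quote this one.
        raise ValueError(
            "The input model is fp8 block quantized, "
            "so BlockScaleQuantize must be used."
        )


check_block_scale_consistency(True, [128, 128])  # consistent: no error
check_block_scale_consistency(False, None)  # consistent: no error
```

Both mismatched combinations raise, so a checkpoint and a quantization mode either agree or fail early.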

Helper: add_weight_and_scale_mapping

This helper function adds mappings for both the weight tensor and, when using BlockScaleQuantize, its associated scale inverse tensor:

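# Nested helper: closes over `named_parameters`, `mapping`, and `quantization`
# from the enclosing huggingface() function.
def add_weight_and_scale_mapping(
    weight_mlc_name: str,
    weight_hf_names: List[str],
    weight_transform_func: Callable,
):
    mlc_param = named_parameters[weight_mlc_name]
    # Map the weight itself, binding the target dtype into the transform.
    mapping.add_mapping(
        weight_mlc_name,
        weight_hf_names,
        functools.partial(weight_transform_func, dtype=mlc_param.dtype),
    )
    if isinstance(quantization, BlockScaleQuantize):
        scale_mlc_name = f"{weight_mlc_name}_scale_inv"
        # Only weights that have a scale tensor in the exported model get a scale mapping.
        if scale_mlc_name in named_parameters:
            scale_hf_names = [f"{name}_scale_inv" for name in weight_hf_names]
            scale_param = named_parameters[scale_mlc_name]
            # Reuse the same transform, but cast to the scale tensor's dtype.
            mapping.add_mapping(
                scale_mlc_name,
                scale_hf_names,
                functools.partial(weight_transform_func, dtype=scale_param.dtype),
            )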

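The scale tensor naming follows a convention where _scale_inv is appended to the corresponding weight name, on both the MLC and HuggingFace sides. For example, model.layers.0.mlp.gate_proj.weight pairs with model.layers.0.mlp.gate_proj.weight_scale_inv. The scale mapping is added only if the scale name exists in named_parameters, so weights without a scale tensor are skipped.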
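
The naming rule is small enough to show in isolation. In this sketch, `scale_inv_names` is a hypothetical helper name; the list comprehension inside it is the same expression the loader uses.

```python
from typing import List


def scale_inv_names(weight_names: List[str]) -> List[str]:
    """Append the `_scale_inv` suffix to each weight name."""
    return [f"{name}_scale_inv" for name in weight_names]


hf_weights = [
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
]
hf_scales = scale_inv_names(hf_weights)
mlc_scale = scale_inv_names(["model.layers.0.mlp.gate_up_proj.weight"])[0]
```

Because the suffix follows the full name, including `.weight`, the scale names stay parallel to the weight names and keep the same order.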

Attention Weight Mapping

For each hidden layer, separate Q, K, and V projection weights are concatenated into a fused c_attn weight:

for i in range(model_config.num_hidden_layers):
    attn = f"model.layers.{i}.self_attn"
    add_weight_and_scale_mapping(
        f"{attn}.c_attn.weight",
        [
            f"{attn}.q_proj.weight",
            f"{attn}.k_proj.weight",
            f"{attn}.v_proj.weight",
        ],
        lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
    )

When model_config.attention_bias is True, the corresponding bias tensors are also concatenated.
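
The effect of the concatenation on tensor shapes can be checked with toy arrays. The sizes below are hypothetical and chosen only to keep the example small; `fuse_qkv` is an illustrative name for the same expression as the lambda above. HuggingFace linear weights are stored as (out_features, in_features), so stacking along axis 0 stacks output rows.

```python
import numpy as np


def fuse_qkv(q: np.ndarray, k: np.ndarray, v: np.ndarray, dtype: str) -> np.ndarray:
    """Stack Q, K, V along the output dimension and cast to the target dtype."""
    return np.concatenate([q, k, v], axis=0).astype(dtype)


# Toy sizes (assumptions for illustration, not real Qwen3 dimensions).
hidden_size, head_dim = 8, 4
num_q_heads, num_kv_heads = 4, 2

q = np.zeros((num_q_heads * head_dim, hidden_size), dtype=np.float32)
k = np.ones((num_kv_heads * head_dim, hidden_size), dtype=np.float32)
v = np.full((num_kv_heads * head_dim, hidden_size), 2.0, dtype=np.float32)

c_attn_weight = fuse_qkv(q, k, v, dtype="float16")

# Bias vectors are one-dimensional, so axis 0 is their only axis.
c_attn_bias = fuse_qkv(q[:, 0], k[:, 0], v[:, 0], dtype="float16")
```

With these sizes the fused weight has 16 + 8 + 8 = 32 rows, with the Q rows first, then K, then V.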

MLP Weight Mapping

Within the same per-layer loop (the excerpt reuses the loop index i), gate and up projection weights are fused into gate_up_proj:

mlp = f"model.layers.{i}.mlp"
add_weight_and_scale_mapping(
    f"{mlp}.gate_up_proj.weight",
    [
        f"{mlp}.gate_proj.weight",
        f"{mlp}.up_proj.weight",
    ],
    lambda gate, up, dtype: np.concatenate([gate, up], axis=0).astype(dtype),
)
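
One way to sanity-check the fused layout is to split it back into its halves. This sketch uses hypothetical toy sizes; because gate and up have the same shape, an even split along axis 0 recovers both.

```python
import numpy as np

# Toy sizes (assumptions for illustration only).
intermediate_size, hidden_size = 6, 4

gate = np.arange(intermediate_size * hidden_size, dtype=np.float32).reshape(
    intermediate_size, hidden_size
)
up = -gate

# Same expression as the transform above: gate rows first, then up rows.
gate_up = np.concatenate([gate, up], axis=0).astype("float16")

# Splitting the fused tensor in half along axis 0 returns the two projections.
gate_back, up_back = np.split(gate_up, 2, axis=0)
```

The row order matters: code that consumes the fused tensor has to assume the gate half comes first.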

Identity Mapping for Remaining Parameters

Any parameters not explicitly mapped (such as layer norms, the output projection, and embedding weights) are identity-mapped with a dtype cast:

for mlc_name, mlc_param in named_parameters.items():
    if mlc_name not in mapping.param_map:
        mapping.add_mapping(
            mlc_name,
            [mlc_name],
            functools.partial(
                lambda x, dtype: x.astype(dtype),
                dtype=mlc_param.dtype,
            ),
        )
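
A likely reason for binding dtype with functools.partial rather than referring to the loop variable inside the lambda is Python's late-binding closure behavior. The sketch below contrasts the two approaches with hypothetical parameter names and dtypes.

```python
import functools

import numpy as np

# Hypothetical parameters with different target dtypes.
param_dtypes = {"norm.weight": "float32", "embed.weight": "float16"}

# Late binding: each lambda looks up `dtype` when called, after the loop ended,
# so every transform ends up using the last dtype in the dictionary.
late_bound = {}
for name, dtype in param_dtypes.items():
    late_bound[name] = lambda x: x.astype(dtype)

# functools.partial stores the dtype value at the time the mapping is created.
bound = {}
for name, dtype in param_dtypes.items():
    bound[name] = functools.partial(lambda x, dtype: x.astype(dtype), dtype=dtype)

x = np.ones(2, dtype=np.float64)
wrong_dtype = late_bound["norm.weight"](x).dtype  # float16, not the intended float32
right_dtype = bound["norm.weight"](x).dtype  # float32
```

Since the mapping functions are created in a loop and invoked later during weight loading, freezing the dtype per parameter keeps each cast correct.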

Parameter Mapping Summary

MLC LLM Name	HuggingFace Name(s)	Transform
`model.layers.{i}.self_attn.c_attn.weight`	`q_proj.weight`, `k_proj.weight`, `v_proj.weight`	concatenate + dtype cast
`model.layers.{i}.self_attn.c_attn.bias`	`q_proj.bias`, `k_proj.bias`, `v_proj.bias`	concatenate + dtype cast (if attention_bias)
`model.layers.{i}.mlp.gate_up_proj.weight`	`gate_proj.weight`, `up_proj.weight`	concatenate + dtype cast
All `*_scale_inv` tensors	Corresponding `*_scale_inv` HF names	same transform as base weight (BlockScaleQuantize only)
All other parameters	Same name	dtype cast (identity mapping)
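
The `*_scale_inv` row reuses the base weight's transform, which works when the per-block scales line up with the concatenated weight rows. The sketch below checks this with toy data under two assumptions that are not stated in the source: the scale inverse multiplies the stored weight to dequantize it, and every projection's row count is a multiple of the block size. `BLOCK` and `dequantize` are hypothetical names.

```python
import numpy as np

BLOCK = 2  # toy block size; an assumption for illustration only


def dequantize(weight: np.ndarray, scale_inv: np.ndarray) -> np.ndarray:
    """Expand each per-block scale over its BLOCK x BLOCK tile and multiply."""
    expanded = np.repeat(np.repeat(scale_inv, BLOCK, axis=0), BLOCK, axis=1)
    return weight * expanded


rng = np.random.default_rng(0)
gate_w = rng.standard_normal((4, 4)).astype(np.float32)
up_w = rng.standard_normal((6, 4)).astype(np.float32)
gate_s = rng.random((2, 2)).astype(np.float32)  # one scale per 2x2 block
up_s = rng.random((3, 2)).astype(np.float32)

# Apply the same axis-0 concatenation to weights and to scales.
fused_w = np.concatenate([gate_w, up_w], axis=0)
fused_s = np.concatenate([gate_s, up_s], axis=0)

fused_then_dequant = dequantize(fused_w, fused_s)
dequant_then_fused = np.concatenate(
    [dequantize(gate_w, gate_s), dequantize(up_w, up_s)], axis=0
)
same = bool(np.array_equal(fused_then_dequant, dequant_then_fused))
```

If a projection's row count were not a multiple of the block size, the concatenated scales would no longer align with the concatenated rows, so this property depends on the model's dimensions.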

Design Notes

The loader derives parameter names and dtypes from the actual model export via model.export_tvm(), ensuring consistency between the mapping and the model definition.
The allow_extern=True flag is passed to export_tvm, enabling external function calls in the TVM IR.
The FP8 block-scale quantization pathway converts the model before export, so the named parameters include scale tensors that need dedicated mappings.
