Implementation:Mlc ai Mlc llm Qwen3 Loader
Overview
The Qwen3 Loader module defines parameter mapping logic for converting Qwen3 model weights from HuggingFace format into MLC LLM's internal representation. It is located at python/mlc_llm/model/qwen3/qwen3_loader.py (135 lines).
A key feature of this loader is its support for BlockScaleQuantize, which enables loading FP8 block-quantized Qwen3 models along with their associated scale inverse tensors. The loader handles the concatenation of separate Q/K/V attention projections and gate/up MLP projections into their fused MLC equivalents.
Source File
- File: python/mlc_llm/model/qwen3/qwen3_loader.py
- Lines: 135
- Module: mlc_llm.model.qwen3.qwen3_loader
Dependencies
| Import | Purpose |
|---|---|
| functools | Used for functools.partial to bind dtype to transform lambdas |
| typing.Callable, typing.List | Type annotations for the helper function |
| numpy | Used for np.concatenate to fuse split weight tensors |
| mlc_llm.loader.ExternMapping | Mapping class storing parameter name translations |
| mlc_llm.loader.QuantizeMapping | Mapping used during block-scale quantization model conversion |
| mlc_llm.quantization.BlockScaleQuantize | Block-scale quantization support for FP8 models |
| mlc_llm.quantization.Quantization | Base quantization configuration |
| .qwen3_model.Qwen3Config, Qwen3LMHeadModel | Qwen3 model configuration and model class |
Function: huggingface
def huggingface(model_config: Qwen3Config, quantization: Quantization) -> ExternMapping:
Returns an ExternMapping that maps MLC LLM parameter names to HuggingFace PyTorch parameter names for Qwen3 models.
BlockScaleQuantize Support
The function includes special handling for FP8 block-quantized models:
if isinstance(quantization, BlockScaleQuantize):
    model = quantization.quantize_model(model, QuantizeMapping({}, {}), "")
    if model_config.weight_block_size is None:
        raise ValueError(
            "The input Qwen3 model is not fp8 block quantized. "
            "Thus BlockScaleQuantize is not supported."
        )
The loader validates that:
- If BlockScaleQuantize is used, the model config must have weight_block_size set (indicating an FP8 block-quantized source model).
- If the model is FP8 block-quantized (has weight_block_size), then BlockScaleQuantize must be used.
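The two rules above amount to a symmetric consistency check. The following is a minimal sketch of that check, not the loader's actual code: ToyConfig, ToyBlockScaleQuantize, and check_fp8_consistency are hypothetical stand-ins introduced here so the example runs without MLC LLM installed, and the error messages are illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyConfig:
    """Hypothetical stand-in for Qwen3Config; only the field used here."""

    weight_block_size: Optional[Tuple[int, int]] = None


class ToyBlockScaleQuantize:
    """Hypothetical stand-in for mlc_llm.quantization.BlockScaleQuantize."""


def check_fp8_consistency(config: ToyConfig, quantization: object) -> None:
    """Raise ValueError unless the quantization mode and checkpoint agree."""
    uses_block_scale = isinstance(quantization, ToyBlockScaleQuantize)
    is_fp8_checkpoint = config.weight_block_size is not None
    if uses_block_scale and not is_fp8_checkpoint:
        raise ValueError("BlockScaleQuantize requires an FP8 block-quantized model.")
    if is_fp8_checkpoint and not uses_block_scale:
        raise ValueError("An FP8 block-quantized model requires BlockScaleQuantize.")


# Both of these combinations are consistent and pass silently.
check_fp8_consistency(ToyConfig(weight_block_size=(128, 128)), ToyBlockScaleQuantize())
check_fp8_consistency(ToyConfig(), None)
```

Either mismatched combination (block-scale quantization with a non-FP8 checkpoint, or an FP8 checkpoint with any other quantization) raises ValueError in this sketch.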
Helper: add_weight_and_scale_mapping
This helper function adds mappings for both the weight tensor and, when using BlockScaleQuantize, its associated scale inverse tensor:
def add_weight_and_scale_mapping(
    weight_mlc_name: str,
    weight_hf_names: List[str],
    weight_transform_func: Callable,
):
    mlc_param = named_parameters[weight_mlc_name]
    mapping.add_mapping(
        weight_mlc_name,
        weight_hf_names,
        functools.partial(weight_transform_func, dtype=mlc_param.dtype),
    )
    if isinstance(quantization, BlockScaleQuantize):
        scale_mlc_name = f"{weight_mlc_name}_scale_inv"
        if scale_mlc_name in named_parameters:
            scale_hf_names = [f"{name}_scale_inv" for name in weight_hf_names]
            scale_param = named_parameters[scale_mlc_name]
            mapping.add_mapping(
                scale_mlc_name,
                scale_hf_names,
                functools.partial(weight_transform_func, dtype=scale_param.dtype),
            )
The scale tensor naming follows a convention where _scale_inv is appended to the corresponding weight name.
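To make the naming convention concrete, the sketch below derives the scale tensor names for one fused attention weight. The helper scale_names and the constant SCALE_SUFFIX are hypothetical names introduced for illustration; only the _scale_inv suffix and the parameter names come from the loader code shown above.

```python
from typing import List, Tuple

SCALE_SUFFIX = "_scale_inv"  # suffix used by the loader for scale tensors


def scale_names(weight_mlc_name: str, weight_hf_names: List[str]) -> Tuple[str, List[str]]:
    """Return the MLC scale name and the HuggingFace scale names for a weight."""
    return (
        weight_mlc_name + SCALE_SUFFIX,
        [name + SCALE_SUFFIX for name in weight_hf_names],
    )


attn = "model.layers.0.self_attn"
mlc_scale, hf_scales = scale_names(
    f"{attn}.c_attn.weight",
    [f"{attn}.q_proj.weight", f"{attn}.k_proj.weight", f"{attn}.v_proj.weight"],
)
print(mlc_scale)     # model.layers.0.self_attn.c_attn.weight_scale_inv
print(hf_scales[0])  # model.layers.0.self_attn.q_proj.weight_scale_inv
```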
Attention Weight Mapping
For each hidden layer, separate Q, K, and V projection weights are concatenated into a fused c_attn weight:
for i in range(model_config.num_hidden_layers):
    attn = f"model.layers.{i}.self_attn"
    add_weight_and_scale_mapping(
        f"{attn}.c_attn.weight",
        [
            f"{attn}.q_proj.weight",
            f"{attn}.k_proj.weight",
            f"{attn}.v_proj.weight",
        ],
        lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
    )
When model_config.attention_bias is True, the corresponding bias tensors are also concatenated.
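The effect of the fusion transform can be seen on toy arrays. This is a small sketch with made-up sizes (the shapes and fill values are assumptions chosen for readability, not Qwen3 dimensions); it uses the same np.concatenate call along axis 0 and binds dtype with functools.partial as the loader does.

```python
import functools

import numpy as np

# Toy sizes, chosen only for illustration: K/V have fewer output rows than Q,
# as in grouped-query attention.
hidden, q_out, kv_out = 8, 8, 4
q = np.full((q_out, hidden), 1, dtype=np.float32)
k = np.full((kv_out, hidden), 2, dtype=np.float32)
v = np.full((kv_out, hidden), 3, dtype=np.float32)


def fuse_qkv(q, k, v, dtype):
    """Stack Q, K, V along the output (row) axis and cast to the target dtype."""
    return np.concatenate([q, k, v], axis=0).astype(dtype)


# Bind the target dtype up front, leaving a function of (q, k, v) only.
fuse = functools.partial(fuse_qkv, dtype="float16")
fused = fuse(q, k, v)

print(fused.shape)  # (16, 8): rows 0-7 are Q, 8-11 are K, 12-15 are V
print(fused.dtype)  # float16
```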
MLP Weight Mapping
Gate and up projection weights are fused into gate_up_proj:
mlp = f"model.layers.{i}.mlp"
add_weight_and_scale_mapping(
    f"{mlp}.gate_up_proj.weight",
    [
        f"{mlp}.gate_proj.weight",
        f"{mlp}.up_proj.weight",
    ],
    lambda gate, up, dtype: np.concatenate([gate, up], axis=0).astype(dtype),
)
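Because the helper reuses the same transform for the scale tensors, the per-block scales of gate_proj and up_proj are stacked in the same order as the weights. The sketch below checks on toy data that this keeps each weight block paired with its own scale. It rests on two assumptions that are not stated in the source: dequantization multiplies each weight block by its scale_inv entry, and each projection's row count is a multiple of the block size. The block size of 2 and the dequantize helper are hypothetical.

```python
import numpy as np

BLOCK = 2  # toy block size (assumption; real checkpoints use larger blocks)


def dequantize(weight, scale_inv, block=BLOCK):
    """Assumed convention: multiply each block x block tile by its scale entry."""
    expanded = np.repeat(np.repeat(scale_inv, block, axis=0), block, axis=1)
    return weight * expanded


rng = np.random.default_rng(0)
gate_w = rng.standard_normal((4, 4)).astype(np.float32)
up_w = rng.standard_normal((4, 4)).astype(np.float32)
gate_s = rng.uniform(0.5, 2.0, (2, 2)).astype(np.float32)
up_s = rng.uniform(0.5, 2.0, (2, 2)).astype(np.float32)

# Same transform for weights and scales: concatenate along axis 0.
fused_w = np.concatenate([gate_w, up_w], axis=0)  # shape (8, 4)
fused_s = np.concatenate([gate_s, up_s], axis=0)  # shape (4, 2)

# Dequantizing the fused tensors matches dequantizing each part separately.
separate = np.concatenate(
    [dequantize(gate_w, gate_s), dequantize(up_w, up_s)], axis=0
)
consistent = np.allclose(dequantize(fused_w, fused_s), separate)
print(consistent)  # True
```

If a projection's row count were not a multiple of the block size, a block would straddle the boundary between the two parts and this simple stacking would no longer line up.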
Identity Mapping for Remaining Parameters
Any parameters not explicitly mapped (such as layer norms, the output projection, and embedding weights) are identity-mapped with a dtype cast:
for mlc_name, mlc_param in named_parameters.items():
    if mlc_name not in mapping.param_map:
        mapping.add_mapping(
            mlc_name,
            [mlc_name],
            functools.partial(
                lambda x, dtype: x.astype(dtype),
                dtype=mlc_param.dtype,
            ),
        )
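The loop above binds dtype with functools.partial rather than reading the loop variable inside the lambda. As general Python behavior (not a statement about the author's reasoning), a lambda that closes over a loop variable sees its final value, while functools.partial captures the value at creation time. The sketch below contrasts the two; build_casts and the two parameter names are hypothetical.

```python
import functools
from typing import Callable, Dict, Tuple

import numpy as np


def build_casts(dtypes: Dict[str, str]) -> Tuple[Dict[str, Callable], Dict[str, Callable]]:
    """Build per-parameter cast functions in two ways for comparison."""
    late, bound = {}, {}
    for name, dtype in dtypes.items():
        # Closes over the loop variable: every lambda sees the last dtype.
        late[name] = lambda x: x.astype(dtype)
        # Binds the current dtype value when the partial is created.
        bound[name] = functools.partial(lambda x, dtype: x.astype(dtype), dtype=dtype)
    return late, bound


late, bound = build_casts({"norm.weight": "float16", "lm_head.weight": "float32"})
x = np.zeros(2, dtype=np.float64)

late_dtype = late["norm.weight"](x).dtype
bound_dtype = bound["norm.weight"](x).dtype
print(late_dtype)   # float32 (the last loop value, not the intended one)
print(bound_dtype)  # float16
```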
Parameter Mapping Summary
| MLC LLM Name | HuggingFace Name(s) | Transform |
|---|---|---|
| model.layers.{i}.self_attn.c_attn.weight | q_proj.weight, k_proj.weight, v_proj.weight | concatenate + dtype cast |
| model.layers.{i}.self_attn.c_attn.bias | q_proj.bias, k_proj.bias, v_proj.bias | concatenate + dtype cast (if attention_bias) |
| model.layers.{i}.mlp.gate_up_proj.weight | gate_proj.weight, up_proj.weight | concatenate + dtype cast |
| All *_scale_inv tensors | Corresponding *_scale_inv HF names | same transform as base weight (BlockScaleQuantize only) |
| All other parameters | Same name | dtype cast (identity mapping) |
Design Notes
- The loader derives parameter names and dtypes from the actual model export via model.export_tvm(), ensuring consistency between the mapping and the model definition.
- The allow_extern=True flag is passed to export_tvm, enabling external function calls in the TVM IR.
- The FP8 block-scale quantization pathway converts the model before export, so the named parameters include scale tensors that need dedicated mappings.
Categories
- Model Loading
- Parameter Mapping
- Qwen3 Architecture
- Block-Scale Quantization
- Weight Conversion