
Implementation:Mlc ai Mlc llm Qwen3 Loader

From Leeroopedia
Revision as of 15:51, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Mlc_ai_Mlc_llm_Qwen3_Loader.md)


Overview

The Qwen3 Loader module defines parameter mapping logic for converting Qwen3 model weights from HuggingFace format into MLC LLM's internal representation. It is located at python/mlc_llm/model/qwen3/qwen3_loader.py (135 lines).

A key feature of this loader is its support for BlockScaleQuantize, which enables loading FP8 block-quantized Qwen3 models together with their associated scale inverse tensors. The loader also concatenates the separate Q/K/V attention projections and the gate/up MLP projections into their fused MLC equivalents.

Source File

  • File: python/mlc_llm/model/qwen3/qwen3_loader.py
  • Lines: 135
  • Module: mlc_llm.model.qwen3.qwen3_loader

Dependencies

Import | Purpose
functools | Used for functools.partial to bind dtype to transform lambdas
typing.Callable, typing.List | Type annotations for the helper function
numpy | Used for np.concatenate to fuse split weight tensors
mlc_llm.loader.ExternMapping | Mapping class storing parameter name translations
mlc_llm.loader.QuantizeMapping | Mapping used during block-scale quantization model conversion
mlc_llm.quantization.BlockScaleQuantize | Block-scale quantization support for FP8 models
mlc_llm.quantization.Quantization | Base quantization configuration
.qwen3_model.Qwen3Config, Qwen3LMHeadModel | Qwen3 model configuration and model class

Function: huggingface

def huggingface(model_config: Qwen3Config, quantization: Quantization) -> ExternMapping:

Returns an ExternMapping that maps MLC LLM parameter names to HuggingFace PyTorch parameter names for Qwen3 models.
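
To make the shape of such a mapping concrete, the following is a minimal sketch, not the MLC LLM implementation. ToyMapping is a hypothetical, simplified stand-in for ExternMapping, and its load method is an assumption added purely for illustration. Only the idea of storing, per MLC parameter name, a list of source names plus a transform function is taken from the description above.

```python
from typing import Callable, Dict, List

import numpy as np


class ToyMapping:
    """Hypothetical, simplified stand-in for mlc_llm.loader.ExternMapping."""

    def __init__(self) -> None:
        # MLC parameter name -> list of source (HuggingFace) parameter names
        self.param_map: Dict[str, List[str]] = {}
        # MLC parameter name -> function combining the source tensors
        self.map_func: Dict[str, Callable[..., np.ndarray]] = {}

    def add_mapping(
        self, mlc_name: str, hf_names: List[str], func: Callable[..., np.ndarray]
    ) -> None:
        self.param_map[mlc_name] = hf_names
        self.map_func[mlc_name] = func

    def load(self, mlc_name: str, hf_weights: Dict[str, np.ndarray]) -> np.ndarray:
        """Illustrative helper (an assumption): apply the transform to the sources."""
        sources = [hf_weights[name] for name in self.param_map[mlc_name]]
        return self.map_func[mlc_name](*sources)


mapping = ToyMapping()
mapping.add_mapping("norm.weight", ["norm.weight"], lambda x: x.astype("float16"))

hf_weights = {"norm.weight": np.ones(4, dtype="float32")}
loaded = mapping.load("norm.weight", hf_weights)
print(loaded.dtype, loaded.shape)
```

Keeping the source names and the transform separate means the same structure can describe both one-to-one casts and many-to-one fusions.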

BlockScaleQuantize Support

The function includes special handling for FP8 block-quantized models:

if isinstance(quantization, BlockScaleQuantize):
    model = quantization.quantize_model(model, QuantizeMapping({}, {}), "")
    if model_config.weight_block_size is None:
        raise ValueError(
            "The input Qwen3 model is not fp8 block quantized. "
            "Thus BlockScaleQuantize is not supported."
        )

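The loader enforces consistency between the requested quantization and the source checkpoint in both directions (the excerpt above shows the first check):

  • If BlockScaleQuantize is used, the model config must have weight_block_size set (indicating an FP8 block-quantized source model); otherwise a ValueError is raised.
  • If the model is FP8 block-quantized (has weight_block_size), then BlockScaleQuantize must be used.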
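
The two rules can be read as a single consistency check. The sketch below is illustrative only: check_fp8_consistency, FakeConfig, and FakeBlockScaleQuantize are hypothetical names rather than MLC LLM classes, and the error messages are assumed wording.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class FakeBlockScaleQuantize:
    """Hypothetical stand-in for mlc_llm.quantization.BlockScaleQuantize."""


@dataclass
class FakeConfig:
    """Hypothetical stand-in for Qwen3Config; only the field used here."""

    weight_block_size: Optional[Tuple[int, int]] = None


def check_fp8_consistency(config: FakeConfig, quantization: object) -> None:
    """Raise ValueError unless the quantization choice matches the checkpoint."""
    uses_block_scale = isinstance(quantization, FakeBlockScaleQuantize)
    is_fp8_checkpoint = config.weight_block_size is not None
    if uses_block_scale and not is_fp8_checkpoint:
        raise ValueError("BlockScaleQuantize requires an FP8 block-quantized model.")
    if is_fp8_checkpoint and not uses_block_scale:
        raise ValueError("An FP8 block-quantized model requires BlockScaleQuantize.")


# Matching combinations pass silently.
check_fp8_consistency(FakeConfig((128, 128)), FakeBlockScaleQuantize())
check_fp8_consistency(FakeConfig(None), object())
```

Mismatched combinations (block-scale quantization on a non-FP8 checkpoint, or an FP8 checkpoint without it) raise ValueError in this sketch.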

Helper: add_weight_and_scale_mapping

This helper function adds mappings for both the weight tensor and, when using BlockScaleQuantize, its associated scale inverse tensor:

def add_weight_and_scale_mapping(
    weight_mlc_name: str,
    weight_hf_names: List[str],
    weight_transform_func: Callable,
):
    mlc_param = named_parameters[weight_mlc_name]
    mapping.add_mapping(
        weight_mlc_name,
        weight_hf_names,
        functools.partial(weight_transform_func, dtype=mlc_param.dtype),
    )
    if isinstance(quantization, BlockScaleQuantize):
        scale_mlc_name = f"{weight_mlc_name}_scale_inv"
        if scale_mlc_name in named_parameters:
            scale_hf_names = [f"{name}_scale_inv" for name in weight_hf_names]
            scale_param = named_parameters[scale_mlc_name]
            mapping.add_mapping(
                scale_mlc_name,
                scale_hf_names,
                functools.partial(weight_transform_func, dtype=scale_param.dtype),
            )

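The scale tensor naming follows a convention where _scale_inv is appended to the corresponding weight name, on both the MLC and the HuggingFace side. For example, q_proj.weight is paired with q_proj.weight_scale_inv. A scale mapping is only added when the derived MLC name actually exists in named_parameters.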
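
The name derivation can be tried in isolation. This is a small sketch that reuses the two f-string expressions from the helper; scale_names is a hypothetical function introduced only for this example.

```python
from typing import List, Tuple


def scale_names(weight_mlc_name: str, weight_hf_names: List[str]) -> Tuple[str, List[str]]:
    """Derive scale-tensor names by appending the `_scale_inv` suffix."""
    mlc_scale = f"{weight_mlc_name}_scale_inv"
    hf_scales = [f"{name}_scale_inv" for name in weight_hf_names]
    return mlc_scale, hf_scales


attn = "model.layers.0.self_attn"
mlc_scale, hf_scales = scale_names(
    f"{attn}.c_attn.weight",
    [f"{attn}.{p}_proj.weight" for p in ("q", "k", "v")],
)
print(mlc_scale)
for name in hf_scales:
    print(name)
```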

Attention Weight Mapping

For each hidden layer, separate Q, K, and V projection weights are concatenated into a fused c_attn weight:

for i in range(model_config.num_hidden_layers):
    attn = f"model.layers.{i}.self_attn"
    add_weight_and_scale_mapping(
        f"{attn}.c_attn.weight",
        [
            f"{attn}.q_proj.weight",
            f"{attn}.k_proj.weight",
            f"{attn}.v_proj.weight",
        ],
        lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
    )

When model_config.attention_bias is True, the corresponding bias tensors are also concatenated.
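
As a quick illustration of what the concatenation does to tensor shapes, the sketch below uses toy sizes chosen for this example (they are not Qwen3 dimensions). It relies on the HuggingFace convention that linear weights are stored as (out_features, in_features), so concatenating along axis 0 stacks the output rows of Q, K, and V.

```python
import numpy as np

# Toy sizes chosen for illustration only.
hidden, q_out, kv_out = 8, 8, 4

q = np.full((q_out, hidden), 1.0, dtype="float32")
k = np.full((kv_out, hidden), 2.0, dtype="float32")
v = np.full((kv_out, hidden), 3.0, dtype="float32")

# Same expression as the transform lambda, with dtype fixed to float16.
fused = np.concatenate([q, k, v], axis=0).astype("float16")
print(fused.shape)

# The original blocks can be recovered by splitting at the row boundaries.
q_back, k_back, v_back = np.split(fused, [q_out, q_out + kv_out], axis=0)
```

The fused tensor has q_out + 2 * kv_out rows, and the row order (Q, then K, then V) is what any later split of c_attn has to assume.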

MLP Weight Mapping

Within the same per-layer loop (the snippet below reuses the loop variable i), the gate and up projection weights are fused into gate_up_proj:

mlp = f"model.layers.{i}.mlp"
add_weight_and_scale_mapping(
    f"{mlp}.gate_up_proj.weight",
    [
        f"{mlp}.gate_proj.weight",
        f"{mlp}.up_proj.weight",
    ],
    lambda gate, up, dtype: np.concatenate([gate, up], axis=0).astype(dtype),
)
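
The reason fusing is safe can be checked numerically: one matrix multiply against the concatenated weight yields the same values as two separate multiplies, laid out side by side. The sketch below uses random toy tensors and assumes the fused output is later split in half along the last axis. That split is an assumption about how the model consumes gate_up_proj, not something shown in the loader.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inter = 4, 6  # toy sizes for illustration

gate = rng.standard_normal((inter, hidden))
up = rng.standard_normal((inter, hidden))
x = rng.standard_normal((2, hidden))  # a batch of two input vectors

# Fuse along the output dimension, as the loader's transform does.
gate_up = np.concatenate([gate, up], axis=0)

# One fused projection, then split the result into the two halves.
fused_out = x @ gate_up.T
gate_out, up_out = np.split(fused_out, 2, axis=-1)

matches = np.allclose(gate_out, x @ gate.T) and np.allclose(up_out, x @ up.T)
print(fused_out.shape, matches)
```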

Identity Mapping for Remaining Parameters

Any parameters not explicitly mapped (such as layer norms, the output projection, and embedding weights) are identity-mapped with a dtype cast:

for mlc_name, mlc_param in named_parameters.items():
    if mlc_name not in mapping.param_map:
        mapping.add_mapping(
            mlc_name,
            [mlc_name],
            functools.partial(
                lambda x, dtype: x.astype(dtype),
                dtype=mlc_param.dtype,
            ),
        )
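
One likely reason for binding dtype with functools.partial instead of referring to the loop variable inside the lambda is Python's late binding of closures. This is an inference rather than something stated in the source. The sketch below contrasts the two approaches; build_transforms and the parameter names are hypothetical.

```python
import functools
from typing import Callable, Dict, Tuple

import numpy as np

Transform = Callable[[np.ndarray], np.ndarray]


def build_transforms(
    target_dtypes: Dict[str, str],
) -> Tuple[Dict[str, Transform], Dict[str, Transform]]:
    """Build per-parameter casts twice: via a plain closure and via partial."""
    late_bound: Dict[str, Transform] = {}
    bound: Dict[str, Transform] = {}
    for name, dtype in target_dtypes.items():
        # Closure: `dtype` is looked up when the lambda is called, so every
        # lambda sees the value from the final loop iteration.
        late_bound[name] = lambda x: x.astype(dtype)
        # partial: the current value of `dtype` is captured immediately.
        bound[name] = functools.partial(lambda x, dtype: x.astype(dtype), dtype=dtype)
    return late_bound, bound


late_bound, bound = build_transforms({"a.weight": "float16", "b.weight": "float32"})
x = np.ones(3, dtype="float64")
print(late_bound["a.weight"](x).dtype)  # float32: the last dtype leaked in
print(bound["a.weight"](x).dtype)       # float16: the intended dtype
```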

Parameter Mapping Summary

MLC LLM Name | HuggingFace Name(s) | Transform
model.layers.{i}.self_attn.c_attn.weight | q_proj.weight, k_proj.weight, v_proj.weight | concatenate + dtype cast
model.layers.{i}.self_attn.c_attn.bias | q_proj.bias, k_proj.bias, v_proj.bias | concatenate + dtype cast (if attention_bias)
model.layers.{i}.mlp.gate_up_proj.weight | gate_proj.weight, up_proj.weight | concatenate + dtype cast
All *_scale_inv tensors | Corresponding *_scale_inv HF names | same transform as base weight (BlockScaleQuantize only)
All other parameters | Same name | dtype cast (identity mapping)

Design Notes

  • The loader derives parameter names and dtypes from the actual model export via model.export_tvm(), ensuring consistency between the mapping and the model definition.
  • The allow_extern=True flag is passed to export_tvm, enabling external function calls in the TVM IR.
  • The FP8 block-scale quantization pathway converts the model before export, so the named parameters include scale tensors that need dedicated mappings.

Categories

  • Model Loading
  • Parameter Mapping
  • Qwen3 Architecture
  • Block-Scale Quantization
  • Weight Conversion
