Implementation:Mlc ai Mlc llm Llama Loader

From Leeroopedia


Overview

The Llama Loader module (python/mlc_llm/model/llama/llama_loader.py) defines parameter mappings for converting Llama model weights from HuggingFace and AWQ formats into MLC LLM's internal representation. It provides two functions: huggingface for standard HuggingFace weights and awq for pre-quantized AWQ weights.

Location

  • File: python/mlc_llm/model/llama/llama_loader.py
  • Lines: 172
  • Module: mlc_llm.model.llama

Function: huggingface

def huggingface(model_config: LlamaConfig, quantization: Quantization) -> ExternMapping:

Returns a parameter mapping from MLC LLM parameter names to HuggingFace PyTorch parameter names for the Llama architecture.

Parameters:

Parameter      Type           Description
model_config   LlamaConfig    The configuration of the Llama model.
quantization   Quantization   The quantization configuration.
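
To make the idea of a parameter mapping concrete, here is a minimal sketch of how such a structure might be built and consumed. ToyMapping and apply_mapping are hypothetical stand-ins written for this page; they only mirror the add_mapping, add_unused, and param_map names that appear in the snippets here and are not the real ExternMapping class.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

import numpy as np


@dataclass
class ToyMapping:
    """Hypothetical stand-in for ExternMapping (illustration only)."""

    param_map: Dict[str, List[str]] = field(default_factory=dict)
    map_func: Dict[str, Callable[..., np.ndarray]] = field(default_factory=dict)
    unused_params: Set[str] = field(default_factory=set)

    def add_mapping(self, mlc_name, source_names, func):
        # One MLC-side name depends on one or more source-side names.
        self.param_map[mlc_name] = source_names
        self.map_func[mlc_name] = func

    def add_unused(self, name):
        # Source tensors listed here are deliberately never loaded.
        self.unused_params.add(name)


def apply_mapping(mapping, source_weights):
    """Build every MLC-side tensor from the source tensors it depends on."""
    result = {}
    for mlc_name, source_names in mapping.param_map.items():
        inputs = [source_weights[name] for name in source_names]
        result[mlc_name] = mapping.map_func[mlc_name](*inputs)
    return result


toy = ToyMapping()
toy.add_mapping("norm.weight", ["norm.weight"], lambda x: x.astype("float16"))
toy.add_unused("rotary_emb.inv_freq")

converted = apply_mapping(
    toy,
    {
        "norm.weight": np.ones(4, dtype="float32"),
        "rotary_emb.inv_freq": np.ones(2, dtype="float32"),
    },
)
```

Only names registered through add_mapping appear in the converted result; the tensor marked unused is present in the source weights but never read.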

Initialization

model = LlamaForCausalLM(model_config)
if quantization is not None:
    model.to(quantization.model_dtype)
_, _named_params, _ = model.export_tvm(
    spec=model.get_default_spec(),
    allow_extern=True,
)
named_parameters = dict(_named_params)

Per-Layer Mappings

For each of the num_hidden_layers transformer layers, two fusions and one exclusion are registered:

QKV Projection Fusion

Separate Q, K, V projection weights from HuggingFace are concatenated along axis 0 into a single qkv_proj.weight:

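# Runs inside the per-layer loop, where i is the layer index.
attn = f"model.layers.{i}.self_attn"
mlc_name = f"{attn}.qkv_proj.weight"
mlc_param = named_parameters[mlc_name]  # supplies the target dtype
mapping.add_mapping(
    mlc_name,
    [
        f"{attn}.q_proj.weight",
        f"{attn}.k_proj.weight",
        f"{attn}.v_proj.weight",
    ],
    functools.partial(
        # Stack Q, K, V row-wise, then cast to the MLC parameter's dtype.
        lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
        dtype=mlc_param.dtype,
    ),
)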
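
The effect of the axis-0 concatenation on shapes can be sketched with plain NumPy. The sizes below are toy values chosen for illustration, not taken from any real checkpoint, and the example assumes the usual [out_features, in_features] layout of HuggingFace linear weights.

```python
import functools

import numpy as np

# Assumed toy sizes (hypothetical, for illustration only).
hidden_size, head_dim = 8, 2
num_q_heads, num_kv_heads = 4, 2

rng = np.random.default_rng(0)
q = rng.standard_normal((num_q_heads * head_dim, hidden_size)).astype("float32")
k = rng.standard_normal((num_kv_heads * head_dim, hidden_size)).astype("float32")
v = rng.standard_normal((num_kv_heads * head_dim, hidden_size)).astype("float32")

# Same shape of callable as the one registered for qkv_proj.weight.
fuse_qkv = functools.partial(
    lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
    dtype="float16",
)
qkv = fuse_qkv(q, k, v)

# Rows are stacked in Q, K, V order, so slicing on axis 0 recovers each block.
q_rows = num_q_heads * head_dim
kv_rows = num_kv_heads * head_dim
q_back = qkv[:q_rows]
k_back = qkv[q_rows : q_rows + kv_rows]
v_back = qkv[q_rows + kv_rows :]
```

Because K and V may have fewer rows than Q when the number of key/value heads is smaller, the fused tensor is split by row offsets rather than into three equal parts.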

MLP Gate-Up Fusion

Separate gate and up projection weights are concatenated along axis 0 into a single gate_up_proj.weight:

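# Runs inside the same per-layer loop as the QKV fusion.
mlp = f"model.layers.{i}.mlp"
mlc_name = f"{mlp}.gate_up_proj.weight"
mlc_param = named_parameters[mlc_name]  # supplies the target dtype
mapping.add_mapping(
    mlc_name,
    [
        f"{mlp}.gate_proj.weight",
        f"{mlp}.up_proj.weight",
    ],
    functools.partial(
        # Stack gate and up row-wise, then cast to the MLC parameter's dtype.
        lambda gate, up, dtype: np.concatenate([gate, up], axis=0).astype(dtype),
        dtype=mlc_param.dtype,
    ),
)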
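
A short NumPy check shows why stacking the two weights on axis 0 is safe: one matrix multiplication against the fused weight yields the same values as two separate multiplications. This is a sketch with made-up toy sizes and assumes weights in [out_features, in_features] layout; it does not reproduce MLC LLM's actual kernels.

```python
import numpy as np

# Assumed toy sizes (hypothetical, for illustration only).
hidden_size, intermediate_size = 4, 6

rng = np.random.default_rng(0)
gate = rng.standard_normal((intermediate_size, hidden_size))
up = rng.standard_normal((intermediate_size, hidden_size))
x = rng.standard_normal((3, hidden_size))  # three token activations

# Fuse along the output-feature axis, as the loader does.
gate_up = np.concatenate([gate, up], axis=0)

# One matmul produces both projections side by side on the last axis.
fused_out = x @ gate_up.T
gate_out, up_out = np.split(fused_out, 2, axis=-1)
```

The first half of the fused output matches x @ gate.T and the second half matches x @ up.T, so the model can split the result after a single projection.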

Unused Parameters

The rotary embedding inverse frequency tensor is marked as unused because MLC LLM computes rotary position embeddings at runtime and does not load precomputed inverse frequencies from the checkpoint:

mapping.add_unused(f"{attn}.rotary_emb.inv_freq")
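
The inverse frequencies can be skipped because they are a pure function of the head dimension and the rotary base. The sketch below uses the standard rotary embedding formula; the function name rope_inv_freq and the default base of 10000 are assumptions for illustration, and this is not MLC LLM's runtime code.

```python
import numpy as np


def rope_inv_freq(head_dim: int, theta: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies: 1 / theta^(2i / head_dim)."""
    exponents = np.arange(0, head_dim, 2, dtype="float64") / head_dim
    return 1.0 / (theta ** exponents)


# One frequency per pair of channels in the head.
inv_freq = rope_inv_freq(head_dim=8)
```

Since every value can be regenerated from configuration alone, loading the stored tensor would add nothing.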

Identity Fallback

All remaining parameters (embedding, layer norms, output head, etc.) that were not registered above are mapped one-to-one under the same name, with only a dtype cast:

for mlc_name, mlc_param in named_parameters.items():
    if mlc_name not in mapping.param_map:
        mapping.add_mapping(
            mlc_name,
            [mlc_name],
            functools.partial(
                lambda x, dtype: x.astype(dtype),
                dtype=mlc_param.dtype,
            ),
        )
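
One plausible reason the dtype is passed through functools.partial rather than captured by a plain closure is Python's late binding of loop variables. This is an inference, not something the source states. The sketch below uses a hypothetical helper, build_casts, to contrast the two approaches.

```python
import functools

import numpy as np


def build_casts(dtypes):
    """Register one cast per parameter in two ways (hypothetical helper)."""
    late_bound, bound_now = {}, {}
    for name, dtype in dtypes.items():
        # Plain closure: `dtype` is looked up when the lambda is called,
        # which is after the loop has moved on to its final value.
        late_bound[name] = lambda x: x.astype(dtype)
        # partial: the current value of `dtype` is frozen at registration.
        bound_now[name] = functools.partial(
            lambda x, dtype: x.astype(dtype),
            dtype=dtype,
        )
    return late_bound, bound_now


late_bound, bound_now = build_casts({"a.weight": "float16", "b.weight": "float32"})
x = np.zeros(2, dtype="float64")

wrong = late_bound["a.weight"](x).dtype  # picks up the last loop value
right = bound_now["a.weight"](x).dtype   # keeps the value for "a.weight"
```

With the plain closure, every registered cast would use the dtype of the last parameter visited; binding the value per iteration keeps each parameter's own dtype.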

Function: awq

def awq(model_config: LlamaConfig, quantization: Quantization) -> ExternMapping:

Returns a parameter mapping from MLC LLM parameter names to AWQ pre-quantized parameter names.

Initialization

Unlike the HuggingFace path, the AWQ function uses awq_quant to create a quantized model instance:

model, _ = awq_quant(model_config, quantization)
_, _named_params, _ = model.export_tvm(
    spec=model.get_default_spec(),
    allow_extern=True,
)
named_parameters = dict(_named_params)

Quantized Parameter Fusion

For each layer, the AWQ function maps three quantization-specific suffixes (qweight, qzeros, scales) for both the QKV attention and gate-up MLP projections. The concatenation uses axis=1 instead of axis=0 because the AWQ GEMM layout holds the weight in transposed form, so the output-feature dimension that the fusion extends is axis 1:

# Runs inside the per-layer loop, with attn = f"model.layers.{i}.self_attn".
for quantize_suffix in ["qweight", "qzeros", "scales"]:
    mlc_name = f"{attn}.qkv_proj.{quantize_suffix}"
    mlc_param = named_parameters[mlc_name]  # supplies the target dtype
    mapping.add_mapping(
        mlc_name,
        [
            f"{attn}.q_proj.{quantize_suffix}",
            f"{attn}.k_proj.{quantize_suffix}",
            f"{attn}.v_proj.{quantize_suffix}",
        ],
        functools.partial(
            lambda q, k, v, dtype: np.concatenate(
                [q, k, v],
                axis=1,  # AWQ GEMM would transpose the weight
            ).astype(dtype),
            dtype=mlc_param.dtype,
        ),
    )

The same pattern applies to MLP gate/up fusion with AWQ parameters concatenated along axis 1.
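
The axis choice can be sketched with placeholder tensors. The shapes below are an assumption about a 4-bit AWQ GEMM layout (eight values packed per int32, input features on axis 0) and are not stated in the source; fake_awq_tensors is a hypothetical helper.

```python
import numpy as np

# Assumed layout: 4-bit weights, 8 values packed into each int32.
in_features, group_size, pack = 128, 64, 8
out_q, out_k, out_v = 32, 16, 16  # toy output sizes


def fake_awq_tensors(out_features):
    """Zero-filled tensors with the assumed AWQ GEMM shapes."""
    groups = in_features // group_size
    return {
        "qweight": np.zeros((in_features, out_features // pack), dtype="int32"),
        "qzeros": np.zeros((groups, out_features // pack), dtype="int32"),
        "scales": np.zeros((groups, out_features), dtype="float16"),
    }


q, k, v = (fake_awq_tensors(n) for n in (out_q, out_k, out_v))

# Output features live on axis 1 in this layout, so fusion extends axis 1.
fused = {
    suffix: np.concatenate([q[suffix], k[suffix], v[suffix]], axis=1)
    for suffix in ("qweight", "qzeros", "scales")
}
```

Under these assumptions axis 0 is identical across Q, K, and V for every suffix, which is what makes axis 1 the only dimension along which the three tensors can be joined.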

Key Design Decisions

  • QKV fusion: Fusing Q, K, V into a single projection reduces kernel launch overhead during inference.
  • AWQ axis difference: AWQ stores weights in transposed form, so concatenation happens on axis 1 rather than axis 0.
  • Unused rotary embeddings: MLC LLM computes rotary position embeddings at runtime rather than storing precomputed inverse frequencies.

Dependencies

  • functools -- for functools.partial
  • numpy -- for array concatenation and dtype casting
  • mlc_llm.loader.ExternMapping -- the core mapping data structure
  • mlc_llm.quantization.Quantization -- quantization configuration
  • .llama_model.LlamaConfig, .llama_model.LlamaForCausalLM -- Llama model definitions
  • .llama_quantization.awq_quant -- AWQ quantization utility
