Implementation:Mlc ai Mlc llm Llama Loader
Overview
The Llama Loader module (python/mlc_llm/model/llama/llama_loader.py) defines parameter mappings for converting Llama model weights from HuggingFace and AWQ formats into MLC LLM's internal representation. It provides two functions: huggingface for standard HuggingFace weights and awq for pre-quantized AWQ weights.
Location
- File: python/mlc_llm/model/llama/llama_loader.py
- Lines: 172
- Module: mlc_llm.model.llama
Function: huggingface
def huggingface(model_config: LlamaConfig, quantization: Quantization) -> ExternMapping:
Returns a parameter mapping from MLC LLM parameter names to HuggingFace PyTorch parameter names for the Llama architecture.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| model_config | LlamaConfig | The configuration of the Llama model. |
| quantization | Quantization | The quantization configuration. |
Initialization
model = LlamaForCausalLM(model_config)
if quantization is not None:
    model.to(quantization.model_dtype)
_, _named_params, _ = model.export_tvm(
    spec=model.get_default_spec(),
    allow_extern=True,
)
named_parameters = dict(_named_params)
Per-Layer Mappings
For each of the num_hidden_layers transformer layers, two fusions and one exclusion are registered:
QKV Projection Fusion
Separate Q, K, V projection weights from HuggingFace are concatenated along axis 0 into a single qkv_proj.weight:
attn = f"model.layers.{i}.self_attn"
mlc_name = f"{attn}.qkv_proj.weight"
# Look up the MLC-side parameter so its dtype can be bound below.
mlc_param = named_parameters[mlc_name]
mapping.add_mapping(
    mlc_name,
    [
        f"{attn}.q_proj.weight",
        f"{attn}.k_proj.weight",
        f"{attn}.v_proj.weight",
    ],
    functools.partial(
        lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
        dtype=mlc_param.dtype,
    ),
)
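To make the shape bookkeeping concrete, the following minimal sketch applies the same concatenate-then-cast transform to toy arrays. The sizes (hidden_size, the head counts, head_dim) and the float16 target dtype are assumptions chosen for illustration, not values from a real Llama config.

```python
import functools

import numpy as np

# Toy sizes (assumptions for illustration only).
hidden_size = 8
num_q_heads, num_kv_heads, head_dim = 4, 2, 2

# HuggingFace linear weights are laid out as [out_features, in_features].
q = np.ones((num_q_heads * head_dim, hidden_size), dtype=np.float32)
k = np.ones((num_kv_heads * head_dim, hidden_size), dtype=np.float32)
v = np.ones((num_kv_heads * head_dim, hidden_size), dtype=np.float32)

# Same form as the registered transform: concatenate on axis 0, then cast.
fuse_qkv = functools.partial(
    lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
    dtype="float16",
)

qkv = fuse_qkv(q, k, v)
print(qkv.shape, qkv.dtype)  # (16, 8) float16
```

The fused row count is the sum of the three output dimensions (8 + 4 + 4), while the input dimension on axis 1 is unchanged.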
MLP Gate-Up Fusion
Separate gate and up projection weights are concatenated along axis 0 into a single gate_up_proj.weight:
mlp = f"model.layers.{i}.mlp"
mlc_name = f"{mlp}.gate_up_proj.weight"
# Look up the MLC-side parameter so its dtype can be bound below.
mlc_param = named_parameters[mlc_name]
mapping.add_mapping(
    mlc_name,
    [
        f"{mlp}.gate_proj.weight",
        f"{mlp}.up_proj.weight",
    ],
    functools.partial(
        lambda gate, up, dtype: np.concatenate([gate, up], axis=0).astype(dtype),
        dtype=mlc_param.dtype,
    ),
)
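Concatenating along axis 0 is valid because each output row of a linear layer depends only on its own weight row, so one matmul against the fused matrix yields both projections side by side. The sketch below checks this numerically on random toy data; the sizes and the two-token input are assumptions, and this is not MLC LLM's runtime code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, intermediate_size = 4, 6  # toy sizes, assumed

# Weights in the HuggingFace [out_features, in_features] layout.
gate = rng.standard_normal((intermediate_size, hidden_size))
up = rng.standard_normal((intermediate_size, hidden_size))
x = rng.standard_normal((2, hidden_size))  # two token activations

# Fused weight: [2 * intermediate_size, hidden_size].
gate_up = np.concatenate([gate, up], axis=0)

# One matmul, then split the result back into its two halves.
fused_out = x @ gate_up.T
gate_out, up_out = np.split(fused_out, 2, axis=-1)

# Each half matches the corresponding separate projection.
assert np.allclose(gate_out, x @ gate.T)
assert np.allclose(up_out, x @ up.T)
```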
Unused Parameters
The rotary embedding inverse frequency tensor is marked as unused, since MLC LLM computes rotary position embeddings at runtime rather than loading precomputed inverse frequencies:
mapping.add_unused(f"{attn}.rotary_emb.inv_freq")
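Skipping this tensor loses no information, because the inverse frequencies are a pure function of the head dimension and the RoPE base. The sketch below uses the standard RoPE formula; the function name rope_inv_freq, the base of 10000.0, and head_dim=8 are assumptions for illustration, and this is not MLC LLM's actual implementation.

```python
def rope_inv_freq(head_dim: int, rope_theta: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies: rope_theta ** (-2i / head_dim)."""
    return [rope_theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


# One frequency per pair of rotated dimensions.
inv_freq = rope_inv_freq(head_dim=8)
print(len(inv_freq), inv_freq[0])  # 4 1.0
```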
Identity Fallback
All remaining parameters (embedding, layer norms, output head, etc.) map one-to-one to the HuggingFace parameter of the same name, with only a dtype cast:
for mlc_name, mlc_param in named_parameters.items():
    if mlc_name not in mapping.param_map:
        mapping.add_mapping(
            mlc_name,
            [mlc_name],
            functools.partial(
                lambda x, dtype: x.astype(dtype),
                dtype=mlc_param.dtype,
            ),
        )
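The fallback loop can be tried without MLC LLM installed by using a small stand-in for the mapping object. In the sketch below, ToyMapping is a hypothetical class that mirrors only the calls used on this page (add_mapping, add_unused, param_map); the parameter names and dtypes are made up, and the transform tags its input with the dtype instead of casting. It also illustrates a plausible reason for binding dtype through functools.partial: a plain closure created in a loop reads the loop variable when it is called, not when it is defined.

```python
import functools
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class ToyMapping:
    """Hypothetical stand-in for ExternMapping, for illustration only."""

    param_map: Dict[str, List[str]] = field(default_factory=dict)
    map_func: Dict[str, Callable] = field(default_factory=dict)
    unused_params: Set[str] = field(default_factory=set)

    def add_mapping(self, mlc_name: str, source_names: List[str], func: Callable) -> None:
        self.param_map[mlc_name] = source_names
        self.map_func[mlc_name] = func

    def add_unused(self, name: str) -> None:
        self.unused_params.add(name)


# Made-up MLC-side parameter names mapped to their target dtypes.
named_parameters = {
    "model.layers.0.self_attn.qkv_proj.weight": "float16",
    "model.embed_tokens.weight": "float16",
    "model.norm.weight": "float32",
}

mapping = ToyMapping()
attn = "model.layers.0.self_attn"
# Pretend the per-layer loop already registered the fused QKV weight.
mapping.add_mapping(
    f"{attn}.qkv_proj.weight",
    [f"{attn}.q_proj.weight", f"{attn}.k_proj.weight", f"{attn}.v_proj.weight"],
    lambda q, k, v: ("fused", q, k, v),
)
mapping.add_unused(f"{attn}.rotary_emb.inv_freq")

late_bound = {}
for mlc_name, mlc_dtype in named_parameters.items():
    if mlc_name not in mapping.param_map:
        # partial freezes the current dtype at registration time.
        mapping.add_mapping(
            mlc_name,
            [mlc_name],
            functools.partial(lambda x, dtype: (x, dtype), dtype=mlc_dtype),
        )
        # For contrast: a plain closure looks up mlc_dtype when it is called.
        late_bound[mlc_name] = lambda x: (x, mlc_dtype)

print(mapping.map_func["model.embed_tokens.weight"]("w"))  # ('w', 'float16')
print(late_bound["model.embed_tokens.weight"]("w"))        # ('w', 'float32')
```

The already-registered fused entry is left untouched, and only the two remaining names receive identity mappings.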
Function: awq
def awq(model_config: LlamaConfig, quantization: Quantization) -> ExternMapping:
Returns a parameter mapping from MLC LLM parameter names to AWQ pre-quantized parameter names.
Initialization
Unlike the HuggingFace path, the AWQ function uses awq_quant to create a quantized model instance:
model, _ = awq_quant(model_config, quantization)
_, _named_params, _ = model.export_tvm(
    spec=model.get_default_spec(),
    allow_extern=True,
)
named_parameters = dict(_named_params)
Quantized Parameter Fusion
For each layer, the AWQ function maps three quantization-specific suffixes (qweight, qzeros, scales) for both the QKV attention and gate-up MLP projections. The concatenation uses axis=1 instead of axis=0, because AWQ GEMM transposes the weight matrix:
for quantize_suffix in ["qweight", "qzeros", "scales"]:
    mlc_name = f"{attn}.qkv_proj.{quantize_suffix}"
    # Look up the MLC-side parameter so its dtype can be bound below.
    mlc_param = named_parameters[mlc_name]
    mapping.add_mapping(
        mlc_name,
        [
            f"{attn}.q_proj.{quantize_suffix}",
            f"{attn}.k_proj.{quantize_suffix}",
            f"{attn}.v_proj.{quantize_suffix}",
        ],
        functools.partial(
            lambda q, k, v, dtype: np.concatenate(
                [q, k, v],
                axis=1,  # AWQ GEMM would transpose the weight
            ).astype(dtype),
            dtype=mlc_param.dtype,
        ),
    )
The same pattern applies to MLP gate/up fusion with AWQ parameters concatenated along axis 1.
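The effect of the axis change can be seen on toy tensors. The sketch below assumes the common AWQ GEMM layout in which the input dimension comes first: qweight is [in_features, out_features / pack], qzeros is [in_features / group_size, out_features / pack], and scales is [in_features / group_size, out_features]. The sizes, the packing factor of 8 (4-bit values in a 32-bit word), and the helper fake_awq are all assumptions for illustration.

```python
import numpy as np

# Assumed toy layout: 4-bit weights packed 8 per int32, one quantization group.
in_features, pack, group_size = 16, 8, 16
q_out, kv_out = 16, 8


def fake_awq(out_features: int) -> dict:
    """Build zero-filled tensors with the assumed AWQ GEMM shapes."""
    groups = in_features // group_size
    return {
        "qweight": np.zeros((in_features, out_features // pack), dtype=np.int32),
        "qzeros": np.zeros((groups, out_features // pack), dtype=np.int32),
        "scales": np.ones((groups, out_features), dtype=np.float16),
    }


q, k, v = fake_awq(q_out), fake_awq(kv_out), fake_awq(kv_out)

# The output dimension sits on axis 1, so that is the axis to fuse along.
fused = {
    suffix: np.concatenate([q[suffix], k[suffix], v[suffix]], axis=1)
    for suffix in ("qweight", "qzeros", "scales")
}

for suffix, arr in fused.items():
    print(suffix, arr.shape)
```

Under these assumptions axis 0 (the input dimension or the group count) is identical across Q, K, and V, so only axis 1 grows.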
Key Design Decisions
- QKV fusion: Fusing Q, K, V into a single projection reduces kernel launch overhead during inference.
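- AWQ axis difference: AWQ GEMM stores weights transposed relative to the HuggingFace layout, which puts the output dimension on axis 1; the fused projections are therefore concatenated along axis 1 rather than axis 0.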
- Unused rotary embeddings: MLC LLM computes rotary position embeddings at runtime rather than storing precomputed inverse frequencies.
Dependencies
- functools -- for functools.partial
- numpy -- for array concatenation and dtype casting
- mlc_llm.loader.ExternMapping -- the core mapping data structure
- mlc_llm.quantization.Quantization -- quantization configuration
- .llama_model.LlamaConfig, .llama_model.LlamaForCausalLM -- Llama model definitions
- .llama_quantization.awq_quant -- AWQ quantization utility