Implementation:Mlc ai Mlc llm OLMo Loader
Overview
The OLMo Loader module (python/mlc_llm/model/olmo/olmo_loader.py) defines parameter mappings for converting OLMo (Open Language Model) weights from HuggingFace and AWQ formats into MLC LLM's internal representation. The structure closely mirrors the Llama loader, providing both huggingface and awq functions for standard and pre-quantized weight formats respectively.
Location
- File: python/mlc_llm/model/olmo/olmo_loader.py
- Lines: 172
- Module: mlc_llm.model.olmo
Function: huggingface
def huggingface(model_config: OLMoConfig, quantization: Quantization) -> ExternMapping:
Returns a parameter mapping from MLC LLM parameter names to HuggingFace PyTorch parameter names for the OLMo architecture.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| model_config | OLMoConfig | The configuration of the OLMo model. |
| quantization | Quantization | The quantization configuration. |
Initialization
model = OLMoForCausalLM(model_config)
if quantization is not None:
    model.to(quantization.model_dtype)
_, _named_params, _ = model.export_tvm(
    spec=model.get_default_spec(),
    allow_extern=True,
)
named_parameters = dict(_named_params)
Per-Layer Mappings
For each of the num_hidden_layers transformer layers:
QKV Projection Fusion
Q, K, V projection weights are concatenated along axis 0 into a single qkv_proj.weight:
attn = f"model.layers.{i}.self_attn"
mlc_name = f"{attn}.qkv_proj.weight"
mapping.add_mapping(
    mlc_name,
    [
        f"{attn}.q_proj.weight",
        f"{attn}.k_proj.weight",
        f"{attn}.v_proj.weight",
    ],
    functools.partial(
        lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
        dtype=mlc_param.dtype,
    ),
)
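The effect of the axis-0 concatenation on shapes can be sketched with plain NumPy. The dimensions below are hypothetical values chosen for illustration, not OLMo's real sizes, and `fuse_qkv` is an illustrative name; the lambda itself is the one shown above.

```python
import functools

import numpy as np

# Hypothetical sizes for illustration only (not OLMo's real dimensions).
hidden_size, num_heads, num_kv_heads, head_dim = 8, 4, 2, 2

# HuggingFace linear weights are stored as (out_features, in_features).
q = np.random.rand(num_heads * head_dim, hidden_size).astype("float32")
k = np.random.rand(num_kv_heads * head_dim, hidden_size).astype("float32")
v = np.random.rand(num_kv_heads * head_dim, hidden_size).astype("float32")

# Same fusion function as in the loader, with the target dtype bound up front.
fuse_qkv = functools.partial(
    lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
    dtype="float16",
)

qkv = fuse_qkv(q, k, v)
print(qkv.shape, qkv.dtype)  # (16, 8) float16
```

Because the output dimension is axis 0 in this layout, stacking along axis 0 keeps the input dimension (`hidden_size`) shared and simply appends the K and V rows after the Q rows.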
MLP Gate-Up Fusion
Gate and up projection weights are concatenated along axis 0:
mlp = f"model.layers.{i}.mlp"
mlc_name = f"{mlp}.gate_up_proj.weight"
mapping.add_mapping(
    mlc_name,
    [
        f"{mlp}.gate_proj.weight",
        f"{mlp}.up_proj.weight",
    ],
    functools.partial(
        lambda gate, up, dtype: np.concatenate([gate, up], axis=0).astype(dtype),
        dtype=mlc_param.dtype,
    ),
)
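A small NumPy check illustrates why fusing along axis 0 is safe: one matrix multiply against the fused weight yields the gate and up outputs side by side, and splitting the result recovers each one. The sizes and variable names here are assumptions for illustration; how the model itself splits the fused output is not shown in this page.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, intermediate_size = 4, 6  # hypothetical sizes

# Weights in (out_features, in_features) layout.
gate = rng.standard_normal((intermediate_size, hidden_size))
up = rng.standard_normal((intermediate_size, hidden_size))
gate_up = np.concatenate([gate, up], axis=0)  # shape (12, 4)

x = rng.standard_normal((1, hidden_size))

# One fused matmul, then split the output into its two halves.
fused_out = x @ gate_up.T  # shape (1, 12)
gate_out, up_out = np.split(fused_out, 2, axis=-1)

# The halves match the two separate projections.
assert np.allclose(gate_out, x @ gate.T)
assert np.allclose(up_out, x @ up.T)
```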
Unused Parameters
The rotary embedding inverse-frequency tensor may appear in HuggingFace checkpoints but has no corresponding MLC parameter, so it is marked as unused for each layer:
mapping.add_unused(f"{attn}.rotary_emb.inv_freq")
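How fused mappings and unused names together account for every tensor in a checkpoint can be sketched with a toy stand-in. `ToyMapping` below is hypothetical and only mimics the `add_mapping`, `add_unused`, and `param_map` names used on this page; it is not the real `ExternMapping` class, and the checkpoint is fake data.

```python
import numpy as np


class ToyMapping:
    """Hypothetical stand-in for ExternMapping, for illustration only."""

    def __init__(self):
        self.param_map = {}  # MLC name -> list of source names
        self.map_func = {}  # MLC name -> function combining the sources
        self.unused_params = set()  # source names deliberately ignored

    def add_mapping(self, map_from, map_to, func):
        self.param_map[map_from] = map_to
        self.map_func[map_from] = func

    def add_unused(self, name):
        self.unused_params.add(name)


attn = "model.layers.0.self_attn"
mapping = ToyMapping()
mapping.add_mapping(
    f"{attn}.qkv_proj.weight",
    [f"{attn}.{p}_proj.weight" for p in "qkv"],
    lambda q, k, v: np.concatenate([q, k, v], axis=0),
)
mapping.add_unused(f"{attn}.rotary_emb.inv_freq")

# Fake checkpoint containing the three projections plus the rotary buffer.
checkpoint = {f"{attn}.{p}_proj.weight": np.ones((4, 4)) for p in "qkv"}
checkpoint[f"{attn}.rotary_emb.inv_freq"] = np.ones(2)

# Every checkpoint tensor is either consumed by a mapping or marked unused.
consumed = {src for srcs in mapping.param_map.values() for src in srcs}
leftover = set(checkpoint) - consumed - mapping.unused_params
print(leftover)  # set()
```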
Identity Fallback
Remaining parameters are mapped with a dtype cast:
for mlc_name, mlc_param in named_parameters.items():
    if mlc_name not in mapping.param_map:
        mapping.add_mapping(
            mlc_name,
            [mlc_name],
            functools.partial(
                lambda x, dtype: x.astype(dtype),
                dtype=mlc_param.dtype,
            ),
        )
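One plausible reason the mappings bind `dtype` through `functools.partial` rather than letting the lambda read `mlc_param.dtype` directly is Python's late-binding closures: a lambda created in a loop looks up the loop variable when it is called, not when it is defined. The sketch below uses a hypothetical `dtypes` dictionary to show the difference; it is an illustration of the language behavior, not a statement about the authors' intent.

```python
import functools

# Hypothetical parameter names and target dtypes.
dtypes = {"a.weight": "float16", "b.weight": "float32"}

late_bound = {}
bound = {}
for name, dtype in dtypes.items():
    # Reads `dtype` at call time, so it sees the final loop value.
    late_bound[name] = lambda x: (x, dtype)
    # Captures the current value of `dtype` at definition time.
    bound[name] = functools.partial(lambda x, dtype: (x, dtype), dtype=dtype)

print(late_bound["a.weight"](0))  # (0, 'float32')
print(bound["a.weight"](0))  # (0, 'float16')
```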
Function: awq
def awq(model_config: OLMoConfig, quantization: Quantization) -> ExternMapping:
Returns a parameter mapping from MLC LLM parameter names to AWQ pre-quantized parameter names.
Initialization
model, _ = awq_quant(model_config, quantization)
_, _named_params, _ = model.export_tvm(
    spec=model.get_default_spec(),
    allow_extern=True,
)
named_parameters = dict(_named_params)
Quantized Parameter Fusion
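For each layer, three AWQ-specific suffixes (qweight, qzeros, scales) are mapped for both the QKV attention projection and the gate-up MLP projection. Concatenation uses axis=1 rather than axis=0 because the AWQ GEMM format stores these tensors transposed relative to the HuggingFace weight layout, so the output dimension lies on axis 1: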
for quantize_suffix in ["qweight", "qzeros", "scales"]:
    mlc_name = f"{attn}.qkv_proj.{quantize_suffix}"
    mapping.add_mapping(
        mlc_name,
        [
            f"{attn}.q_proj.{quantize_suffix}",
            f"{attn}.k_proj.{quantize_suffix}",
            f"{attn}.v_proj.{quantize_suffix}",
        ],
        functools.partial(
            lambda q, k, v, dtype: np.concatenate(
                [q, k, v],
                axis=1,  # AWQ GEMM would transpose the weight
            ).astype(dtype),
            dtype=mlc_param.dtype,
        ),
    )
The MLP gate-up AWQ fusion follows the same pattern with axis=1:
for quantize_suffix in ["qweight", "qzeros", "scales"]:
    mlc_name = f"{mlp}.gate_up_proj.{quantize_suffix}"
    mapping.add_mapping(
        mlc_name,
        [
            f"{mlp}.gate_proj.{quantize_suffix}",
            f"{mlp}.up_proj.{quantize_suffix}",
        ],
        functools.partial(
            lambda gate, up, dtype: np.concatenate(
                [gate, up],
                axis=1,  # AWQ GEMM would transpose the weight
            ).astype(dtype),
            dtype=mlc_param.dtype,
        ),
    )
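The axis choice can be checked against tensor shapes. The sketch below assumes the commonly used AWQ GEMM layout, in which `qweight` is `(in_features, out_features // 8)`, `qzeros` is `(in_features // group_size, out_features // 8)`, and `scales` is `(in_features // group_size, out_features)`. These shapes, the sizes, and the `fake_awq` helper are assumptions for illustration and are not taken from the loader itself.

```python
import numpy as np

# Hypothetical sizes; `pack` is the number of 4-bit values per int32.
in_features, group_size, pack = 128, 128, 8


def fake_awq(out_features):
    """Build zero-filled tensors with the assumed AWQ GEMM shapes."""
    groups = in_features // group_size
    return {
        "qweight": np.zeros((in_features, out_features // pack), dtype="int32"),
        "qzeros": np.zeros((groups, out_features // pack), dtype="int32"),
        "scales": np.zeros((groups, out_features), dtype="float16"),
    }


q, k, v = fake_awq(64), fake_awq(32), fake_awq(32)

# The output dimension sits on axis 1 for all three tensors, so fusing
# Q, K, and V means concatenating along axis 1.
fused = {
    suffix: np.concatenate([q[suffix], k[suffix], v[suffix]], axis=1)
    for suffix in ("qweight", "qzeros", "scales")
}
for suffix, arr in fused.items():
    print(suffix, arr.shape)
```

Under these assumed shapes, axis 0 carries the input dimension (or the number of quantization groups) and is identical across Q, K, and V, which is why the fused tensors grow only along axis 1.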
Comparison with Llama Loader
The OLMo loader is structurally identical to the Llama loader. Both models share the same architecture pattern (Llama-style decoder) with the same parameter fusion strategy:
| Aspect | OLMo | Llama |
|---|---|---|
| QKV fusion (HF) | axis=0 | axis=0 |
| QKV fusion (AWQ) | axis=1 | axis=1 |
| Gate-up fusion (HF) | axis=0 | axis=0 |
| Gate-up fusion (AWQ) | axis=1 | axis=1 |
| Rotary emb handling | Marked unused | Marked unused |
| export_tvm return | 3 elements (both paths) | 3 elements (both paths) |
The only differences are the model class names (OLMoForCausalLM vs LlamaForCausalLM), configuration classes, and the quantization utility import.
Dependencies
- functools -- for functools.partial
- numpy -- for array concatenation and dtype casting
- mlc_llm.loader.ExternMapping -- the core mapping data structure
- mlc_llm.quantization.Quantization -- quantization configuration
- .olmo_model.OLMoConfig, .olmo_model.OLMoForCausalLM -- OLMo model definitions
- .olmo_quantization.awq_quant -- AWQ quantization utility