Implementation:Mlc ai Mlc llm OLMo Loader

Overview

The OLMo Loader module (python/mlc_llm/model/olmo/olmo_loader.py) defines parameter mappings for converting OLMo (Open Language Model) weights from HuggingFace and AWQ formats into MLC LLM's internal representation. The structure closely mirrors the Llama loader, providing both huggingface and awq functions for standard and pre-quantized weight formats respectively.

Location

File: python/mlc_llm/model/olmo/olmo_loader.py
Lines: 172
Module: mlc_llm.model.olmo

Function: huggingface

def huggingface(model_config: OLMoConfig, quantization: Quantization) -> ExternMapping:

Returns a parameter mapping from MLC LLM parameter names to HuggingFace PyTorch parameter names for the OLMo architecture.

Parameters:

Parameter	Type	Description
`model_config`	`OLMoConfig`	The configuration of the OLMo model.
`quantization`	`Quantization`	The quantization configuration.

Initialization

model = OLMoForCausalLM(model_config)
if quantization is not None:
    model.to(quantization.model_dtype)
_, _named_params, _ = model.export_tvm(
    spec=model.get_default_spec(),
    allow_extern=True,
)
named_parameters = dict(_named_params)

Per-Layer Mappings

For each of the num_hidden_layers transformer layers:

QKV Projection Fusion

The separate Q, K, and V projection weights are concatenated along axis 0 (the output dimension of a HuggingFace linear weight) into a single qkv_proj.weight:

attn = f"model.layers.{i}.self_attn"
mlc_name = f"{attn}.qkv_proj.weight"
mlc_param = named_parameters[mlc_name]  # supplies the target dtype used below
mapping.add_mapping(
    mlc_name,
    [
        f"{attn}.q_proj.weight",
        f"{attn}.k_proj.weight",
        f"{attn}.v_proj.weight",
    ],
    functools.partial(
        lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
        dtype=mlc_param.dtype,
    ),
)
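The effect of the axis-0 concatenation can be checked on toy arrays. The following is a minimal sketch, not part of the loader: the sizes are illustrative, and the only assumption is that the weights use the (out_features, in_features) layout. It shows that one matmul against the fused weight, followed by a split, reproduces the three separate projections.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, q_out, kv_out = 8, 8, 4  # illustrative sizes, not OLMo's real ones

# Linear weights laid out as (out_features, in_features).
q = rng.standard_normal((q_out, hidden))
k = rng.standard_normal((kv_out, hidden))
v = rng.standard_normal((kv_out, hidden))

# Same operation as the mapping function: stack along axis 0, then cast.
qkv = np.concatenate([q, k, v], axis=0)
qkv_fp16 = qkv.astype("float16")

# One fused matmul followed by a split along the output dimension...
x = rng.standard_normal((2, hidden))
fused_out = x @ qkv.T
q_part, k_part, v_part = np.split(fused_out, [q_out, q_out + kv_out], axis=1)

# ...matches the three separate projections.
matches = (
    np.allclose(q_part, x @ q.T)
    and np.allclose(k_part, x @ k.T)
    and np.allclose(v_part, x @ v.T)
)
print(qkv_fp16.shape, qkv_fp16.dtype, matches)
```

The same reasoning applies to any group of projections that share an input, which is why the gate and up projections can be fused in the same way.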

MLP Gate-Up Fusion

The gate and up projection weights are fused the same way, concatenated along axis 0 into a single gate_up_proj.weight:

mlp = f"model.layers.{i}.mlp"
mlc_name = f"{mlp}.gate_up_proj.weight"
mlc_param = named_parameters[mlc_name]  # supplies the target dtype used below
mapping.add_mapping(
    mlc_name,
    [
        f"{mlp}.gate_proj.weight",
        f"{mlp}.up_proj.weight",
    ],
    functools.partial(
        lambda gate, up, dtype: np.concatenate([gate, up], axis=0).astype(dtype),
        dtype=mlc_param.dtype,
    ),
)
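The mapping functions bind dtype through functools.partial instead of reading it from the enclosing scope. One plausible reason is Python's late binding of closure variables: a plain lambda created in a loop would see only the last value of the loop variable. The sketch below illustrates the difference; build_casts and result_dtypes are hypothetical helpers written for this example and do not appear in the loader.

```python
import functools
import numpy as np


def build_casts(dtypes):
    """Create cast functions in a loop, with and without functools.partial."""
    late_bound, bound = [], []
    for dtype in dtypes:
        # Looks up `dtype` when called, so it sees the final loop value.
        late_bound.append(lambda x: x.astype(dtype))
        # Freezes the current `dtype` at creation time.
        bound.append(
            functools.partial(lambda x, dtype: x.astype(dtype), dtype=dtype)
        )
    return late_bound, bound


def result_dtypes(funcs, x):
    """Apply each cast function and report the resulting dtype names."""
    return [str(f(x).dtype) for f in funcs]


late_bound, bound = build_casts(["float16", "float32"])
x = np.ones(3, dtype="float64")

late_result = result_dtypes(late_bound, x)  # ['float32', 'float32']
bound_result = result_dtypes(bound, x)      # ['float16', 'float32']
print(late_result, bound_result)
```

With partial, each mapped parameter keeps the dtype that was current when its mapping was registered.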

Unused Parameters

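The rotary embedding's inv_freq buffer found in HuggingFace checkpoints has no corresponding MLC parameter, so for each layer it is registered as unused rather than mapped:

mapping.add_unused(f"{attn}.rotary_emb.inv_freq")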

Identity Fallback

Every MLC parameter not covered by an explicit mapping is mapped one-to-one to the HuggingFace parameter of the same name, with only a dtype cast:

for mlc_name, mlc_param in named_parameters.items():
    if mlc_name not in mapping.param_map:
        mapping.add_mapping(
            mlc_name,
            [mlc_name],
            functools.partial(
                lambda x, dtype: x.astype(dtype),
                dtype=mlc_param.dtype,
            ),
        )
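To show how the three kinds of entries (fused, unused, identity) fit together, here is a minimal sketch of a mapping being built and applied to a toy checkpoint. ToyMapping and convert are hypothetical stand-ins written for this example. They copy only the add_mapping, add_unused, and param_map names seen in the excerpts, and the way MLC LLM actually consumes an ExternMapping is not described here. All shapes and the target_dtypes dictionary are made up.

```python
import functools
import numpy as np


class ToyMapping:
    """Hypothetical stand-in for ExternMapping (illustration only)."""

    def __init__(self):
        self.param_map = {}  # MLC name -> list of source names
        self.map_func = {}   # MLC name -> function combining the sources
        self.unused = set()  # source names deliberately ignored

    def add_mapping(self, mlc_name, source_names, func):
        self.param_map[mlc_name] = source_names
        self.map_func[mlc_name] = func

    def add_unused(self, name):
        self.unused.add(name)


def convert(mapping, source):
    """Apply each mapping function to the source tensors it names."""
    return {
        name: mapping.map_func[name](*(source[s] for s in srcs))
        for name, srcs in mapping.param_map.items()
    }


attn = "model.layers.0.self_attn"
qkv_name = f"{attn}.qkv_proj.weight"
embed_name = "model.embed_tokens.weight"
source = {  # toy checkpoint with made-up shapes
    f"{attn}.q_proj.weight": np.ones((4, 4), dtype="float32"),
    f"{attn}.k_proj.weight": np.ones((4, 4), dtype="float32"),
    f"{attn}.v_proj.weight": np.ones((4, 4), dtype="float32"),
    f"{attn}.rotary_emb.inv_freq": np.ones(2, dtype="float32"),
    embed_name: np.ones((10, 4), dtype="float32"),
}
# Target dtype per MLC parameter, standing in for named_parameters.
target_dtypes = {qkv_name: "float16", embed_name: "float16"}

mapping = ToyMapping()
mapping.add_mapping(
    qkv_name,
    [f"{attn}.q_proj.weight", f"{attn}.k_proj.weight", f"{attn}.v_proj.weight"],
    functools.partial(
        lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
        dtype=target_dtypes[qkv_name],
    ),
)
mapping.add_unused(f"{attn}.rotary_emb.inv_freq")

# Identity fallback: anything not mapped yet is cast and copied by name.
for mlc_name, dtype in target_dtypes.items():
    if mlc_name not in mapping.param_map:
        mapping.add_mapping(
            mlc_name,
            [mlc_name],
            functools.partial(lambda x, dtype: x.astype(dtype), dtype=dtype),
        )

converted = convert(mapping, source)
print({name: (t.shape, str(t.dtype)) for name, t in converted.items()})
```

In this sketch the inv_freq entry never reaches the converted dictionary, because no MLC parameter maps to it.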

Function: awq

def awq(model_config: OLMoConfig, quantization: Quantization) -> ExternMapping:

Returns a parameter mapping from MLC LLM parameter names to AWQ pre-quantized parameter names.

Initialization

model, _ = awq_quant(model_config, quantization)
_, _named_params, _ = model.export_tvm(
    spec=model.get_default_spec(),
    allow_extern=True,
)
named_parameters = dict(_named_params)

Quantized Parameter Fusion

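For each layer, the three AWQ tensors (qweight, qzeros, scales) are fused for both the QKV attention projection and the gate-up MLP projection, with attn and mlp defined as in the huggingface function. Concatenation uses axis=1 rather than axis=0 because the AWQ GEMM layout stores the weight transposed, which puts the output dimension on axis 1: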

for quantize_suffix in ["qweight", "qzeros", "scales"]:
    mlc_name = f"{attn}.qkv_proj.{quantize_suffix}"
    mlc_param = named_parameters[mlc_name]  # supplies the target dtype used below
    mapping.add_mapping(
        mlc_name,
        [
            f"{attn}.q_proj.{quantize_suffix}",
            f"{attn}.k_proj.{quantize_suffix}",
            f"{attn}.v_proj.{quantize_suffix}",
        ],
        functools.partial(
            lambda q, k, v, dtype: np.concatenate(
                [q, k, v],
                axis=1,  # AWQ GEMM would transpose the weight
            ).astype(dtype),
            dtype=mlc_param.dtype,
        ),
    )

The MLP gate-up AWQ fusion follows the same pattern with axis=1:

for quantize_suffix in ["qweight", "qzeros", "scales"]:
    mlc_name = f"{mlp}.gate_up_proj.{quantize_suffix}"
    mlc_param = named_parameters[mlc_name]  # supplies the target dtype used below
    mapping.add_mapping(
        mlc_name,
        [
            f"{mlp}.gate_proj.{quantize_suffix}",
            f"{mlp}.up_proj.{quantize_suffix}",
        ],
        functools.partial(
            lambda gate, up, dtype: np.concatenate(
                [gate, up],
                axis=1,  # AWQ GEMM would transpose the weight
            ).astype(dtype),
            dtype=mlc_param.dtype,
        ),
    )
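The axis change can be illustrated with unpacked toy arrays. This is a sketch under a stated assumption: each tensor is stored as (in_features, out_features), so the output dimension is the second axis. Real AWQ tensors are more involved (qweight and qzeros pack several low-bit values into each integer, and qzeros and scales are stored per group), and the sketch ignores that packing.

```python
import numpy as np

in_features, q_out, kv_out = 8, 6, 4  # illustrative sizes only

# Assumed transposed layout: (in_features, out_features).
q_t = np.arange(in_features * q_out).reshape(in_features, q_out)
k_t = np.arange(in_features * kv_out).reshape(in_features, kv_out) + 1000
v_t = np.arange(in_features * kv_out).reshape(in_features, kv_out) + 2000

# Fusing along axis 1 stacks the output columns side by side.
fused_t = np.concatenate([q_t, k_t, v_t], axis=1)

# Equivalent route: transpose to (out, in), fuse on axis 0, transpose back.
reference = np.concatenate([q_t.T, k_t.T, v_t.T], axis=0).T

same = np.array_equal(fused_t, reference)
print(fused_t.shape, same)
```

Concatenating on axis 0 instead would raise an error here, because q_t and k_t differ in their second dimension, and would fuse along the input dimension when the sizes happen to match.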

Comparison with Llama Loader

The OLMo loader is structurally identical to the Llama loader. Both models share the same architecture pattern (Llama-style decoder) with the same parameter fusion strategy:

Aspect	OLMo	Llama
QKV fusion (HF)	axis=0	axis=0
QKV fusion (AWQ)	axis=1	axis=1
Gate-up fusion (HF)	axis=0	axis=0
Gate-up fusion (AWQ)	axis=1	axis=1
Rotary emb handling	Marked unused	Marked unused
`export_tvm` return	3 elements (both paths)	3 elements (both paths)

The only differences are the model class names (OLMoForCausalLM vs LlamaForCausalLM), configuration classes, and the quantization utility import.

Dependencies

functools -- for functools.partial
numpy -- for array concatenation and dtype casting
mlc_llm.loader.ExternMapping -- the core mapping data structure
mlc_llm.quantization.Quantization -- quantization configuration
.olmo_model.OLMoConfig, .olmo_model.OLMoForCausalLM -- OLMo model definitions
.olmo_quantization.awq_quant -- AWQ quantization utility
