Implementation:Mlc ai Mlc llm OLMo Loader

From Leeroopedia


Overview

The OLMo Loader module (python/mlc_llm/model/olmo/olmo_loader.py) defines the parameter mappings that convert OLMo (Open Language Model) weights from HuggingFace and AWQ formats into MLC LLM's internal representation. Its structure closely mirrors the Llama loader: the huggingface function handles standard checkpoints, and the awq function handles pre-quantized AWQ checkpoints.

Location

  • File: python/mlc_llm/model/olmo/olmo_loader.py
  • Lines: 172
  • Module: mlc_llm.model.olmo

Function: huggingface

def huggingface(model_config: OLMoConfig, quantization: Quantization) -> ExternMapping:

Returns a parameter mapping from MLC LLM parameter names to HuggingFace PyTorch parameter names for the OLMo architecture.

Parameters:

Parameter | Type | Description
model_config | OLMoConfig | The configuration of the OLMo model.
quantization | Quantization | The quantization configuration.
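
To make the shape of the returned mapping concrete, the following is a minimal sketch built on a toy stand-in. ToyMapping and convert are hypothetical names invented for illustration. Only the add_mapping, add_unused, and param_map names come from the snippets on this page, and the real ExternMapping may differ in its details.

```python
import functools
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

import numpy as np


@dataclass
class ToyMapping:
    """Hypothetical stand-in for ExternMapping; not the MLC LLM class."""

    param_map: Dict[str, List[str]] = field(default_factory=dict)
    map_func: Dict[str, Callable[..., np.ndarray]] = field(default_factory=dict)
    unused_params: Set[str] = field(default_factory=set)

    def add_mapping(self, mlc_name, source_names, func):
        # Record which source tensors feed this MLC parameter, and how.
        self.param_map[mlc_name] = source_names
        self.map_func[mlc_name] = func

    def add_unused(self, name):
        # Source tensors listed here are intentionally never converted.
        self.unused_params.add(name)


def convert(mapping, checkpoint):
    """Build each MLC parameter from the source tensors it depends on."""
    return {
        mlc_name: mapping.map_func[mlc_name](*(checkpoint[s] for s in sources))
        for mlc_name, sources in mapping.param_map.items()
    }


mapping = ToyMapping()
mapping.add_mapping(
    "norm.weight",
    ["norm.weight"],
    functools.partial(lambda x, dtype: x.astype(dtype), dtype="float16"),
)
mapping.add_unused("rotary_emb.inv_freq")

# Toy checkpoint with one mapped tensor and one unused tensor.
checkpoint = {
    "norm.weight": np.ones(4, dtype=np.float32),
    "rotary_emb.inv_freq": np.ones(2, dtype=np.float32),
}
converted = convert(mapping, checkpoint)
```

In this sketch, converted holds only norm.weight, cast to float16; the tensor registered with add_unused is never read.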

Initialization

model = OLMoForCausalLM(model_config)
if quantization is not None:
    model.to(quantization.model_dtype)
_, _named_params, _ = model.export_tvm(
    spec=model.get_default_spec(),
    allow_extern=True,
)
named_parameters = dict(_named_params)

Per-Layer Mappings

For each of the num_hidden_layers transformer layers:

QKV Projection Fusion

The Q, K, and V projection weights are concatenated along axis 0 (the output dimension of a HuggingFace linear weight) into a single qkv_proj.weight:

attn = f"model.layers.{i}.self_attn"
mlc_name = f"{attn}.qkv_proj.weight"
mlc_param = named_parameters[mlc_name]  # target parameter; supplies the dtype
mapping.add_mapping(
    mlc_name,
    [
        f"{attn}.q_proj.weight",
        f"{attn}.k_proj.weight",
        f"{attn}.v_proj.weight",
    ],
    functools.partial(
        lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
        dtype=mlc_param.dtype,
    ),
)
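
The effect of the axis-0 concatenation can be checked on toy arrays. The sketch below reuses the same lambda as the mapping above; the sizes and the float16 target dtype are hypothetical values chosen for illustration, not taken from any OLMo configuration.

```python
import functools

import numpy as np

# Hypothetical toy sizes; real OLMo checkpoints are far larger.
hidden_size, q_out, kv_out = 8, 8, 4

rng = np.random.default_rng(0)
# HuggingFace linear weights are stored as (out_features, in_features).
q = rng.standard_normal((q_out, hidden_size)).astype(np.float32)
k = rng.standard_normal((kv_out, hidden_size)).astype(np.float32)
v = rng.standard_normal((kv_out, hidden_size)).astype(np.float32)

# Same callable shape as in the loader: the target dtype is bound up front.
fuse_qkv = functools.partial(
    lambda q, k, v, dtype: np.concatenate([q, k, v], axis=0).astype(dtype),
    dtype="float16",
)
qkv = fuse_qkv(q, k, v)

# Rows stack in q, k, v order, so slices along axis 0 recover each projection.
q_rows = qkv[:q_out]
k_rows = qkv[q_out : q_out + kv_out]
v_rows = qkv[q_out + kv_out :]
```

The fused array has q_out + 2 * kv_out rows and keeps the hidden_size columns unchanged.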

MLP Gate-Up Fusion

Gate and up projection weights are concatenated along axis 0:

mlp = f"model.layers.{i}.mlp"
mlc_name = f"{mlp}.gate_up_proj.weight"
mlc_param = named_parameters[mlc_name]  # target parameter; supplies the dtype
mapping.add_mapping(
    mlc_name,
    [
        f"{mlp}.gate_proj.weight",
        f"{mlp}.up_proj.weight",
    ],
    functools.partial(
        lambda gate, up, dtype: np.concatenate([gate, up], axis=0).astype(dtype),
        dtype=mlc_param.dtype,
    ),
)
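
As a sanity check on why axis 0 is a safe place to fuse, the sketch below shows that one matrix multiply against the fused weight, followed by a split, reproduces the two separate projections. How the model itself consumes gate_up_proj is not shown on this page, so the split step is an assumption; the sizes are hypothetical toy values.

```python
import numpy as np

hidden_size, intermediate_size = 4, 6  # hypothetical toy sizes
rng = np.random.default_rng(0)
gate_w = rng.standard_normal((intermediate_size, hidden_size))
up_w = rng.standard_normal((intermediate_size, hidden_size))
x = rng.standard_normal((2, hidden_size))  # two token vectors

# Fuse along axis 0 (the output dimension), as the mapping does.
gate_up_w = np.concatenate([gate_w, up_w], axis=0)

# One matmul against the fused weight, then split the result in half.
# (Assumed consumption pattern; not taken from the OLMo model code.)
fused_out = x @ gate_up_w.T
gate_out, up_out = np.split(fused_out, 2, axis=-1)

# Reference: two separate projections.
ref_gate = x @ gate_w.T
ref_up = x @ up_w.T
```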

Unused Parameters

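Each layer's rotary_emb.inv_freq entry is marked unused rather than mapped to an MLC LLM parameter:

mapping.add_unused(f"{attn}.rotary_emb.inv_freq")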

Identity Fallback

Every parameter not already covered by a fusion mapping keeps its name and is mapped one-to-one, with only a dtype cast:

for mlc_name, mlc_param in named_parameters.items():
    if mlc_name not in mapping.param_map:
        mapping.add_mapping(
            mlc_name,
            [mlc_name],
            functools.partial(
                lambda x, dtype: x.astype(dtype),
                dtype=mlc_param.dtype,
            ),
        )
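
One plausible reason the loader binds dtype with functools.partial instead of closing over the loop variable is Python's late binding of closures. The sketch below illustrates that behavior in isolation; the parameter names and dtypes are hypothetical, and the source does not state this motivation explicitly.

```python
import functools

dtypes = {"a.weight": "float16", "b.weight": "float32"}

# Plain closures look up `dtype` when called, so every entry sees the last value.
late = {}
for name, dtype in dtypes.items():
    late[name] = lambda x: (x, dtype)

# functools.partial stores the value at creation time, one per entry.
bound = {}
for name, dtype in dtypes.items():
    bound[name] = functools.partial(lambda x, dtype: (x, dtype), dtype=dtype)

late_result = late["a.weight"]("w")    # sees the final loop value of dtype
bound_result = bound["a.weight"]("w")  # sees the dtype bound for "a.weight"
```

Here late_result carries "float32" even though "a.weight" was registered with "float16", while bound_result keeps "float16".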

Function: awq

def awq(model_config: OLMoConfig, quantization: Quantization) -> ExternMapping:

Returns a parameter mapping from MLC LLM parameter names to AWQ pre-quantized parameter names.

Initialization

model, _ = awq_quant(model_config, quantization)
_, _named_params, _ = model.export_tvm(
    spec=model.get_default_spec(),
    allow_extern=True,
)
named_parameters = dict(_named_params)

Quantized Parameter Fusion

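For each layer, the three AWQ tensor suffixes (qweight, qzeros, scales) are each fused for both the QKV attention projection and the gate-up MLP projection, with attn and mlp defined as in the huggingface function. Concatenation uses axis=1 instead of axis=0 because, as the source comment notes, AWQ GEMM transposes the weight: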

for quantize_suffix in ["qweight", "qzeros", "scales"]:
    mlc_name = f"{attn}.qkv_proj.{quantize_suffix}"
    mlc_param = named_parameters[mlc_name]  # target parameter; supplies the dtype
    mapping.add_mapping(
        mlc_name,
        [
            f"{attn}.q_proj.{quantize_suffix}",
            f"{attn}.k_proj.{quantize_suffix}",
            f"{attn}.v_proj.{quantize_suffix}",
        ],
        functools.partial(
            lambda q, k, v, dtype: np.concatenate(
                [q, k, v],
                axis=1,  # AWQ GEMM would transpose the weight
            ).astype(dtype),
            dtype=mlc_param.dtype,
        ),
    )

The MLP gate-up AWQ fusion follows the same pattern with axis=1:

for quantize_suffix in ["qweight", "qzeros", "scales"]:
    mlc_name = f"{mlp}.gate_up_proj.{quantize_suffix}"
    mlc_param = named_parameters[mlc_name]  # target parameter; supplies the dtype
    mapping.add_mapping(
        mlc_name,
        [
            f"{mlp}.gate_proj.{quantize_suffix}",
            f"{mlp}.up_proj.{quantize_suffix}",
        ],
        functools.partial(
            lambda gate, up, dtype: np.concatenate(
                [gate, up],
                axis=1,  # AWQ GEMM would transpose the weight
            ).astype(dtype),
            dtype=mlc_param.dtype,
        ),
    )
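
The source comment says only that AWQ GEMM transposes the weight. A shape-level sketch of what that implies is given below, assuming the packed layout commonly used by 4-bit AWQ GEMM checkpoints, where the input dimension is on axis 0 and the (packed) output dimension is on axis 1. The layout, sizes, and the awq_tensors helper are assumptions for illustration and are not taken from the OLMo loader.

```python
import numpy as np

# Assumed AWQ GEMM layout (4-bit, eight values packed per int32):
#   qweight: (in_features, out_features // 8)
#   qzeros:  (in_features // group_size, out_features // 8)
#   scales:  (in_features // group_size, out_features)
in_features, group_size, pack = 16, 8, 8
out_q, out_kv = 16, 8  # hypothetical toy output sizes


def awq_tensors(out_features):
    """Hypothetical helper: zero-filled tensors with the assumed shapes."""
    groups = in_features // group_size
    return {
        "qweight": np.zeros((in_features, out_features // pack), dtype=np.int32),
        "qzeros": np.zeros((groups, out_features // pack), dtype=np.int32),
        "scales": np.zeros((groups, out_features), dtype=np.float16),
    }


q, k, v = awq_tensors(out_q), awq_tensors(out_kv), awq_tensors(out_kv)

# The output dimension sits on axis 1, so that is where q, k, v are joined.
fused = {
    suffix: np.concatenate([q[suffix], k[suffix], v[suffix]], axis=1)
    for suffix in ["qweight", "qzeros", "scales"]
}
fused_shapes = {suffix: arr.shape for suffix, arr in fused.items()}
```

Under these assumptions axis 0 is unchanged for all three tensors and only axis 1 grows, which matches the axis=1 choice in both AWQ fusion mappings above.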

Comparison with Llama Loader

The OLMo loader is structurally identical to the Llama loader. Both models follow the same Llama-style decoder architecture and use the same parameter fusion strategy:

Aspect | OLMo | Llama
QKV fusion (HF) | axis=0 | axis=0
QKV fusion (AWQ) | axis=1 | axis=1
Gate-up fusion (HF) | axis=0 | axis=0
Gate-up fusion (AWQ) | axis=1 | axis=1
Rotary emb handling | Marked unused | Marked unused
export_tvm return | 3 elements (both paths) | 3 elements (both paths)

The only differences are the model class names (OLMoForCausalLM vs LlamaForCausalLM), configuration classes, and the quantization utility import.

Dependencies

  • functools -- for functools.partial
  • numpy -- for array concatenation and dtype casting
  • mlc_llm.loader.ExternMapping -- the core mapping data structure
  • mlc_llm.quantization.Quantization -- quantization configuration
  • .olmo_model.OLMoConfig, .olmo_model.OLMoForCausalLM -- OLMo model definitions
  • .olmo_quantization.awq_quant -- AWQ quantization utility
