Implementation:Mlc ai Mlc llm OLMo Model

Knowledge Sources	Mlc_ai_Mlc_llm
Domains	Model_Architecture, LLM
Last Updated	2026-02-09 19:00 GMT

Overview

Implements the OLMo (Open Language Model) architecture for causal language modeling within the MLC LLM framework, featuring LayerNorm without affine parameters, optional QKV clamping, pipeline parallelism, and disaggregated inference support.

Description

This module provides the TVM Relax-based implementation of AI2's OLMo model architecture. OLMo is structurally similar to Llama but differs in the following ways:

LayerNorm without affine parameters: Unlike Llama's RMSNorm with learnable weights, OLMo uses standard LayerNorm with elementwise_affine=False, meaning no learnable scale or bias in the normalization layers.
QKV clamping (clip_qkv): An optional feature that clamps the output of the fused QKV projection to the range [-clip_qkv, clip_qkv] immediately after the linear projection. Some model variants are trained with this clamp for stability, so inference must apply it as well.
Configurable activation functions: Supports multiple activation functions (silu, gelu, relu, swish, gelu_new) via the ACT2FN mapping dictionary, configured through hidden_act.
Tied word embeddings: Uses a custom OLMoEmbedding class whose lm_head_forward method reuses the transposed embedding weight as the LM head, so the embedding and lm_head can share one weight matrix.
Pipeline parallelism: Supports partitioning layers across pipeline stages with pipeline_parallel_stages and automatic boundary insertion via op_ext.pipeline_stage_boundary.
Disaggregated inference: Provides dedicated methods for extracting last hidden states (prefill_to_last_hidden_states, batch_forward_to_last_hidden_states, batch_select_last_hidden_states, etc.) enabling disaggregated prefill/decode workflows.
RoPE with optional scaling: Supports standard RoPE, with optional scaling configured through the rope_scaling parameter.
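The configurable activation listed above can be pictured as a name-to-function lookup. The following is a minimal pure-Python sketch rather than the module's code: ACT2FN_SKETCH and gated_unit are hypothetical names, the formulas are the standard textbook definitions, and "swish" is treated here as an alias of "silu".

```python
import math


def silu(x: float) -> float:
    """SiLU (also known as swish with beta=1): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    """Exact GELU, written with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_new(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def relu(x: float) -> float:
    """ReLU: negative inputs become zero."""
    return max(0.0, x)


# Illustrative stand-in for an ACT2FN-style mapping (hypothetical name).
ACT2FN_SKETCH = {
    "silu": silu,
    "swish": silu,
    "gelu": gelu,
    "gelu_new": gelu_new,
    "relu": relu,
}


def gated_unit(gate: float, up: float, hidden_act: str) -> float:
    """Scalar view of a gated feed-forward unit: act(gate) * up."""
    return ACT2FN_SKETCH[hidden_act](gate) * up
```

Selecting the function once from hidden_act keeps the feed-forward code itself independent of the activation choice.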

The model consists of OLMoModel (embedding + decoder layers with pipeline partitioning + final LayerNorm), wrapped by OLMoForCausalLM which adds the LM head and the full suite of inference methods.
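One plausible way to partition decoder layers across pipeline stages is an even split with ceiling division. The sketch below is an assumption for illustration only: layer_partition and boundaries_before are hypothetical helpers, and the exact split used by the module should be confirmed in the source file.

```python
from typing import List


def layer_partition(num_hidden_layers: int, pipeline_parallel_stages: int) -> List[int]:
    """Return the first layer index of each stage, followed by the total layer count."""
    # Ceiling division so that every layer is assigned to some stage.
    per_stage = -(-num_hidden_layers // pipeline_parallel_stages)
    starts = [
        min(stage * per_stage, num_hidden_layers)
        for stage in range(pipeline_parallel_stages)
    ]
    return starts + [num_hidden_layers]


def boundaries_before(num_hidden_layers: int, pipeline_parallel_stages: int) -> List[int]:
    """Layer indices in front of which a stage boundary would be inserted."""
    # Drop the leading 0 and the trailing total: only interior cut points remain.
    return layer_partition(num_hidden_layers, pipeline_parallel_stages)[1:-1]
```

With 32 layers and 4 stages this yields stages of 8 layers each; with a single stage there are no interior boundaries.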

Usage

Use this module when compiling OLMo family models (for example, OLMo 1B and 7B) for deployment with MLC LLM. The model is identified by the olmo model type in configuration files.

Code Reference

Source Location

Repository: Mlc_ai_Mlc_llm
File: python/mlc_llm/model/olmo/olmo_model.py

Signature

@dataclasses.dataclass
class OLMoConfig(ConfigBase):
    vocab_size: int = None
    hidden_size: int = None
    num_attention_heads: int = None
    num_key_value_heads: int = 0
    head_dim: int = 0
    position_embedding_base: int = 0
    rope_scaling: Optional[Dict[str, Any]] = None
    intermediate_size: int = None
    hidden_act: str = None
    num_hidden_layers: int = None
    tie_word_embeddings: bool = False
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    tensor_parallel_shards: int = 1
    pipeline_parallel_stages: int = 1
    clip_qkv: float = None
    ...

class OLMoForCausalLM(nn.Module):
    def __init__(self, config: OLMoConfig): ...
    def embed(self, input_ids: Tensor): ...
    def get_logits(self, hidden_states: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_prefill(self, input_embeds, logit_positions, paged_kv_cache): ...
    def batch_decode(self, input_embeds, paged_kv_cache): ...
    def batch_verify(self, input_embeds, paged_kv_cache): ...
    def prefill_to_last_hidden_states(self, input_embed, paged_kv_cache): ...
    def batch_forward_to_last_hidden_states(self, input_embeds, paged_kv_cache): ...
    def batch_select_last_hidden_states(self, hidden_states, logit_positions): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...

Import

from mlc_llm.model.olmo.olmo_model import OLMoConfig, OLMoForCausalLM

I/O Contract

Primary Classes

Class	Role	Key Characteristics
OLMoConfig	Model configuration	Optional clip_qkv, configurable hidden_act, pipeline parallelism
OLMoEmbedding	Shared embedding	lm_head_forward via weight transposition (from Qwen2Embedding)
OLMoAttention	GQA attention	Fused qkv_proj with optional clip_qkv clamping
OLMoFFN	Gated feed-forward	Configurable activation via ACT2FN, gate_up_proj + down_proj
OLMoDecoderLayer	Transformer block	LayerNorm without affine parameters (elementwise_affine=False)
OLMoModel	Core model	embed_tokens + layers with pipeline partition + LayerNorm
OLMoForCausalLM	Top-level model	Full inference suite with disaggregated methods, tied embedding support
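The weight sharing behind tied embeddings can be illustrated with plain nested lists. This is a toy sketch under stated assumptions, not the OLMoEmbedding code: TiedEmbeddingSketch is a hypothetical class, and the weight is assumed to have shape [vocab_size, hidden_size].

```python
from typing import List

Matrix = List[List[float]]


class TiedEmbeddingSketch:
    """Toy embedding whose [vocab_size, hidden_size] weight doubles as the LM head."""

    def __init__(self, weight: Matrix):
        self.weight = weight

    def forward(self, token_ids: List[int]) -> Matrix:
        # Embedding lookup: copy one weight row per token id.
        return [list(self.weight[token]) for token in token_ids]

    def lm_head_forward(self, hidden: Matrix) -> Matrix:
        # logits[i][v] = dot(hidden[i], weight[v]), i.e. hidden @ weight^T.
        return [
            [sum(h * w for h, w in zip(row, weight_row)) for weight_row in self.weight]
            for row in hidden
        ]


# vocab_size = 3, hidden_size = 2
emb = TiedEmbeddingSketch([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hidden = emb.forward([2, 0])          # rows for token ids 2 and 0
logits = emb.lm_head_forward(hidden)  # shape [2, vocab_size]
```

Because both directions read the same matrix, no separate lm_head weight needs to be stored when embeddings are tied.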

Forward Methods

Method	Input	Output
embed	Tensor[seq_len] (int32)	Tensor[seq_len, hidden_size]
get_logits	Tensor[seq_len, hidden_size]	Tensor[seq_len, vocab_size] (float32)
prefill	Tensor[1, seq_len, hidden_size], PagedKVCache	(Tensor[1, 1, vocab_size], PagedKVCache)
decode	Tensor[1, 1, hidden_size], PagedKVCache	(Tensor[1, 1, vocab_size], PagedKVCache)
batch_prefill	Tensor[1, seq_len, hidden_size], Tensor[batch_size], PagedKVCache	(Tensor, PagedKVCache)
batch_decode	Tensor[batch_size, 1, hidden_size], PagedKVCache	(Tensor, PagedKVCache)
prefill_to_last_hidden_states	Tensor[1, seq_len, hidden_size], PagedKVCache	(Tensor[1, seq_len, hidden_size], PagedKVCache)
batch_select_last_hidden_states	Tensor[seq_len, hidden_size], Tensor[batch_size]	Tensor[batch_size, hidden_size]
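The shapes listed for batch_select_last_hidden_states correspond to a gather along the sequence axis. The sketch below is an illustration on nested lists, not the module's implementation; select_last_hidden_states is a hypothetical helper, and logit_positions is assumed to hold the flat index of the last token of each sequence.

```python
from typing import List


def select_last_hidden_states(
    hidden_states: List[List[float]], logit_positions: List[int]
) -> List[List[float]]:
    """Pick one [hidden_size] row per sequence: [seq_len, H] -> [batch_size, H]."""
    return [list(hidden_states[pos]) for pos in logit_positions]


# Two sequences packed into one flat batch: lengths 3 and 2, hidden_size 2.
hidden_states = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
logit_positions = [2, 4]  # last token of each sequence
selected = select_last_hidden_states(hidden_states, logit_positions)
```

Only the selected rows need to be passed on to get_logits, which avoids computing vocabulary-sized outputs for every prompt token.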

QKV Clamping

When clip_qkv is set in the configuration, the attention module applies element-wise clamping to the QKV projection output:

# In OLMoAttention.forward():
qkv = self.qkv_proj(hidden_states)
if self.clip_qkv is not None:
    qkv = qkv.maximum(-self.clip_qkv).minimum(self.clip_qkv)
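The maximum-then-minimum chain above is an element-wise clamp. A small framework-free sketch of the same arithmetic follows; clamp_qkv is a hypothetical helper operating on a plain list, and the value 8.0 is only an example.

```python
from typing import List, Optional


def clamp_qkv(values: List[float], clip_qkv: Optional[float]) -> List[float]:
    """Clamp each value to [-clip_qkv, clip_qkv]; pass through when clip_qkv is None."""
    if clip_qkv is None:
        return list(values)
    # max(v, -c) raises values below -c; min(..., c) lowers values above c.
    return [min(max(v, -clip_qkv), clip_qkv) for v in values]


raw = [-10.0, -1.5, 0.0, 7.9, 12.0]
clipped = clamp_qkv(raw, 8.0)
unclipped = clamp_qkv(raw, None)
```

Values already inside the range are unchanged, so the clamp only affects outliers in the projection output.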

Key Differences from Llama

Feature	Llama	OLMo
Normalization	RMSNorm with learnable weight	LayerNorm without affine parameters
QKV clamping	Not supported	Optional via clip_qkv
Activation	SiLU only	Configurable (silu, gelu, relu, etc.)
LM head	Separate or tied (LlamaEmbedding)	Separate or tied (OLMoEmbedding)
Pipeline parallel	Supported	Supported (identical logic)
Disaggregation	Supported	Supported (identical logic)
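The normalization difference in the first row can be made concrete with a short numeric sketch. The functions below follow the standard definitions of the two normalizations rather than either model's code; the function names and the eps value of 1e-5 are assumptions for illustration.

```python
import math
from typing import List


def layer_norm_no_affine(x: List[float], eps: float = 1e-5) -> List[float]:
    """LayerNorm without scale or bias: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-5) -> List[float]:
    """RMSNorm: divide by the root mean square, then apply a learnable weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


ln_out = layer_norm_no_affine([1.0, 2.0, 3.0, 4.0])
rms_out = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
```

LayerNorm without affine parameters centers the input and carries no weights to load, whereas RMSNorm does not center and needs one weight vector per normalization layer.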

Usage Examples

from mlc_llm.model.olmo.olmo_model import OLMoConfig, OLMoForCausalLM

# Creating an OLMo config
config = OLMoConfig(
    vocab_size=50304,
    hidden_size=4096,
    num_attention_heads=32,
    num_key_value_heads=32,
    intermediate_size=11008,
    hidden_act="silu",
    num_hidden_layers=32,
    tie_word_embeddings=False,
    position_embedding_base=10000,
    context_window_size=2048,
    clip_qkv=None,  # Set to a float value to enable QKV clamping
)
model = OLMoForCausalLM(config)

# With QKV clamping enabled
config_with_clip = OLMoConfig(
    vocab_size=50304,
    hidden_size=4096,
    num_attention_heads=32,
    intermediate_size=11008,
    hidden_act="silu",
    num_hidden_layers=32,
    clip_qkv=8.0,
    context_window_size=2048,
)
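The zero defaults for num_key_value_heads and head_dim in the signature suggest that these fields are derived when left unset. The sketch below shows one plausible resolution and is an assumption, not the OLMoConfig code: resolve_head_config is a hypothetical helper, and the fused projection width assumes query, key, and value heads share one head_dim.

```python
from typing import Dict


def resolve_head_config(
    hidden_size: int,
    num_attention_heads: int,
    num_key_value_heads: int = 0,
    head_dim: int = 0,
) -> Dict[str, int]:
    """Fill in unset (zero) head fields and compute the fused QKV output width."""
    if num_key_value_heads == 0:
        # Assumed fallback: one key/value head per query head.
        num_key_value_heads = num_attention_heads
    if head_dim == 0:
        head_dim = hidden_size // num_attention_heads
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    return {
        "num_key_value_heads": num_key_value_heads,
        "head_dim": head_dim,
        # Query heads plus separate key and value heads, all of width head_dim.
        "qkv_out_features": (num_attention_heads + 2 * num_key_value_heads) * head_dim,
    }


resolved = resolve_head_config(hidden_size=4096, num_attention_heads=32)
```

Under these assumptions a 4096-wide model with 32 heads resolves to a head_dim of 128 and a fused QKV projection of width 12288.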
