Implementation:Mlc ai Mlc llm OLMo Model

From Leeroopedia


Knowledge Sources
Domains Model_Architecture, LLM
Last Updated 2026-02-09 19:00 GMT

Overview

Implements the OLMo (Open Language Model) architecture for causal language modeling within the MLC LLM framework, featuring LayerNorm without affine parameters, optional QKV clamping, pipeline parallelism, and disaggregated inference support.

Description

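This module provides the TVM Relax-based implementation of AI2's OLMo model architecture. The implementation is structurally similar to Llama. Its notable characteristics are listed below; some of them (pipeline parallelism and disaggregated inference) are shared with the Llama implementation, while the rest are specific to OLMo: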

  • LayerNorm without affine parameters: Unlike Llama's RMSNorm with learnable weights, OLMo uses standard LayerNorm with elementwise_affine=False, meaning no learnable scale or bias in the normalization layers.
  • QKV clamping (clip_qkv): An optional feature that clamps the QKV projection outputs to the range [-clip_qkv, clip_qkv] immediately after the linear projection, which provides training stability for certain model variants.
  • Configurable activation functions: Supports multiple activation functions (silu, gelu, relu, swish, gelu_new) via the ACT2FN mapping dictionary, configured through hidden_act.
  • Tied word embeddings: Uses a custom OLMoEmbedding class whose lm_head_forward method applies the transposed embedding weight, so the embedding and the LM head can share a single weight matrix when tie_word_embeddings is enabled.
  • Pipeline parallelism: Supports partitioning layers across pipeline stages with pipeline_parallel_stages and automatic boundary insertion via op_ext.pipeline_stage_boundary.
  • Disaggregated inference: Provides dedicated methods for extracting last hidden states (prefill_to_last_hidden_states, batch_forward_to_last_hidden_states, batch_select_last_hidden_states, etc.) enabling disaggregated prefill/decode workflows.
  • RoPE with optional scaling: Supports standard RoPE, with optional scaling configured through the rope_scaling parameter.
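To make the normalization difference concrete, here is a minimal NumPy sketch of LayerNorm with no learnable scale or bias, next to an RMSNorm with a learnable weight for contrast. It is an illustration only and not the TVM Relax code used by the module; the function names and the eps value of 1e-5 are assumptions.

```python
import numpy as np


def layer_norm_no_affine(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """LayerNorm over the last axis with no learnable parameters.

    Corresponds to elementwise_affine=False; eps is an assumed value.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # population variance
    return (x - mean) / np.sqrt(var + eps)


def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """RMSNorm with a learnable per-channel weight, shown for contrast."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```

The first function has nothing to load from a checkpoint, whereas the second needs one weight vector per normalization layer.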

The model consists of OLMoModel (embedding + decoder layers with pipeline partitioning + final LayerNorm), wrapped by OLMoForCausalLM which adds the LM head and the full suite of inference methods.
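As a rough illustration of how decoder layers might be split across pipeline stages, the sketch below computes stage boundaries with a ceil-division even split. The even-split rule and the helper name layer_partition are assumptions made for illustration; the actual partitioning logic lives in OLMoModel and may differ.

```python
from typing import List


def layer_partition(num_hidden_layers: int, pipeline_parallel_stages: int) -> List[int]:
    """Return stage boundaries: stage i owns layers [bounds[i], bounds[i + 1]).

    Hypothetical helper; assumes an even split using ceil division.
    """
    if pipeline_parallel_stages < 1:
        raise ValueError("pipeline_parallel_stages must be at least 1")
    # Ceil division so that every layer is assigned to some stage.
    layers_per_stage = -(-num_hidden_layers // pipeline_parallel_stages)
    bounds = [
        min(i * layers_per_stage, num_hidden_layers)
        for i in range(pipeline_parallel_stages)
    ]
    return bounds + [num_hidden_layers]
```

Under this assumption, a stage boundary would be inserted each time the layer index crosses one of the interior entries of the returned list.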

Usage

Use this module when compiling OLMo family models (OLMo 1B, 7B, 65B, etc.) for deployment with MLC LLM. The model is identified by the olmo model type in configuration files.

Code Reference

Source Location

Signature

@dataclasses.dataclass
class OLMoConfig(ConfigBase):
    vocab_size: int = None
    hidden_size: int = None
    num_attention_heads: int = None
    num_key_value_heads: int = 0
    head_dim: int = 0
    position_embedding_base: int = 0
    rope_scaling: Optional[Dict[str, Any]] = None
    intermediate_size: int = None
    hidden_act: str = None
    num_hidden_layers: int = None
    tie_word_embeddings: bool = False
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    tensor_parallel_shards: int = 1
    pipeline_parallel_stages: int = 1
    clip_qkv: float = None
    ...

class OLMoForCausalLM(nn.Module):
    def __init__(self, config: OLMoConfig): ...
    def embed(self, input_ids: Tensor): ...
    def get_logits(self, hidden_states: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_prefill(self, input_embeds, logit_positions, paged_kv_cache): ...
    def batch_decode(self, input_embeds, paged_kv_cache): ...
    def batch_verify(self, input_embeds, paged_kv_cache): ...
    def prefill_to_last_hidden_states(self, input_embed, paged_kv_cache): ...
    def batch_forward_to_last_hidden_states(self, input_embeds, paged_kv_cache): ...
    def batch_select_last_hidden_states(self, hidden_states, logit_positions): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...

Import

from mlc_llm.model.olmo.olmo_model import OLMoConfig, OLMoForCausalLM

I/O Contract

Primary Classes

Class | Role | Key Characteristics
OLMoConfig | Model configuration | Optional clip_qkv, configurable hidden_act, pipeline parallelism
OLMoEmbedding | Shared embedding | lm_head_forward via weight transposition (from Qwen2Embedding)
OLMoAttention | GQA attention | Fused qkv_proj with optional clip_qkv clamping
OLMoFFN | Gated feed-forward | Configurable activation via ACT2FN, gate_up_proj + down_proj
OLMoDecoderLayer | Transformer block | LayerNorm without affine parameters (elementwise_affine=False)
OLMoModel | Core model | embed_tokens + layers with pipeline partition + LayerNorm
OLMoForCausalLM | Top-level model | Full inference suite with disaggregated methods, tied embedding support
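The weight sharing performed by OLMoEmbedding can be pictured with a small NumPy sketch: the same [vocab_size, hidden_size] matrix is used for token lookup and, transposed, as the LM head. The class name TiedEmbedding is hypothetical and the code is not the Relax implementation.

```python
import numpy as np


class TiedEmbedding:
    """Hypothetical NumPy stand-in for an embedding shared with the LM head."""

    def __init__(self, weight: np.ndarray):
        # weight has shape [vocab_size, hidden_size]
        self.weight = weight

    def forward(self, input_ids: np.ndarray) -> np.ndarray:
        # Token lookup: [seq_len] -> [seq_len, hidden_size]
        return self.weight[input_ids]

    def lm_head_forward(self, hidden_states: np.ndarray) -> np.ndarray:
        # Reuse the same matrix, transposed: [..., hidden_size] -> [..., vocab_size]
        return hidden_states @ self.weight.T
```

Because both directions read the same array, tying removes the need to store a separate lm_head weight.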

Forward Methods

Method | Input | Output
embed | Tensor[seq_len] (int32) | Tensor[seq_len, hidden_size]
get_logits | Tensor[seq_len, hidden_size] | Tensor[seq_len, vocab_size] (float32)
prefill | Tensor[1, seq_len, hidden_size], PagedKVCache | (Tensor[1, 1, vocab_size], PagedKVCache)
decode | Tensor[1, 1, hidden_size], PagedKVCache | (Tensor[1, 1, vocab_size], PagedKVCache)
batch_prefill | Tensor[1, seq_len, hidden_size], Tensor[batch_size], PagedKVCache | (Tensor, PagedKVCache)
batch_decode | Tensor[batch_size, 1, hidden_size], PagedKVCache | (Tensor, PagedKVCache)
prefill_to_last_hidden_states | Tensor[1, seq_len, hidden_size], PagedKVCache | (Tensor[1, seq_len, hidden_size], PagedKVCache)
batch_select_last_hidden_states | Tensor[seq_len, hidden_size], Tensor[batch_size] | Tensor[batch_size, hidden_size]
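The shape contract of batch_select_last_hidden_states can be reproduced in NumPy as a row gather: given hidden states for a concatenated batch and one position per sequence, it keeps one row per sequence. The function below is a hypothetical stand-in, and the choice of positions (the last token of each sequence) is an assumption about typical use.

```python
import numpy as np


def select_last_hidden_states(
    hidden_states: np.ndarray, logit_positions: np.ndarray
) -> np.ndarray:
    """Gather rows: [seq_len, hidden_size], [batch_size] -> [batch_size, hidden_size]."""
    return np.take(hidden_states, logit_positions, axis=0)


# Three sequences of lengths 3, 2 and 4 packed into one seq_len = 9 tensor.
lengths = np.array([3, 2, 4])
logit_positions = np.cumsum(lengths) - 1  # index of the last token of each sequence
hidden_states = np.arange(9 * 2, dtype=np.float32).reshape(9, 2)
selected = select_last_hidden_states(hidden_states, logit_positions)
```

Only the selected rows then need to go through the LM head, which is why the selection happens before get_logits in a batched prefill.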

QKV Clamping

When clip_qkv is set in the configuration, the attention module applies element-wise clamping to the QKV projection output:

# In OLMoAttention.forward():
qkv = self.qkv_proj(hidden_states)
if self.clip_qkv is not None:
    qkv = qkv.maximum(-self.clip_qkv).minimum(self.clip_qkv)
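The maximum/minimum chain above is an ordinary symmetric clamp. The following NumPy sketch mirrors it and is meant only to show the numerical effect; the function name clamp_qkv is hypothetical.

```python
from typing import Optional

import numpy as np


def clamp_qkv(qkv: np.ndarray, clip_qkv: Optional[float]) -> np.ndarray:
    """Clamp every element to [-clip_qkv, clip_qkv]; no-op when clip_qkv is None."""
    if clip_qkv is None:
        return qkv
    # Same order as the snippet above: lower bound first, then upper bound.
    return np.minimum(np.maximum(qkv, -clip_qkv), clip_qkv)
```

For a positive clip_qkv this matches np.clip(qkv, -clip_qkv, clip_qkv): values inside the range pass through unchanged, and values outside it saturate at the nearest bound.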

Key Differences from Llama

Feature | Llama | OLMo
Normalization | RMSNorm with learnable weight | LayerNorm without affine parameters
QKV clamping | Not supported | Optional via clip_qkv
Activation | SiLU only | Configurable (silu, gelu, relu, etc.)
LM head | Separate or tied (LlamaEmbedding) | Separate or tied (OLMoEmbedding)
Pipeline parallel | Supported | Supported (identical logic)
Disaggregation | Supported | Supported (identical logic)
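The configurable activation can be sketched as a dictionary lookup inside a gated feed-forward block. The NumPy code below is illustrative only: the table here covers a subset of the supported names, the weight layout of the fused gate_up projection (gate half first, then up half) is an assumption, and gated_ffn is a hypothetical function rather than the OLMoFFN class.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(x, 0.0)


def gelu_new(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


# Illustrative subset of an activation table keyed by hidden_act.
ACT2FN = {"silu": silu, "swish": silu, "relu": relu, "gelu_new": gelu_new}


def gated_ffn(x, w_gate_up, w_down, hidden_act="silu"):
    """Gated feed-forward: down_proj(act(gate) * up).

    w_gate_up: [hidden_size, 2 * intermediate_size] (assumed gate-then-up layout)
    w_down:    [intermediate_size, hidden_size]
    """
    act = ACT2FN[hidden_act]
    gate, up = np.split(x @ w_gate_up, 2, axis=-1)
    return (act(gate) * up) @ w_down
```

Switching hidden_act changes only the function applied to the gate half; the projections and their shapes stay the same.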

Usage Examples

from mlc_llm.model.olmo.olmo_model import OLMoConfig, OLMoForCausalLM

# Creating an OLMo config
config = OLMoConfig(
    vocab_size=50304,
    hidden_size=4096,
    num_attention_heads=32,
    num_key_value_heads=32,
    intermediate_size=11008,
    hidden_act="silu",
    num_hidden_layers=32,
    tie_word_embeddings=False,
    position_embedding_base=10000,
    context_window_size=2048,
    clip_qkv=None,  # Set to a float value to enable QKV clamping
)
model = OLMoForCausalLM(config)

# With QKV clamping enabled
config_with_clip = OLMoConfig(
    vocab_size=50304,
    hidden_size=4096,
    num_attention_heads=32,
    intermediate_size=11008,
    hidden_act="silu",
    num_hidden_layers=32,
    clip_qkv=8.0,
    context_window_size=2048,
)
