Implementation:Mlc ai Mlc llm OLMo Model
| Knowledge Sources | |
|---|---|
| Domains | Model_Architecture, LLM |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Implements the OLMo (Open Language Model) architecture for causal language modeling within the MLC LLM framework, featuring LayerNorm without affine parameters, optional QKV clamping, pipeline parallelism, and disaggregated inference support.
Description
This module provides the TVM Relax-based implementation of AI2's OLMo model architecture. While structurally similar to Llama, OLMo has several distinctive characteristics that differentiate it:
- LayerNorm without affine parameters: Unlike Llama's RMSNorm with learnable weights, OLMo uses standard LayerNorm with elementwise_affine=False, meaning no learnable scale or bias in the normalization layers.
- QKV clamping (clip_qkv): An optional feature that clamps the QKV projection outputs to the range [-clip_qkv, clip_qkv] after the linear projection, providing training stability for certain model variants.
- Configurable activation functions: Supports multiple activation functions (silu, gelu, relu, swish, gelu_new) via the ACT2FN mapping dictionary, selected through hidden_act.
- Tied word embeddings: Uses a custom OLMoEmbedding class that supports weight transposition, so the embedding weight can be shared with the LM head via lm_head_forward.
- Pipeline parallelism: Supports partitioning layers across pipeline stages with pipeline_parallel_stages and automatic boundary insertion via op_ext.pipeline_stage_boundary.
- Disaggregated inference: Provides dedicated methods for extracting last hidden states (prefill_to_last_hidden_states, batch_forward_to_last_hidden_states, batch_select_last_hidden_states, etc.), enabling disaggregated prefill/decode workflows.
- RoPE with optional scaling: Supports standard RoPE, with optional scaling configured through the rope_scaling parameter.
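To make the normalization difference concrete, here is a minimal pure-Python sketch (not the MLC LLM code) that puts LayerNorm without affine parameters next to an RMSNorm with a learnable weight. The function names and the eps default are illustrative assumptions.

```python
import math
from typing import List


def layer_norm_no_affine(x: List[float], eps: float = 1e-5) -> List[float]:
    """LayerNorm with elementwise_affine=False: center, then scale to unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-5) -> List[float]:
    """RMSNorm: no mean subtraction, but a learnable per-channel weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


hidden = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm_no_affine(hidden)  # zero mean, no parameters involved
rms_out = rms_norm(hidden, weight=[1.0] * len(hidden))  # mean is not removed
```

The LayerNorm variant has nothing to load from a checkpoint, whereas the RMSNorm variant carries one weight vector per normalization layer.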
The model consists of OLMoModel (embedding + decoder layers with pipeline partitioning + final LayerNorm), wrapped by OLMoForCausalLM which adds the LM head and the full suite of inference methods.
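The pipeline partitioning mentioned above can be pictured with a small sketch. It assumes a ceil-division split of layers across stages, with a boundary placed before the first layer of every stage except the first. The helper names are hypothetical, and the actual partition logic in the source file may differ.

```python
from typing import List


def partition_layers(num_hidden_layers: int, pipeline_parallel_stages: int) -> List[int]:
    """Return the first layer index of each stage, followed by an end marker."""
    # Assumption: each stage gets ceil(num_layers / num_stages) layers.
    per_stage = (num_hidden_layers + pipeline_parallel_stages - 1) // pipeline_parallel_stages
    starts = [i * per_stage for i in range(pipeline_parallel_stages)]
    return starts + [num_hidden_layers]


def boundary_layers(partition: List[int]) -> List[int]:
    """Layers before which a stage boundary would be inserted (never before layer 0)."""
    return partition[1:-1]


partition = partition_layers(num_hidden_layers=32, pipeline_parallel_stages=4)
boundaries = boundary_layers(partition)
```

With a single stage the boundary list is empty, so the model runs as one unpartitioned sequence of layers.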
Usage
Use this module when compiling OLMo family models (OLMo 1B, 7B, 65B, etc.) for deployment with MLC LLM. The model is identified by the olmo model type in configuration files.
Code Reference
Source Location
- Repository: Mlc_ai_Mlc_llm
- File: python/mlc_llm/model/olmo/olmo_model.py
Signature
@dataclasses.dataclass
class OLMoConfig(ConfigBase):
    vocab_size: int = None
    hidden_size: int = None
    num_attention_heads: int = None
    num_key_value_heads: int = 0
    head_dim: int = 0
    position_embedding_base: int = 0
    rope_scaling: Optional[Dict[str, Any]] = None
    intermediate_size: int = None
    hidden_act: str = None
    num_hidden_layers: int = None
    tie_word_embeddings: bool = False
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    tensor_parallel_shards: int = 1
    pipeline_parallel_stages: int = 1
    clip_qkv: float = None
    ...

class OLMoForCausalLM(nn.Module):
    def __init__(self, config: OLMoConfig): ...
    def embed(self, input_ids: Tensor): ...
    def get_logits(self, hidden_states: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_prefill(self, input_embeds, logit_positions, paged_kv_cache): ...
    def batch_decode(self, input_embeds, paged_kv_cache): ...
    def batch_verify(self, input_embeds, paged_kv_cache): ...
    def prefill_to_last_hidden_states(self, input_embed, paged_kv_cache): ...
    def batch_forward_to_last_hidden_states(self, input_embeds, paged_kv_cache): ...
    def batch_select_last_hidden_states(self, hidden_states, logit_positions): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...
Import
from mlc_llm.model.olmo.olmo_model import OLMoConfig, OLMoForCausalLM
I/O Contract
Primary Classes
| Class | Role | Key Characteristics |
|---|---|---|
| OLMoConfig | Model configuration | Optional clip_qkv, configurable hidden_act, pipeline parallelism |
| OLMoEmbedding | Shared embedding | lm_head_forward via weight transposition (from Qwen2Embedding) |
| OLMoAttention | GQA attention | Fused qkv_proj with optional clip_qkv clamping |
| OLMoFFN | Gated feed-forward | Configurable activation via ACT2FN, gate_up_proj + down_proj |
| OLMoDecoderLayer | Transformer block | LayerNorm without affine parameters (elementwise_affine=False) |
| OLMoModel | Core model | embed_tokens + layers with pipeline partition + LayerNorm |
| OLMoForCausalLM | Top-level model | Full inference suite with disaggregated methods, tied embedding support |
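The tied-embedding role of OLMoEmbedding can be illustrated with plain lists. This is a conceptual sketch rather than the TVM Relax implementation: the same [vocab_size, hidden_size] weight serves as the lookup table and, transposed, as the LM head. The free functions and the tiny weight matrix are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def embed(weight: Matrix, token_ids: List[int]) -> Matrix:
    """Embedding lookup: one row of the [vocab_size, hidden_size] weight per token."""
    return [weight[t] for t in token_ids]


def lm_head_forward(weight: Matrix, hidden_states: Matrix) -> Matrix:
    """Tied LM head: logits = hidden_states @ weight.T, reusing the embedding weight."""
    return [
        [sum(h * w for h, w in zip(hidden, row)) for row in weight]
        for hidden in hidden_states
    ]


weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocab_size=3, hidden_size=2
hidden_states = embed(weight, [2, 0])
logits = lm_head_forward(weight, hidden_states)  # shape [2, vocab_size]
```

Sharing the weight this way removes the need for a separate lm_head parameter when tie_word_embeddings is enabled.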
Forward Methods
| Method | Input | Output |
|---|---|---|
| embed | Tensor[seq_len] (int32) | Tensor[seq_len, hidden_size] |
| get_logits | Tensor[seq_len, hidden_size] | Tensor[seq_len, vocab_size] (float32) |
| prefill | Tensor[1, seq_len, hidden_size], PagedKVCache | (Tensor[1, 1, vocab_size], PagedKVCache) |
| decode | Tensor[1, 1, hidden_size], PagedKVCache | (Tensor[1, 1, vocab_size], PagedKVCache) |
| batch_prefill | Tensor[1, seq_len, hidden_size], Tensor[batch_size], PagedKVCache | (Tensor, PagedKVCache) |
| batch_decode | Tensor[batch_size, 1, hidden_size], PagedKVCache | (Tensor, PagedKVCache) |
| prefill_to_last_hidden_states | Tensor[1, seq_len, hidden_size], PagedKVCache | (Tensor[1, seq_len, hidden_size], PagedKVCache) |
| batch_select_last_hidden_states | Tensor[seq_len, hidden_size], Tensor[batch_size] | Tensor[batch_size, hidden_size] |
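The shape contract of batch_select_last_hidden_states (seq_len rows in, batch_size rows out) amounts to a row gather. The sketch below assumes that several sequences are packed along the seq_len axis and that logit_positions holds the index of each sequence's last token. The function name and sample values are hypothetical.

```python
from typing import List


def select_last_hidden_states(
    hidden_states: List[List[float]], logit_positions: List[int]
) -> List[List[float]]:
    """Gather one hidden-state row per sequence: [seq_len, hidden] -> [batch, hidden]."""
    return [hidden_states[pos] for pos in logit_positions]


# Two sequences of lengths 3 and 2 packed into seq_len = 5, hidden_size = 2.
packed = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
logit_positions = [2, 4]  # last token of each packed sequence
selected = select_last_hidden_states(packed, logit_positions)
```

Only the selected rows need to be passed on to get_logits, which keeps the vocabulary projection proportional to batch_size rather than seq_len.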
QKV Clamping
When clip_qkv is set in the configuration, the attention module applies element-wise clamping to the QKV projection output:
# In OLMoAttention.forward():
qkv = self.qkv_proj(hidden_states)
if self.clip_qkv is not None:
    qkv = qkv.maximum(-self.clip_qkv).minimum(self.clip_qkv)
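The maximum-then-minimum chain is an ordinary symmetric clamp. A minimal pure-Python sketch of the same arithmetic on a flat list of values follows; clamp_qkv is a hypothetical helper, not part of the module.

```python
from typing import List, Optional


def clamp_qkv(qkv: List[float], clip_qkv: Optional[float]) -> List[float]:
    """Clamp each value to [-clip_qkv, clip_qkv]; pass through when clip_qkv is None."""
    if clip_qkv is None:
        return list(qkv)
    # max(...) applies the lower bound first, min(...) then applies the upper bound.
    return [min(max(v, -clip_qkv), clip_qkv) for v in qkv]


raw = [-10.0, -1.0, 0.5, 9.0]
clamped = clamp_qkv(raw, clip_qkv=8.0)
unclamped = clamp_qkv(raw, clip_qkv=None)
```

Values already inside the range are left unchanged, so the clamp only affects outliers in the projection output.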
Key Differences from Llama
| Feature | Llama | OLMo |
|---|---|---|
| Normalization | RMSNorm with learnable weight | LayerNorm without affine parameters |
| QKV clamping | Not supported | Optional via clip_qkv |
| Activation | SiLU only | Configurable (silu, gelu, relu, etc.) |
| LM head | Separate or tied (LlamaEmbedding) | Separate or tied (OLMoEmbedding) |
| Pipeline parallel | Supported | Supported (identical logic) |
| Disaggregation | Supported | Supported (identical logic) |
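The configurable activation row can be pictured as a name-to-function lookup feeding a gated feed-forward step. The scalar sketch below is an assumption about the general pattern and does not reproduce the module's ACT2FN. It treats swish as an alias of silu and gelu_new as the tanh approximation of GELU, and it omits the gate_up_proj and down_proj linear layers.

```python
import math
from typing import Callable, Dict


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    # Exact GELU using the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_new(x: float) -> float:
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


def relu(x: float) -> float:
    return max(0.0, x)


# Hypothetical stand-in for the module's ACT2FN mapping.
ACT2FN_SKETCH: Dict[str, Callable[[float], float]] = {
    "silu": silu,
    "swish": silu,
    "gelu": gelu,
    "gelu_new": gelu_new,
    "relu": relu,
}


def gated_ffn_scalar(gate: float, up: float, hidden_act: str) -> float:
    """Gated feed-forward core on scalars: act(gate) * up."""
    return ACT2FN_SKETCH[hidden_act](gate) * up


example = gated_ffn_scalar(gate=1.0, up=2.0, hidden_act="relu")
```

Looking up the function by the hidden_act string keeps the feed-forward code identical across model variants that differ only in activation.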
Usage Examples
# Creating an OLMo config
config = OLMoConfig(
    vocab_size=50304,
    hidden_size=4096,
    num_attention_heads=32,
    num_key_value_heads=32,
    intermediate_size=11008,
    hidden_act="silu",
    num_hidden_layers=32,
    tie_word_embeddings=False,
    position_embedding_base=10000,
    context_window_size=2048,
    clip_qkv=None,  # Set to a float value to enable QKV clamping
)
model = OLMoForCausalLM(config)

# With QKV clamping enabled
config_with_clip = OLMoConfig(
    vocab_size=50304,
    hidden_size=4096,
    num_attention_heads=32,
    intermediate_size=11008,
    hidden_act="silu",
    num_hidden_layers=32,
    clip_qkv=8.0,
    context_window_size=2048,
)
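The second configuration omits num_key_value_heads and head_dim, both of which default to 0 in the signature. The sketch below shows how such values could be derived. It assumes that 0 means "fall back to num_attention_heads" and "hidden_size divided by the head count" respectively, which is a common convention but is not confirmed by this page. derived_dims is a hypothetical helper.

```python
from typing import Dict


def derived_dims(
    hidden_size: int,
    num_attention_heads: int,
    num_key_value_heads: int = 0,
    head_dim: int = 0,
) -> Dict[str, int]:
    """Fill in zero-valued fields and compute the fused QKV projection width."""
    if num_key_value_heads == 0:  # assumption: 0 means plain multi-head attention
        num_key_value_heads = num_attention_heads
    if head_dim == 0:  # assumption: 0 means hidden_size split evenly across heads
        head_dim = hidden_size // num_attention_heads
    qkv_width = (num_attention_heads + 2 * num_key_value_heads) * head_dim
    return {
        "num_key_value_heads": num_key_value_heads,
        "head_dim": head_dim,
        "qkv_width": qkv_width,
    }


dims = derived_dims(hidden_size=4096, num_attention_heads=32)
```

Under these assumptions the fused qkv_proj output for the example configuration is three times the hidden size, since the query, key, and value blocks all have equal width.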