Principle:Ollama Ollama GGUF Model Conversion Olmo
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, OLMo |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
OLMo (Open Language Model) conversion handles the Allen AI OLMo architecture, transforming the model from HuggingFace SafeTensors to GGUF format. It supports sliding window attention patterns, QK normalization, post-attention and post-feedforward norms, and optional RoPE scaling configurations including YaRN.
Core Concepts
Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
- lm_head -> output
- model.embed_tokens -> token_embd
- model.layers -> blk
- model.norm -> output_norm
- self_attn.q_proj -> attn_q
- self_attn.k_proj -> attn_k
- self_attn.v_proj -> attn_v
- self_attn.o_proj -> attn_output
- self_attn.q_norm -> attn_q_norm
- self_attn.k_norm -> attn_k_norm
- post_attention_layernorm -> post_attention_norm
- post_feedforward_layernorm -> post_ffw_norm
- mlp.gate_proj -> ffn_gate
- mlp.down_proj -> ffn_down
- mlp.up_proj -> ffn_up
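The mapping above can be sketched as a plain substring replacement over tensor names. This is a minimal illustration, not the converter's actual code: `olmoReplacer` and `renameTensor` are hypothetical names, and the sketch assumes the replacements are applied as simple substring substitutions using the standard library's `strings.NewReplacer`.

```go
package main

import (
	"fmt"
	"strings"
)

// olmoReplacer holds the HuggingFace-to-GGUF pairs from the list above.
// strings.NewReplacer takes alternating old/new arguments and does not
// rescan text it has already replaced.
var olmoReplacer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.layers", "blk",
	"model.norm", "output_norm",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_output",
	"self_attn.q_norm", "attn_q_norm",
	"self_attn.k_norm", "attn_k_norm",
	"post_attention_layernorm", "post_attention_norm",
	"post_feedforward_layernorm", "post_ffw_norm",
	"mlp.gate_proj", "ffn_gate",
	"mlp.down_proj", "ffn_down",
	"mlp.up_proj", "ffn_up",
)

// renameTensor is a hypothetical helper that maps one HuggingFace
// tensor name to its GGUF equivalent.
func renameTensor(name string) string {
	return olmoReplacer.Replace(name)
}

func main() {
	names := []string{
		"model.embed_tokens.weight",
		"model.layers.0.self_attn.q_proj.weight",
		"model.layers.0.post_feedforward_layernorm.weight",
		"lm_head.weight",
	}
	for _, name := range names {
		fmt.Printf("%s -> %s\n", name, renameTensor(name))
	}
}
```

Under these assumptions, `model.layers.0.self_attn.q_proj.weight` becomes `blk.0.attn_q.weight`: the layer index and the `.weight` suffix are untouched because only the listed substrings are rewritten.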
Architecture-Specific Hyperparameters
The GGUF metadata is written under the olmo3.* namespace:
- olmo3.block_count -- number of hidden layers
- olmo3.context_length -- max position embeddings
- olmo3.embedding_length -- hidden size
- olmo3.feed_forward_length -- intermediate size
- olmo3.attention.head_count -- attention heads
- olmo3.attention.head_count_kv -- KV heads (defaults to num_attention_heads if not specified)
- olmo3.rope.freq_base -- RoPE theta
- olmo3.rope.scaling.* -- factor, original_context_length, attn_factor, type
- olmo3.attention.layer_norm_rms_epsilon -- RMSNorm epsilon
- olmo3.attention.sliding_window -- sliding window size
- olmo3.attention.sliding_window_pattern -- per-layer boolean array for hybrid attention
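As a rough sketch of how these keys might be populated, the example below builds a key-value map from a small parameter struct and applies the KV-head default described above. The struct `olmoParams`, its field names, the function `buildKV`, and the choice to treat a zero value as "not specified" are all assumptions made for illustration; only the `olmo3.*` key names come from the list.

```go
package main

import "fmt"

// olmoParams is a hypothetical subset of the model hyperparameters.
// Field names loosely follow HuggingFace config conventions (assumed).
type olmoParams struct {
	NumHiddenLayers       uint32
	MaxPositionEmbeddings uint32
	HiddenSize            uint32
	IntermediateSize      uint32
	NumAttentionHeads     uint32
	NumKeyValueHeads      uint32 // 0 is treated as "not specified" (assumption)
	RopeTheta             float32
	RMSNormEps            float32
}

// buildKV returns the olmo3.* metadata for the given parameters.
func buildKV(p olmoParams) map[string]any {
	// Fall back to the attention head count when KV heads are unset.
	kvHeads := p.NumKeyValueHeads
	if kvHeads == 0 {
		kvHeads = p.NumAttentionHeads
	}
	return map[string]any{
		"olmo3.block_count":                      p.NumHiddenLayers,
		"olmo3.context_length":                   p.MaxPositionEmbeddings,
		"olmo3.embedding_length":                 p.HiddenSize,
		"olmo3.feed_forward_length":              p.IntermediateSize,
		"olmo3.attention.head_count":             p.NumAttentionHeads,
		"olmo3.attention.head_count_kv":          kvHeads,
		"olmo3.rope.freq_base":                   p.RopeTheta,
		"olmo3.attention.layer_norm_rms_epsilon": p.RMSNormEps,
	}
}

func main() {
	// No NumKeyValueHeads given, so head_count_kv mirrors head_count.
	kv := buildKV(olmoParams{NumHiddenLayers: 32, NumAttentionHeads: 32})
	fmt.Println(kv["olmo3.attention.head_count_kv"])
}
```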
Special Handling
Sliding Window Pattern
When layer_types is present in the config, the converter builds a per-layer boolean array where true indicates sliding (local) attention and false indicates full (global) attention. This is determined by checking if each layer type string equals "sliding_attention".
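The pattern construction described above can be sketched in a few lines. The function name `slidingWindowPattern` is hypothetical, and the `"full_attention"` string in the example input is only an illustrative non-sliding value; the sketch relies solely on the equality check against `"sliding_attention"`.

```go
package main

import "fmt"

// slidingWindowPattern returns one boolean per layer: true for sliding
// (local) attention, false for anything else (treated as full attention).
func slidingWindowPattern(layerTypes []string) []bool {
	pattern := make([]bool, len(layerTypes))
	for i, layerType := range layerTypes {
		pattern[i] = layerType == "sliding_attention"
	}
	return pattern
}

func main() {
	// Example input: three local layers followed by one global layer.
	layerTypes := []string{
		"sliding_attention",
		"sliding_attention",
		"sliding_attention",
		"full_attention",
	}
	fmt.Println(slidingWindowPattern(layerTypes)) // prints [true true true false]
}
```

Because the check is a strict string comparison, any unrecognized layer type falls through to false (full attention) in this sketch.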
QK Normalization
OLMo uses separate Q and K normalization (attn_q_norm and attn_k_norm) applied to the query and key projections before attention computation.
Post-Layer Normalization
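OLMo applies normalization to the outputs of the attention and feed-forward blocks (post_attention_norm and post_ffw_norm). The tensor name mapping lists no pre-attention norm such as input_layernorm, so these post-block norms are the only per-block norms the converter renames, alongside the Q/K norms and the final output_norm.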
No Q/K Repacking
Unlike Llama-family models, OLMo tensors pass through without Q/K weight permutation. The tensors are written directly to GGUF with their original weight layout.
RoPE Scaling
The converter supports optional RoPE scaling with YaRN-style parameters. Because rope_scaling may be absent from the config entirely, the field is declared as a pointer type and is nil when the config omits it.
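The optional-pointer behavior can be illustrated with Go's `encoding/json`, which leaves a pointer field nil when the key is missing or null. This is a sketch under stated assumptions: the struct names, the JSON field names (`factor`, `original_max_position_embeddings`, `attention_factor`, `rope_type`), and the helper `ropeKV` are hypothetical and chosen to resemble HuggingFace configs; only the `olmo3.rope.*` keys come from the metadata list earlier on this page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ropeScaling holds YaRN-style parameters (JSON names are assumptions).
type ropeScaling struct {
	Factor                        float32 `json:"factor"`
	OriginalMaxPositionEmbeddings uint32  `json:"original_max_position_embeddings"`
	AttentionFactor               float32 `json:"attention_factor"`
	RopeType                      string  `json:"rope_type"`
}

// config is a hypothetical subset of config.json.
type config struct {
	RopeTheta   float32      `json:"rope_theta"`
	RopeScaling *ropeScaling `json:"rope_scaling"` // nil when absent or null
}

// ropeKV emits scaling keys only when rope_scaling is present.
func ropeKV(raw []byte) (map[string]any, error) {
	var c config
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	kv := map[string]any{"olmo3.rope.freq_base": c.RopeTheta}
	if c.RopeScaling != nil {
		kv["olmo3.rope.scaling.factor"] = c.RopeScaling.Factor
		kv["olmo3.rope.scaling.original_context_length"] = c.RopeScaling.OriginalMaxPositionEmbeddings
		kv["olmo3.rope.scaling.attn_factor"] = c.RopeScaling.AttentionFactor
		kv["olmo3.rope.scaling.type"] = c.RopeScaling.RopeType
	}
	return kv, nil
}

func main() {
	plain, _ := ropeKV([]byte(`{"rope_theta": 500000}`))
	yarn, _ := ropeKV([]byte(`{"rope_theta": 500000, "rope_scaling": {"factor": 8, "rope_type": "yarn"}}`))
	fmt.Println(len(plain), len(yarn))
}
```

The nil check keeps the scaling keys out of the metadata for unscaled models rather than writing zero values.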
Implementation Notes
The conversion is implemented in convert/convert_olmo.go via the olmoModel struct. Despite the struct being named olmoModel, the GGUF architecture identifier is olmo3, reflecting the OLMo 3 generation. The converter is straightforward with no tensor repacking, making it one of the simpler converters in the codebase.