Principle:Ollama Ollama GGUF Model Conversion Olmo

Knowledge Sources	Ollama
Domains	Model Conversion, OLMo
Last Updated	2025-02-15 00:00 GMT

Overview

The OLMo (Open Language Model) converter handles the Allen AI OLMo architecture. It transforms a model from HuggingFace SafeTensors to GGUF format and supports sliding window attention patterns, separate Q and K normalization, post-attention and post-feedforward norms, and RoPE scaling configurations including YaRN.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

lm_head -> output
model.embed_tokens -> token_embd
model.layers -> blk
model.norm -> output_norm
self_attn.q_proj -> attn_q
self_attn.k_proj -> attn_k
self_attn.v_proj -> attn_v
self_attn.o_proj -> attn_output
self_attn.q_norm -> attn_q_norm
self_attn.k_norm -> attn_k_norm
post_attention_layernorm -> post_attention_norm
post_feedforward_layernorm -> post_ffw_norm
mlp.gate_proj -> ffn_gate
mlp.down_proj -> ffn_down
mlp.up_proj -> ffn_up
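
The mapping above can be sketched as a substring replacement over each tensor name. This is a minimal illustration built on Go's standard strings.NewReplacer, not the converter's actual code; the function name renameTensor and the sample tensor names are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// replacements holds old/new pairs in the order listed in the mapping above.
var replacements = []string{
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.layers", "blk",
	"model.norm", "output_norm",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_output",
	"self_attn.q_norm", "attn_q_norm",
	"self_attn.k_norm", "attn_k_norm",
	"post_attention_layernorm", "post_attention_norm",
	"post_feedforward_layernorm", "post_ffw_norm",
	"mlp.gate_proj", "ffn_gate",
	"mlp.down_proj", "ffn_down",
	"mlp.up_proj", "ffn_up",
}

// renameTensor (hypothetical helper) rewrites a HuggingFace tensor name
// into its GGUF form by applying every replacement pair in one pass.
func renameTensor(name string) string {
	return strings.NewReplacer(replacements...).Replace(name)
}

func main() {
	// Sample names are illustrative, not taken from a specific checkpoint.
	for _, name := range []string{
		"model.layers.0.self_attn.q_proj.weight",
		"model.layers.3.post_feedforward_layernorm.weight",
		"model.norm.weight",
		"lm_head.weight",
	} {
		fmt.Printf("%s -> %s\n", name, renameTensor(name))
	}
}
```

Because the layer index sits between two replaced substrings, a name such as model.layers.0.self_attn.q_proj.weight becomes blk.0.attn_q.weight without any index-specific handling.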

Architecture-Specific Hyperparameters

The GGUF metadata is written under the olmo3.* namespace:

olmo3.block_count -- number of hidden layers
olmo3.context_length -- max position embeddings
olmo3.embedding_length -- hidden size
olmo3.feed_forward_length -- intermediate size
olmo3.attention.head_count -- attention heads
olmo3.attention.head_count_kv -- KV heads (defaults to num_attention_heads if not specified)
olmo3.rope.freq_base -- RoPE theta
olmo3.rope.scaling.* -- factor, original_context_length, attn_factor, type
olmo3.attention.layer_norm_rms_epsilon -- RMSNorm epsilon
olmo3.attention.sliding_window -- sliding window size
olmo3.attention.sliding_window_pattern -- per-layer boolean array for hybrid attention
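
As a rough sketch of how these keys might be assembled, the following builds a key/value map from a parsed config and applies the head_count_kv default described above. The olmoConfig struct, its field names, and the metadata function are hypothetical; only the olmo3.* key names come from the list above. Treating zero as "not specified" is also an assumption.

```go
package main

import "fmt"

// olmoConfig is a hypothetical subset of the fields read from config.json.
type olmoConfig struct {
	NumHiddenLayers       uint32
	MaxPositionEmbeddings uint32
	HiddenSize            uint32
	IntermediateSize      uint32
	NumAttentionHeads     uint32
	NumKeyValueHeads      uint32 // assumption: 0 means "not specified"
	RopeTheta             float32
	RMSNormEps            float32
	SlidingWindow         uint32
}

// metadata (hypothetical helper) maps config fields to olmo3.* keys.
func metadata(c olmoConfig) map[string]any {
	// Fall back to the attention head count when KV heads are unspecified.
	kvHeads := c.NumKeyValueHeads
	if kvHeads == 0 {
		kvHeads = c.NumAttentionHeads
	}
	return map[string]any{
		"olmo3.block_count":                      c.NumHiddenLayers,
		"olmo3.context_length":                   c.MaxPositionEmbeddings,
		"olmo3.embedding_length":                 c.HiddenSize,
		"olmo3.feed_forward_length":              c.IntermediateSize,
		"olmo3.attention.head_count":             c.NumAttentionHeads,
		"olmo3.attention.head_count_kv":          kvHeads,
		"olmo3.rope.freq_base":                   c.RopeTheta,
		"olmo3.attention.layer_norm_rms_epsilon": c.RMSNormEps,
		"olmo3.attention.sliding_window":         c.SlidingWindow,
	}
}

func main() {
	// Illustrative values only; they do not describe a real checkpoint.
	kv := metadata(olmoConfig{NumHiddenLayers: 32, NumAttentionHeads: 32})
	fmt.Println(kv["olmo3.attention.head_count_kv"])
}
```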

Special Handling

Sliding Window Pattern

When layer_types is present in the config, the converter builds a per-layer boolean array where true indicates sliding (local) attention and false indicates full (global) attention. This is determined by checking if each layer type string equals "sliding_attention".
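
The check described above can be sketched in a few lines. The function name slidingWindowPattern is hypothetical, and "full_attention" in the sample input is an assumed label for global layers; the logic only depends on whether a string equals "sliding_attention".

```go
package main

import "fmt"

// slidingWindowPattern (hypothetical helper) returns one boolean per layer:
// true for sliding (local) attention, false for anything else.
func slidingWindowPattern(layerTypes []string) []bool {
	pattern := make([]bool, len(layerTypes))
	for i, t := range layerTypes {
		pattern[i] = t == "sliding_attention"
	}
	return pattern
}

func main() {
	// Assumed example: three local layers followed by one global layer.
	layers := []string{
		"sliding_attention",
		"sliding_attention",
		"sliding_attention",
		"full_attention",
	}
	fmt.Println(slidingWindowPattern(layers))
}
```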

QK Normalization

OLMo uses separate Q and K normalization (attn_q_norm and attn_k_norm) applied to the query and key projections before attention computation.
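
For readers unfamiliar with what these norm tensors do at inference time, here is a minimal sketch of RMS normalization over a single vector. It assumes the Q and K norms are RMSNorm layers using the epsilon stored in olmo3.attention.layer_norm_rms_epsilon; the function name rmsNorm and the sample values are illustrative, and the converter itself only renames and writes the weights.

```go
package main

import (
	"fmt"
	"math"
)

// rmsNorm (hypothetical helper) scales x by the reciprocal of its
// root-mean-square, then multiplies element-wise by the learned weight.
func rmsNorm(x, weight []float32, eps float32) []float32 {
	var sumSq float64
	for _, v := range x {
		sumSq += float64(v) * float64(v)
	}
	scale := 1 / math.Sqrt(sumSq/float64(len(x))+float64(eps))

	out := make([]float32, len(x))
	for i, v := range x {
		out[i] = float32(float64(v)*scale) * weight[i]
	}
	return out
}

func main() {
	// Illustrative query vector and weights, not real model values.
	q := []float32{3, 4}
	w := []float32{1, 1}
	fmt.Println(rmsNorm(q, w, 1e-6))
}
```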

Post-Layer Normalization

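OLMo applies normalization to the outputs of the attention and feedforward sublayers (post_attention_norm and post_ffw_norm). The tensor name mapping contains no pre-attention (input) norm, so these two, together with the Q and K norms, are the only per-block normalization tensors the converter renames.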

No Q/K Repacking

Unlike Llama-family models, OLMo tensors pass through without Q/K weight permutation. The tensors are written directly to GGUF with their original weight layout.

RoPE Scaling

The converter supports optional RoPE scaling with YaRN-style parameters. The rope_scaling config field is decoded into a pointer type, so it is nil when the config omits it; in that case there are no scaling parameters to read.
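
The optional field can be sketched with encoding/json, which leaves a pointer field nil when the key is missing or null. The struct names, the JSON field names (taken from common HuggingFace config conventions), and the ropeMetadata function are assumptions for illustration; only the olmo3.rope.* key names come from this page, and skipping the scaling keys on nil is the sketch's own choice.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ropeScaling is a hypothetical view of the rope_scaling config object.
type ropeScaling struct {
	Factor                        float32 `json:"factor"`
	OriginalMaxPositionEmbeddings uint32  `json:"original_max_position_embeddings"`
	AttentionFactor               float32 `json:"attention_factor"`
	RopeType                      string  `json:"rope_type"`
}

// config is a hypothetical subset of config.json.
type config struct {
	RopeTheta   float32      `json:"rope_theta"`
	RopeScaling *ropeScaling `json:"rope_scaling"` // nil when absent
}

// ropeMetadata (hypothetical helper) emits scaling keys only when
// rope_scaling was present in the config.
func ropeMetadata(raw []byte) (map[string]any, error) {
	var c config
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	kv := map[string]any{"olmo3.rope.freq_base": c.RopeTheta}
	if c.RopeScaling != nil {
		kv["olmo3.rope.scaling.factor"] = c.RopeScaling.Factor
		kv["olmo3.rope.scaling.original_context_length"] = c.RopeScaling.OriginalMaxPositionEmbeddings
		kv["olmo3.rope.scaling.attn_factor"] = c.RopeScaling.AttentionFactor
		kv["olmo3.rope.scaling.type"] = c.RopeScaling.RopeType
	}
	return kv, nil
}

func main() {
	// Illustrative config without rope_scaling: only freq_base is emitted.
	kv, err := ropeMetadata([]byte(`{"rope_theta": 500000}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(kv), "key(s) written")
}
```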

Implementation Notes

The conversion is implemented in convert/convert_olmo.go via the olmoModel struct. Despite the struct being named olmoModel, the GGUF architecture identifier is olmo3, reflecting the OLMo 3 generation. The converter is straightforward with no tensor repacking, making it one of the simpler converters in the codebase.

Related Pages

Implementation:Ollama_Ollama_Convert_Olmo
