Principle:Ollama Ollama GGUF Model Conversion Olmo

From Leeroopedia
Knowledge Sources
Domains: Model Conversion, OLMo
Last Updated: 2025-02-15 00:00 GMT

Overview

OLMo (Open Language Model) conversion handles the Allen AI OLMo architecture. It transforms the model from HuggingFace SafeTensors to GGUF format, with support for sliding window attention patterns, separate Q and K normalization, post-attention and post-feedforward norms, and optional RoPE scaling configurations including YaRN.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

  • lm_head -> output
  • model.embed_tokens -> token_embd
  • model.layers -> blk
  • model.norm -> output_norm
  • self_attn.q_proj -> attn_q
  • self_attn.k_proj -> attn_k
  • self_attn.v_proj -> attn_v
  • self_attn.o_proj -> attn_output
  • self_attn.q_norm -> attn_q_norm
  • self_attn.k_norm -> attn_k_norm
  • post_attention_layernorm -> post_attention_norm
  • post_feedforward_layernorm -> post_ffw_norm
  • mlp.gate_proj -> ffn_gate
  • mlp.down_proj -> ffn_down
  • mlp.up_proj -> ffn_up
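
The mapping above can be sketched as a single string replacer. This is a minimal illustration, not the converter's actual code: tensorNameReplacer and ggufTensorName are hypothetical names, and only the old/new pairs come from the list above.

```go
package main

import "strings"

// tensorNameReplacer is a hypothetical helper holding the HuggingFace-to-GGUF
// pairs listed above. strings.NewReplacer scans the input left to right and
// never rewrites text it has already replaced.
var tensorNameReplacer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.layers", "blk",
	"model.norm", "output_norm",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_output",
	"self_attn.q_norm", "attn_q_norm",
	"self_attn.k_norm", "attn_k_norm",
	"post_attention_layernorm", "post_attention_norm",
	"post_feedforward_layernorm", "post_ffw_norm",
	"mlp.gate_proj", "ffn_gate",
	"mlp.down_proj", "ffn_down",
	"mlp.up_proj", "ffn_up",
)

// ggufTensorName rewrites one HuggingFace tensor name into its GGUF form.
// Parts of the name not covered by the mapping (layer index, ".weight")
// pass through unchanged.
func ggufTensorName(hfName string) string {
	return tensorNameReplacer.Replace(hfName)
}
```

For example, ggufTensorName("model.layers.0.self_attn.q_proj.weight") yields "blk.0.attn_q.weight".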

Architecture-Specific Hyperparameters

The GGUF metadata is written under the olmo3.* namespace:

  • olmo3.block_count -- number of hidden layers
  • olmo3.context_length -- max position embeddings
  • olmo3.embedding_length -- hidden size
  • olmo3.feed_forward_length -- intermediate size
  • olmo3.attention.head_count -- attention heads
  • olmo3.attention.head_count_kv -- KV heads (defaults to num_attention_heads if not specified)
  • olmo3.rope.freq_base -- RoPE theta
  • olmo3.rope.scaling.* -- factor, original_context_length, attn_factor, type
  • olmo3.attention.layer_norm_rms_epsilon -- RMSNorm epsilon
  • olmo3.attention.sliding_window -- sliding window size
  • olmo3.attention.sliding_window_pattern -- per-layer boolean array for hybrid attention
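
As a rough sketch of how the scalar keys above could be filled in, the following builds a key/value map from a config struct. The struct olmoConfig, its field names, and the function olmo3Metadata are assumptions made for illustration; treating a zero KV head count as "not specified" is also an assumption about how the default is detected.

```go
package main

// olmoConfig is a hypothetical subset of the model config; the field names
// are illustrative and not taken from the converter source.
type olmoConfig struct {
	NumHiddenLayers       uint32
	MaxPositionEmbeddings uint32
	HiddenSize            uint32
	IntermediateSize      uint32
	NumAttentionHeads     uint32
	NumKeyValueHeads      uint32 // assumption: zero means "not specified"
	RopeTheta             float32
	RMSNormEps            float32
	SlidingWindow         uint32
}

// olmo3Metadata maps the config onto the olmo3.* keys listed above.
func olmo3Metadata(c olmoConfig) map[string]any {
	// KV heads fall back to the attention head count when not specified.
	kvHeads := c.NumKeyValueHeads
	if kvHeads == 0 {
		kvHeads = c.NumAttentionHeads
	}
	return map[string]any{
		"olmo3.block_count":                      c.NumHiddenLayers,
		"olmo3.context_length":                   c.MaxPositionEmbeddings,
		"olmo3.embedding_length":                 c.HiddenSize,
		"olmo3.feed_forward_length":              c.IntermediateSize,
		"olmo3.attention.head_count":             c.NumAttentionHeads,
		"olmo3.attention.head_count_kv":          kvHeads,
		"olmo3.rope.freq_base":                   c.RopeTheta,
		"olmo3.attention.layer_norm_rms_epsilon": c.RMSNormEps,
		"olmo3.attention.sliding_window":         c.SlidingWindow,
	}
}
```

The rope.scaling.* and sliding_window_pattern keys are left out here because they depend on optional config fields.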

Special Handling

Sliding Window Pattern

When layer_types is present in the config, the converter builds a per-layer boolean array with one entry per layer. An entry is true when that layer's type string equals "sliding_attention", meaning sliding (local) attention, and false otherwise, meaning full (global) attention.
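
The comparison described here is small enough to sketch directly. The function name slidingWindowPattern is hypothetical; only the "sliding_attention" string and the true/false meaning come from the description above.

```go
package main

// slidingWindowPattern returns one boolean per layer: true for sliding
// (local) attention, false for anything else, which is treated as full
// (global) attention.
func slidingWindowPattern(layerTypes []string) []bool {
	pattern := make([]bool, len(layerTypes))
	for i, t := range layerTypes {
		pattern[i] = t == "sliding_attention"
	}
	return pattern
}
```

A config with three sliding layers followed by one layer of any other type would produce [true true true false].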

QK Normalization

OLMo uses separate Q and K normalization (attn_q_norm and attn_k_norm) applied to the query and key projections before attention computation.

Post-Layer Normalization

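OLMo applies normalization layers after the attention and feed-forward sublayers (post_attention_norm and post_ffw_norm). The tensor name mapping above has no entry for a pre-attention norm (for example, an input_layernorm tensor), so apart from the Q/K norms these are the only per-layer norm tensors the converter renames.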

No Q/K Repacking

Unlike Llama-family models, OLMo tensors pass through without Q/K weight permutation. The tensors are written directly to GGUF with their original weight layout.

RoPE Scaling

The converter supports optional RoPE scaling with YaRN-style parameters. The rope_scaling field is a pointer type, so it is nil when the config omits the field entirely.
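
The pointer-typed field can be illustrated with a short sketch. The struct names, JSON tags, and functions below are assumptions for illustration only; the metadata key names follow the olmo3.rope.* entries listed under Architecture-Specific Hyperparameters.

```go
package main

import "encoding/json"

// ropeScaling holds YaRN-style parameters. The JSON tags are assumed names,
// not confirmed against the converter source.
type ropeScaling struct {
	Factor                        float32 `json:"factor"`
	OriginalMaxPositionEmbeddings uint32  `json:"original_max_position_embeddings"`
	AttentionFactor               float32 `json:"attention_factor"`
	RopeType                      string  `json:"rope_type"`
}

// ropeConfig is a hypothetical slice of the model config.
type ropeConfig struct {
	RopeTheta   float32      `json:"rope_theta"`
	RopeScaling *ropeScaling `json:"rope_scaling"` // stays nil when the key is absent
}

// parseRopeConfig decodes the JSON config; a missing rope_scaling key leaves
// the pointer nil rather than producing a zero-valued struct.
func parseRopeConfig(data []byte) (ropeConfig, error) {
	var c ropeConfig
	err := json.Unmarshal(data, &c)
	return c, err
}

// ropeMetadata writes the scaling keys only when scaling is configured.
func ropeMetadata(c ropeConfig) map[string]any {
	kv := map[string]any{"olmo3.rope.freq_base": c.RopeTheta}
	if c.RopeScaling != nil {
		kv["olmo3.rope.scaling.factor"] = c.RopeScaling.Factor
		kv["olmo3.rope.scaling.original_context_length"] = c.RopeScaling.OriginalMaxPositionEmbeddings
		kv["olmo3.rope.scaling.attn_factor"] = c.RopeScaling.AttentionFactor
		kv["olmo3.rope.scaling.type"] = c.RopeScaling.RopeType
	}
	return kv
}
```

Using a pointer lets the code distinguish "no scaling configured" from "scaling configured with zero values", which a plain struct field could not do.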

Implementation Notes

The conversion is implemented in convert/convert_olmo.go via the olmoModel struct. Despite the struct being named olmoModel, the GGUF architecture identifier is olmo3, reflecting the OLMo 3 generation. The converter is straightforward with no tensor repacking, making it one of the simpler converters in the codebase.
