Principle:Ollama Ollama GGUF Model Conversion Llama4

Knowledge Sources	Ollama
Domains	Model Conversion, Llama
Last Updated	2025-02-15 00:00 GMT

Overview

Llama 4 conversion handles Meta's fourth-generation Llama architecture, which features chunked attention, interleaved Mixture-of-Experts (MoE) layers with fused gate-up expert projections, shared experts, iRoPE (interleaved RoPE), QK normalization, and a vision encoder. It transforms the complete multimodal model from HuggingFace SafeTensors to GGUF format.

Core Concepts

Tensor Name Mapping

The converter inherits the base Llama replacements and adds:

language_model. -> (stripped)
vision_model -> v
multi_modal_projector -> mm
feed_forward.{down,up,gate}_proj -> ffn_{down,up,gate}
shared_expert.{down,gate,up}_proj -> {down,gate,up}_shexp
experts.down_proj -> down_exps.weight
experts.gate_up_proj -> gate_up_exps.weight
router -> gate_inp
patch_embedding.linear -> patch_embedding
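
The mapping above can be sketched as an ordered list of old/new pairs applied in one pass. This is a minimal illustration, assuming a strings.NewReplacer-style substitution; renameTensor and llama4Replacer are hypothetical names, the inherited base Llama replacements are left out, and the sample tensor names are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// llama4Replacer holds only the Llama 4 pairs listed on this page.
// The inherited base Llama replacements are omitted from this sketch.
var llama4Replacer = strings.NewReplacer(
	"language_model.", "",
	"vision_model", "v",
	"multi_modal_projector", "mm",
	"feed_forward.down_proj", "ffn_down",
	"feed_forward.up_proj", "ffn_up",
	"feed_forward.gate_proj", "ffn_gate",
	"shared_expert.down_proj", "down_shexp",
	"shared_expert.gate_proj", "gate_shexp",
	"shared_expert.up_proj", "up_shexp",
	"experts.down_proj", "down_exps.weight",
	"experts.gate_up_proj", "gate_up_exps.weight",
	"router", "gate_inp",
	"patch_embedding.linear", "patch_embedding",
)

// renameTensor (hypothetical helper) rewrites a source tensor name
// using the pairs above in a single left-to-right pass.
func renameTensor(name string) string {
	return llama4Replacer.Replace(name)
}

func main() {
	// Illustrative input names, not taken from a real checkpoint listing.
	for _, name := range []string{
		"language_model.model.layers.0.feed_forward.down_proj.weight",
		"vision_model.patch_embedding.linear.weight",
		"multi_modal_projector.linear_1.weight",
	} {
		fmt.Println(name, "->", renameTensor(name))
	}
}
```

Because the pairs are matched in argument order at each position, the more specific feed_forward.down_proj pair is tried before any shorter pattern that could match at the same position.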

Architecture-Specific Hyperparameters

The GGUF metadata is written under the llama4.* namespace, inheriting Llama base parameters plus:

llama4.feed_forward_length -- dense MLP intermediate size
llama4.expert_feed_forward_length -- expert intermediate size (different from dense)
llama4.expert_count -- number of local experts
llama4.expert_used_count -- experts per token
llama4.interleave_moe_layer_step -- MoE layer interleaving frequency
llama4.use_qk_norm -- whether QK normalization is enabled
llama4.attention.chunk_size -- chunked attention window size

Vision:

llama4.vision.block_count
llama4.vision.embedding_length
llama4.vision.feed_forward_length
llama4.vision.attention.head_count
llama4.vision.image_size
llama4.vision.patch_size
llama4.vision.rope.freq_base
llama4.vision.layer_norm_epsilon
llama4.vision.pixel_shuffle_ratio -- downsampling ratio for vision features

Special Handling

Fused Gate-Up Expert Transposition

Expert gate_up_proj tensors arrive with shape [experts, hidden_size, intermediate_size * 2]. The converter splits along the last dimension into gate and up halves, then transposes dimensions 1 and 2 for each half to produce [experts, intermediate_size, hidden_size].
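
The split-then-transpose step can be sketched on a flat slice. This is a minimal illustration, not the converter's code: it assumes a row-major layout, assumes the first half of the last dimension is the gate projection and the second half is the up projection, and uses a hypothetical helper named splitGateUp in place of the tensor library.

```go
package main

import "fmt"

// splitGateUp (hypothetical helper) takes a row-major tensor of shape
// [experts, hidden, 2*inter], splits the last dimension into two halves,
// and writes each half transposed as [experts, inter, hidden].
// Assumption: the first half is gate and the second half is up.
func splitGateUp(data []float32, experts, hidden, inter int) (gate, up []float32) {
	gate = make([]float32, experts*inter*hidden)
	up = make([]float32, experts*inter*hidden)
	for e := 0; e < experts; e++ {
		for h := 0; h < hidden; h++ {
			for i := 0; i < inter; i++ {
				src := (e*hidden+h)*2*inter + i // index into [e][h][i] of the fused tensor
				dst := (e*inter+i)*hidden + h   // index into [e][i][h] of the output
				gate[dst] = data[src]
				up[dst] = data[src+inter]
			}
		}
	}
	return gate, up
}

func main() {
	// One expert, hidden_size 2, intermediate_size 3.
	// Each row is laid out as [g0 g1 g2 u0 u1 u2].
	data := []float32{
		1, 2, 3, 10, 20, 30,
		4, 5, 6, 40, 50, 60,
	}
	gate, up := splitGateUp(data, 1, 2, 3)
	fmt.Println(gate) // [1 4 2 5 3 6]
	fmt.Println(up)   // [10 40 20 50 30 60]
}
```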

Down Expert Transposition

Expert down_proj tensors similarly require a dimension swap from [experts, intermediate_size, hidden_size] to [experts, hidden_size, intermediate_size].

Vision/MM Tensor Pass-Through

Vision (v.) and multimodal projector (mm.) tensors are passed through without repacking. Text tensors are processed through the inherited Llama tensor handler with repacking disabled (skipRepack = true), because Llama 4 uses iRoPE, which does not require the standard Llama Q/K repacking.

Inherited Llama Text Processing

Text tensors (excluding vision and expert tensors) are processed through the base llamaModel.Tensors() with skipRepack set to true to avoid unnecessary Q/K weight permutation.
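
The routing described in this section can be summarized as a dispatch on the renamed tensor name. The sketch below is an assumption-laden illustration: route is a hypothetical function, the returned labels are invented for readability, and the blk.N name pattern in the examples is assumed rather than taken from this page.

```go
package main

import (
	"fmt"
	"strings"
)

// route (hypothetical) reports which handling path a renamed tensor takes.
func route(name string) string {
	switch {
	case strings.HasPrefix(name, "v."), strings.HasPrefix(name, "mm."):
		return "pass-through" // vision and projector tensors are left as-is
	case strings.Contains(name, "gate_up_exps"):
		return "split-and-transpose" // fused gate-up expert weights
	case strings.Contains(name, "down_exps"):
		return "transpose" // down expert weights
	default:
		return "llama-text" // inherited Llama handler, Q/K repacking skipped
	}
}

func main() {
	// Example names are illustrative assumptions.
	for _, name := range []string{
		"v.blk.0.attn_q.weight",
		"mm.linear_1.weight",
		"blk.0.ffn_gate_up_exps.weight",
		"blk.0.ffn_down_exps.weight",
		"blk.0.attn_q.weight",
	} {
		fmt.Printf("%-32s %s\n", name, route(name))
	}
}
```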

Implementation Notes

The conversion is implemented in convert/convert_llama4.go via the llama4Model struct, which embeds the base llamaModel in its TextModel field. The repack method creates closures that perform slice-then-transpose operations using the tensor library. KV metadata is generated by first calling the base Llama KV generator and then rewriting the namespace prefix from "llama." to "llama4.".
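
The namespace rewrite can be sketched with a plain map. This is a minimal illustration under stated assumptions: renamespace is a hypothetical helper, keys outside the llama. namespace are kept unchanged as a choice made for this sketch, and the values in main are illustrative rather than read from a real model.

```go
package main

import (
	"fmt"
	"strings"
)

// renamespace (hypothetical helper) copies a metadata map, moving every
// key that starts with "llama." under "llama4." instead. Keys outside
// that namespace are copied unchanged (an assumption of this sketch).
func renamespace(base map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		if rest, ok := strings.CutPrefix(k, "llama."); ok {
			out["llama4."+rest] = v
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	// Illustrative base metadata, standing in for the base Llama KV output.
	base := map[string]any{
		"general.architecture":       "llama4",
		"llama.block_count":          48,
		"llama.attention.head_count": 40,
	}
	kv := renamespace(base)

	// Llama 4 specific keys are then added on top (illustrative values).
	kv["llama4.expert_count"] = 16
	kv["llama4.expert_used_count"] = 1

	fmt.Println(kv)
}
```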

Related Pages

Implementation:Ollama_Ollama_Convert_Llama4
