Principle:Ollama Ollama GGUF Model Conversion Llama4
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, Llama |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Llama 4 conversion handles Meta's fourth-generation Llama architecture, which features chunked attention, interleaved Mixture-of-Experts (MoE) layers with fused gate-up expert projections and shared experts, iRoPE (interleaved RoPE), QK normalization, and a vision encoder. It transforms the complete multimodal model from HuggingFace SafeTensors to GGUF format.
Core Concepts
Tensor Name Mapping
The converter inherits the base Llama replacements and adds:
- language_model. -> (stripped)
- vision_model -> v
- multi_modal_projector -> mm
- feed_forward.{down,up,gate}_proj -> ffn_{down,up,gate}
- shared_expert.{down,gate,up}_proj -> {down,gate,up}_shexp
- experts.down_proj -> down_exps.weight
- experts.gate_up_proj -> gate_up_exps.weight
- router -> gate_inp
- patch_embedding.linear -> patch_embedding
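The mapping can be illustrated with a minimal Go sketch. It assumes the pairs are applied as ordered substring substitutions via the standard library's strings.NewReplacer; the helper name renameTensor, the variable llama4Replacements, and the sample tensor names are hypothetical. Only the pairs listed in this section are included, so the inherited base Llama replacements are not applied and the outputs are partial names.

```go
package main

import (
	"fmt"
	"strings"
)

// llama4Replacements holds only the old/new pairs named in this section.
// The inherited base Llama pairs are deliberately left out of this sketch.
var llama4Replacements = []string{
	"language_model.", "",
	"vision_model", "v",
	"multi_modal_projector", "mm",
	"feed_forward.down_proj", "ffn_down",
	"feed_forward.up_proj", "ffn_up",
	"feed_forward.gate_proj", "ffn_gate",
	"shared_expert.down_proj", "down_shexp",
	"shared_expert.gate_proj", "gate_shexp",
	"shared_expert.up_proj", "up_shexp",
	"experts.down_proj", "down_exps.weight",
	"experts.gate_up_proj", "gate_up_exps.weight",
	"router", "gate_inp",
	"patch_embedding.linear", "patch_embedding",
}

// renameTensor (hypothetical helper) rewrites a source tensor name in a
// single pass over the string, using the pairs above.
func renameTensor(name string) string {
	return strings.NewReplacer(llama4Replacements...).Replace(name)
}

func main() {
	// Sample names are illustrative, not taken from a real checkpoint listing.
	for _, n := range []string{
		"vision_model.patch_embedding.linear.weight",
		"language_model.model.layers.0.feed_forward.down_proj.weight",
		"multi_modal_projector.linear_1.weight",
	} {
		fmt.Printf("%s -> %s\n", n, renameTensor(n))
	}
}
```

A single-pass replacer never rescans text it has already substituted, so the output of one pair cannot be rewritten again by a later pair.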
Architecture-Specific Hyperparameters
The GGUF metadata is written under the llama4.* namespace, inheriting Llama base parameters plus:
- llama4.feed_forward_length -- dense MLP intermediate size
- llama4.expert_feed_forward_length -- expert intermediate size (different from dense)
- llama4.expert_count -- number of local experts
- llama4.expert_used_count -- experts per token
- llama4.interleave_moe_layer_step -- MoE layer interleaving frequency
- llama4.use_qk_norm -- whether QK normalization is enabled
- llama4.attention.chunk_size -- chunked attention window size
Vision:
- llama4.vision.block_count, embedding_length, feed_forward_length
- llama4.vision.attention.head_count, image_size, patch_size
- llama4.vision.rope.freq_base, layer_norm_epsilon
- llama4.vision.pixel_shuffle_ratio -- downsampling ratio for vision features
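To make the interleaving step concrete, here is a small Go sketch of how a consumer might interpret llama4.interleave_moe_layer_step. The rule it uses is an assumption that this page does not confirm: every step-th layer, counted from 1, is treated as an MoE layer. The function name isMoELayer and the values in main are hypothetical.

```go
package main

import "fmt"

// isMoELayer reports whether 0-based layer i uses experts.
// ASSUMPTION: layer i is MoE when (i+1) is a multiple of step, so step=1
// makes every layer MoE and step=2 makes every second layer MoE.
func isMoELayer(i, step int) bool {
	return step > 0 && (i+1)%step == 0
}

func main() {
	const blockCount, step = 8, 2 // illustrative values only
	for i := 0; i < blockCount; i++ {
		fmt.Printf("layer %d: moe=%v\n", i, isMoELayer(i, step))
	}
}
```

Under this assumption, layers that are not MoE would use the dense size from llama4.feed_forward_length, while MoE layers would use llama4.expert_feed_forward_length.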
Special Handling
Fused Gate-Up Expert Transposition
Expert gate_up_proj tensors arrive with shape [experts, hidden_size, intermediate_size * 2]. The converter splits along the last dimension into gate and up halves, then transposes dimensions 1 and 2 for each half to produce [experts, intermediate_size, hidden_size].
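The split-then-transpose step can be sketched on a flat buffer in plain Go. This is an illustration rather than the converter's code: it assumes row-major storage and that the gate half precedes the up half along the last dimension. The function name splitGateUp is hypothetical.

```go
package main

import "fmt"

// splitGateUp takes a row-major [experts, hidden, 2*inter] buffer and returns
// gate and up buffers, each laid out row-major as [experts, inter, hidden].
// ASSUMPTION: the first inter columns are gate, the last inter columns are up.
func splitGateUp(fused []float32, experts, hidden, inter int) (gate, up []float32) {
	gate = make([]float32, experts*inter*hidden)
	up = make([]float32, experts*inter*hidden)
	for e := 0; e < experts; e++ {
		for h := 0; h < hidden; h++ {
			for i := 0; i < inter; i++ {
				src := (e*hidden+h)*2*inter + i // fused[e][h][i]
				dst := (e*inter+i)*hidden + h   // out[e][i][h]
				gate[dst] = fused[src]
				up[dst] = fused[src+inter] // fused[e][h][inter+i]
			}
		}
	}
	return gate, up
}

func main() {
	// One expert, hidden_size=2, intermediate_size=2.
	fused := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	gate, up := splitGateUp(fused, 1, 2, 2)
	fmt.Println(gate, up)
}
```

Each output holds half of the fused elements, so the two resulting tensors together occupy the same space as the original.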
Down Expert Transposition
Expert down_proj tensors similarly require a dimension swap from [experts, intermediate_size, hidden_size] to [experts, hidden_size, intermediate_size].
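The dimension swap itself is a batched matrix transpose. A minimal Go sketch, assuming row-major storage and using the hypothetical name swapDims12:

```go
package main

import "fmt"

// swapDims12 converts a row-major [d0, d1, d2] buffer into a row-major
// [d0, d2, d1] buffer, i.e. it transposes each of the d0 matrices.
func swapDims12(src []float32, d0, d1, d2 int) []float32 {
	dst := make([]float32, len(src))
	for a := 0; a < d0; a++ {
		for b := 0; b < d1; b++ {
			for c := 0; c < d2; c++ {
				dst[(a*d2+c)*d1+b] = src[(a*d1+b)*d2+c]
			}
		}
	}
	return dst
}

func main() {
	// One expert, intermediate_size=2, hidden_size=3.
	down := []float32{1, 2, 3, 4, 5, 6}
	fmt.Println(swapDims12(down, 1, 2, 3))
}
```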
Vision/MM Tensor Pass-Through
Vision (v.) and multimodal projector (mm.) tensors are passed through without repacking.
Inherited Llama Text Processing
The remaining text tensors (everything except vision, projector, and expert tensors) are processed through the base llamaModel.Tensors() with repacking disabled (skipRepack = true). Llama 4 uses iRoPE, which does not require the standard Llama Q/K weight permutation.
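Taken together, the special-handling rules amount to routing each converted tensor name to one of four paths. The following Go sketch shows one way to express that dispatch. The function name classify and the returned labels are hypothetical, and matching expert tensors by the substrings gate_up_exps and down_exps is an assumption based on the name mapping above.

```go
package main

import (
	"fmt"
	"strings"
)

// classify (hypothetical helper) decides how a converted tensor name is handled.
func classify(name string) string {
	switch {
	case strings.HasPrefix(name, "v."), strings.HasPrefix(name, "mm."):
		return "pass-through" // vision and projector tensors are copied as-is
	case strings.Contains(name, "gate_up_exps"):
		return "split-and-transpose" // fused expert gate/up projection
	case strings.Contains(name, "down_exps"):
		return "transpose" // expert down projection
	default:
		return "text" // handed to the base Llama handler, skipRepack = true
	}
}

func main() {
	// Sample names are illustrative only.
	for _, n := range []string{
		"v.patch_embedding.weight",
		"mm.linear_1.weight",
		"blk.0.ffn_gate_up_exps.weight",
		"blk.0.ffn_down_exps.weight",
		"blk.0.attn_q.weight",
	} {
		fmt.Printf("%-32s %s\n", n, classify(n))
	}
}
```

The gate_up_exps check comes before the down_exps check only for readability; the two substrings never occur in the same name under the mapping above.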
Implementation Notes
The conversion is implemented in convert/convert_llama4.go via the llama4Model struct, which embeds the base llamaModel in its TextModel field. The repack method creates closures that perform slice-then-transpose operations using the tensor library. KV metadata is generated by first calling the base Llama KV generator and then rewriting the namespace prefix from llama. to llama4..
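The namespace rewrite can be sketched as a key-renaming pass over a metadata map. This is an illustration under stated assumptions, not the converter's code: the map type, the function name renamespace, and the choice to copy keys without the llama. prefix unchanged are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// renamespace returns a copy of base in which every key that starts with
// "llama." starts with "llama4." instead.
// ASSUMPTION: keys without the prefix are copied unchanged.
func renamespace(base map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		if strings.HasPrefix(k, "llama.") {
			out["llama4."+strings.TrimPrefix(k, "llama.")] = v
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Values are placeholders, not real model hyperparameters.
	base := map[string]any{
		"llama.block_count":           4,
		"llama.attention.head_count":  8,
		"tokenizer.placeholder_field": "unchanged",
	}
	for k, v := range renamespace(base) {
		fmt.Println(k, v)
	}
}
```

Matching on the prefix rather than replacing every occurrence of "llama." avoids touching keys that merely contain that substring elsewhere.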