Principle:Ollama Ollama GGUF Model Conversion Llama4

From Leeroopedia
Knowledge Sources
Domains: Model Conversion, Llama
Last Updated: 2025-02-15 00:00 GMT

Overview

Llama 4 conversion transforms Meta's fourth-generation Llama architecture, as a complete multimodal model, from HuggingFace SafeTensors to GGUF format. The architecture features chunked attention, interleaved Mixture-of-Experts layers with fused gate-up expert projections and shared experts, iRoPE (interleaved RoPE), QK normalization, and a vision encoder.

Core Concepts

Tensor Name Mapping

The converter inherits the base Llama replacements and adds:

  • language_model. -> (stripped)
  • vision_model -> v
  • multi_modal_projector -> mm
  • feed_forward.{down,up,gate}_proj -> ffn_{down,up,gate}
  • shared_expert.{down,gate,up}_proj -> {down,gate,up}_shexp
  • experts.down_proj -> down_exps.weight
  • experts.gate_up_proj -> gate_up_exps.weight
  • router -> gate_inp
  • patch_embedding.linear -> patch_embedding
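
The rename table above can be sketched with Go's standard strings.NewReplacer. This is a minimal illustration, not the converter's actual code: the function name renameTensor and the sample tensor names are assumptions, and the inherited base Llama replacements are omitted, so only the Llama 4-specific pairs from the list appear.

```go
package main

import (
	"fmt"
	"strings"
)

// llama4Replacer holds only the Llama 4-specific pairs listed above.
// The inherited base Llama replacements are left out of this sketch.
var llama4Replacer = strings.NewReplacer(
	"language_model.", "",
	"vision_model", "v",
	"multi_modal_projector", "mm",
	"feed_forward.down_proj", "ffn_down",
	"feed_forward.up_proj", "ffn_up",
	"feed_forward.gate_proj", "ffn_gate",
	"shared_expert.down_proj", "down_shexp",
	"shared_expert.gate_proj", "gate_shexp",
	"shared_expert.up_proj", "up_shexp",
	"experts.down_proj", "down_exps.weight",
	"experts.gate_up_proj", "gate_up_exps.weight",
	"router", "gate_inp",
	"patch_embedding.linear", "patch_embedding",
)

// renameTensor (hypothetical helper) applies the replacement pairs to one name.
func renameTensor(name string) string {
	return llama4Replacer.Replace(name)
}

func main() {
	// Sample names are illustrative, not taken from a real checkpoint listing.
	for _, name := range []string{
		"language_model.model.layers.0.feed_forward.down_proj.weight",
		"vision_model.patch_embedding.linear.weight",
		"multi_modal_projector.linear_1.weight",
	} {
		fmt.Printf("%s -> %s\n", name, renameTensor(name))
	}
}
```

A single Replacer scans each name once, so a replaced fragment is never rewritten again by a later pair.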

Architecture-Specific Hyperparameters

The GGUF metadata is written under the llama4.* namespace, inheriting Llama base parameters plus:

  • llama4.feed_forward_length -- dense MLP intermediate size
  • llama4.expert_feed_forward_length -- expert intermediate size (different from dense)
  • llama4.expert_count -- number of local experts
  • llama4.expert_used_count -- experts per token
  • llama4.interleave_moe_layer_step -- MoE layer interleaving frequency
  • llama4.use_qk_norm -- whether QK normalization is enabled
  • llama4.attention.chunk_size -- chunked attention window size

Vision:

  • llama4.vision.block_count
  • llama4.vision.embedding_length
  • llama4.vision.feed_forward_length
  • llama4.vision.attention.head_count
  • llama4.vision.image_size
  • llama4.vision.patch_size
  • llama4.vision.rope.freq_base
  • llama4.vision.layer_norm_epsilon
  • llama4.vision.pixel_shuffle_ratio -- downsampling ratio for vision features

Special Handling

Fused Gate-Up Expert Transposition

Expert gate_up_proj tensors arrive with shape [experts, hidden_size, intermediate_size * 2]. The converter splits along the last dimension into gate and up halves, then transposes dimensions 1 and 2 for each half to produce [experts, intermediate_size, hidden_size].

Down Expert Transposition

Expert down_proj tensors similarly require a dimension swap from [experts, intermediate_size, hidden_size] to [experts, hidden_size, intermediate_size].

Vision/MM Tensor Pass-Through

Vision (v.) and multimodal projector (mm.) tensors are passed through unchanged, with no repacking. Text tensors go through the inherited Llama tensor handler with repacking disabled (skipRepack = true), because Llama 4 uses iRoPE, which does not require the standard Llama Q/K repacking.

Inherited Llama Text Processing

Text tensors (excluding vision and expert tensors) are processed through the base llamaModel.Tensors() with skipRepack set to true to avoid unnecessary Q/K weight permutation.
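
The routing described in this section can be summarized as a small classifier. This is a sketch only: the route type, the classify function, and the sample tensor names are hypothetical, and the real converter may test names differently.

```go
package main

import (
	"fmt"
	"strings"
)

// route names the handling a tensor receives (labels are illustrative).
type route string

const (
	passThrough    route = "pass-through"
	splitTranspose route = "split and transpose"
	transposeOnly  route = "transpose"
	baseLlama      route = "base Llama handler, skipRepack=true"
)

// classify (hypothetical helper) picks a route from a converted tensor name.
func classify(name string) route {
	switch {
	case strings.HasPrefix(name, "v."), strings.HasPrefix(name, "mm."):
		return passThrough // vision and projector tensors are not repacked
	case strings.HasSuffix(name, "gate_up_exps.weight"):
		return splitTranspose // fused gate-up expert projection
	case strings.HasSuffix(name, "down_exps.weight"):
		return transposeOnly // down expert projection
	default:
		return baseLlama // remaining text tensors
	}
}

func main() {
	// Sample names are assumptions chosen to exercise each branch.
	for _, name := range []string{
		"v.patch_embedding.weight",
		"mm.linear_1.weight",
		"blk.0.ffn_gate_up_exps.weight",
		"blk.0.ffn_down_exps.weight",
		"blk.0.attn_q.weight",
	} {
		fmt.Printf("%-32s %s\n", name, classify(name))
	}
}
```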

Implementation Notes

The conversion is implemented in convert/convert_llama4.go via the llama4Model struct, which embeds the base llamaModel in its TextModel field. The repack method creates closures that perform slice-then-transpose operations using the tensor library. KV metadata is generated by first calling the base Llama KV generator and then replacing the llama. namespace prefix with llama4. on each key.
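
The namespace rewrite can be sketched as a pass over a metadata map. This is a minimal illustration, not the code in convert_llama4.go: the function name rewriteNamespace is hypothetical, the values are placeholders, and keeping keys that lack the llama. prefix unchanged is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteNamespace returns a copy of base in which every key starting with
// "llama." starts with "llama4." instead. Other keys are copied unchanged
// (an assumption of this sketch).
func rewriteNamespace(base map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		if rest, ok := strings.CutPrefix(k, "llama."); ok {
			out["llama4."+rest] = v
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Placeholder metadata standing in for the base Llama KV output.
	base := map[string]any{
		"llama.block_count":          1,
		"llama.attention.head_count": 2,
		"general.name":               "example",
	}
	kv := rewriteNamespace(base)
	// Llama 4-specific keys would be added after the rewrite; 0 is a placeholder.
	kv["llama4.expert_count"] = 0
	fmt.Println(kv)
}
```

Matching on the prefix only, rather than replacing every occurrence of the substring, keeps a key such as general.name intact even if its text happened to contain llama. elsewhere.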
