Principle:Ollama Ollama GGUF Model Conversion MLLama

From Leeroopedia
Knowledge Sources
Domains Model Conversion, Multimodal
Last Updated 2025-02-15 00:00 GMT

Overview

Meta Llama multimodal (MLLama) conversion handles a vision-language architecture that combines a Llama text backbone, cross-attention layers, and a multi-scale vision encoder (local and global transformer layers). The converter transforms the complete pipeline from HuggingFace SafeTensors to GGUF format, applying tanh transformations to gate tensors and Q/K repacking to the vision encoder.

Core Concepts

Tensor Name Mapping

The converter inherits the base Llama replacements and adds:

  • language_model. -> (stripped)
  • gate_attn -> attn_gate
  • gate_ffn -> ffn_gate
  • cross_attn. -> cross_attn_
  • vision_model -> v
  • class_embedding -> class_embd
  • patch_embedding -> patch_embd
  • gated_positional_embedding.tile_embedding -> tile_position_embd
  • gated_positional_embedding.embedding -> position_embd.weight
  • pre_tile_positional_embedding -> pre_tile_position_embd
  • post_tile_positional_embedding -> post_tile_position_embd
  • global_transformer.layers -> global.blk
  • transformer.layers -> blk
  • mlp.fc1 -> ffn_up
  • mlp.fc2 -> ffn_down
  • multi_modal_projector -> mm.0
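The substitutions above can be sketched with Go's strings.NewReplacer. This is a minimal illustration, not the converter's actual code: the function name renameTensor and the example tensor names are assumptions, and the inherited base Llama replacements are left out. Pattern order matters here, because global_transformer.layers must be tried before the shorter transformer.layers.

```go
package main

import (
	"fmt"
	"strings"
)

// mllamaReplacer holds only the MLLama-specific pairs listed above.
// At each position, strings.NewReplacer compares the old strings in
// argument order, so the more specific patterns are listed first.
var mllamaReplacer = strings.NewReplacer(
	"language_model.", "",
	"gate_attn", "attn_gate",
	"gate_ffn", "ffn_gate",
	"cross_attn.", "cross_attn_",
	"vision_model", "v",
	"class_embedding", "class_embd",
	"patch_embedding", "patch_embd",
	"gated_positional_embedding.tile_embedding", "tile_position_embd",
	"gated_positional_embedding.embedding", "position_embd.weight",
	"pre_tile_positional_embedding", "pre_tile_position_embd",
	"post_tile_positional_embedding", "post_tile_position_embd",
	"global_transformer.layers", "global.blk",
	"transformer.layers", "blk",
	"mlp.fc1", "ffn_up",
	"mlp.fc2", "ffn_down",
	"multi_modal_projector", "mm.0",
)

// renameTensor (hypothetical helper) maps a source tensor name to its GGUF name.
func renameTensor(name string) string {
	return mllamaReplacer.Replace(name)
}

func main() {
	// Example names are illustrative, not taken from a real checkpoint listing.
	for _, n := range []string{
		"vision_model.global_transformer.layers.0.gate_attn",
		"vision_model.transformer.layers.3.mlp.fc1.weight",
		"multi_modal_projector.weight",
	} {
		fmt.Println(n, "->", renameTensor(n))
	}
}
```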

Architecture-Specific Hyperparameters

The GGUF metadata is written under the mllama.* namespace, inheriting Llama text parameters (rewritten from llama. to mllama.) plus:

  • mllama.attention.cross_attention_layers -- indices of layers with cross-attention
  • mllama.vision.block_count -- local vision transformer layers
  • mllama.vision.global.block_count -- global vision transformer layers
  • mllama.vision.intermediate_layers_indices -- indices of intermediate feature layers
  • mllama.vision.embedding_length, mllama.vision.feed_forward_length
  • mllama.vision.attention.head_count, mllama.vision.attention.layer_norm_epsilon
  • mllama.vision.image_size, mllama.vision.patch_size, mllama.vision.max_num_tiles, mllama.vision.num_channels
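The namespace rewrite from llama. to mllama. can be pictured as a key rename over a metadata map. The sketch below assumes a plain map[string]any; the helper name renameArch and the sample values are hypothetical and do not reflect the converter's real types.

```go
package main

import (
	"fmt"
	"strings"
)

// renameArch (hypothetical helper) returns a copy of kv in which every key
// starting with "llama." is moved under "mllama.". Other keys are unchanged.
func renameArch(kv map[string]any) map[string]any {
	out := make(map[string]any, len(kv))
	for k, v := range kv {
		if strings.HasPrefix(k, "llama.") {
			k = "mllama." + strings.TrimPrefix(k, "llama.")
		}
		out[k] = v
	}
	return out
}

func main() {
	// Sample values are placeholders, not real model hyperparameters.
	kv := renameArch(map[string]any{
		"llama.block_count": 40,
		"general.name":      "example",
	})
	// Vision keys are then added alongside the rewritten text keys.
	kv["mllama.vision.block_count"] = 32
	fmt.Println(len(kv), kv["mllama.block_count"])
}
```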

Special Handling

Gate Tensor Tanh Transformation

Gate tensors (attn_gate, ffn_gate, and positional embedding gates) are passed through a tanh activation during conversion. For the position_embd.gate tensor specifically, the value is further transformed to 1 - tanh(x). This pre-computation avoids runtime tanh operations.
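As a rough illustration of the pre-computation, the two element-wise forms might look like this. The function names and the flat []float32 representation are assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"math"
)

// tanhGate returns tanh(x) for each element, the form described for
// attn_gate and ffn_gate tensors.
func tanhGate(data []float32) []float32 {
	out := make([]float32, len(data))
	for i, x := range data {
		out[i] = float32(math.Tanh(float64(x)))
	}
	return out
}

// oneMinusTanhGate returns 1 - tanh(x) for each element, the form
// described for the position_embd.gate tensor.
func oneMinusTanhGate(data []float32) []float32 {
	out := tanhGate(data)
	for i := range out {
		out[i] = 1 - out[i]
	}
	return out
}

func main() {
	gate := []float32{0, 1, -1}
	fmt.Println(tanhGate(gate))
	fmt.Println(oneMinusTanhGate(gate))
}
```

Storing the transformed values means a runtime only has to multiply by the gate, rather than evaluate tanh on every forward pass.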

Position Embedding Gate Duplication

The v.position_embd.gate tensor is duplicated to also create v.tile_position_embd.gate, with the original getting the 1 - tanh(x) transformation and the duplicate getting the standard tanh(x) transformation.
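The duplication step can be sketched as one input tensor producing two outputs. The gateTensor type and splitPositionGate function are hypothetical stand-ins for whatever tensor type the converter uses.

```go
package main

import (
	"fmt"
	"math"
)

// gateTensor is a hypothetical, simplified tensor: a name plus flat data.
type gateTensor struct {
	Name string
	Data []float32
}

// splitPositionGate derives both gates from the source position gate:
// the original keeps its name and stores 1 - tanh(x), while the duplicate
// is named v.tile_position_embd.gate and stores tanh(x).
func splitPositionGate(src gateTensor) (pos, tile gateTensor) {
	pos = gateTensor{Name: "v.position_embd.gate", Data: make([]float32, len(src.Data))}
	tile = gateTensor{Name: "v.tile_position_embd.gate", Data: make([]float32, len(src.Data))}
	for i, x := range src.Data {
		t := float32(math.Tanh(float64(x)))
		pos.Data[i] = 1 - t
		tile.Data[i] = t
	}
	return pos, tile
}

func main() {
	pos, tile := splitPositionGate(gateTensor{Name: "v.position_embd.gate", Data: []float32{0.5}})
	fmt.Println(pos.Name, pos.Data)
	fmt.Println(tile.Name, tile.Data)
}
```

By construction the two gates sum to 1 element-wise, so they act as complementary weights.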

Vision Q/K Repacking

Vision encoder Q and K weight tensors undergo the same interleaved-to-contiguous head permutation as the text model, using the vision attention head count.
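One common way to express this kind of head permutation is to view the weight rows as [heads][2][half] and swap the two inner axes. The sketch below assumes that axis order and a row-major flat buffer; the text above does not spell out either detail, so treat permuteHeads as an illustration rather than the converter's code.

```go
package main

import "fmt"

// permuteHeads reorders the rows of a row-major [rows x cols] matrix.
// Within each head, rows are viewed as [2][half] and rewritten as [half][2],
// so row i of the first half ends up next to row i of the second half.
// The axis order is an assumption of this sketch.
func permuteHeads(data []float32, rows, cols, heads int) ([]float32, error) {
	if heads <= 0 || cols <= 0 || rows%(2*heads) != 0 || len(data) != rows*cols {
		return nil, fmt.Errorf("invalid shape %dx%d for %d heads", rows, cols, heads)
	}
	half := rows / heads / 2
	out := make([]float32, len(data))
	for h := 0; h < heads; h++ {
		for p := 0; p < 2; p++ {
			for i := 0; i < half; i++ {
				src := h*2*half + p*half + i // position in [heads][2][half]
				dst := h*2*half + i*2 + p    // position in [heads][half][2]
				copy(out[dst*cols:(dst+1)*cols], data[src*cols:(src+1)*cols])
			}
		}
	}
	return out, nil
}

func main() {
	// Eight single-column rows and two heads; each value is its source row index.
	out, err := permuteHeads([]float32{0, 1, 2, 3, 4, 5, 6, 7}, 8, 1, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

For the vision encoder, heads would be the vision attention head count rather than the text model's.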

Text/Vision Tensor Separation

Tensors are separated into vision (v. and mm. prefixed) and text categories. Text tensors are processed through the inherited Llama tensor handler including Q/K repacking. Vision tensors receive gate transformations and Q/K repacking as needed.
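The split can be sketched as a prefix check on already-renamed tensor names. The helper name splitTensors and the use of plain name strings are assumptions made for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTensors (hypothetical helper) partitions renamed tensor names into
// vision tensors (prefixed with "v." or "mm.") and text tensors (the rest).
func splitTensors(names []string) (vision, text []string) {
	for _, n := range names {
		if strings.HasPrefix(n, "v.") || strings.HasPrefix(n, "mm.") {
			vision = append(vision, n)
		} else {
			text = append(text, n)
		}
	}
	return vision, text
}

func main() {
	// Example names are illustrative.
	vision, text := splitTensors([]string{
		"v.blk.0.attn_q.weight",
		"mm.0.weight",
		"blk.0.attn_q.weight",
	})
	fmt.Println("vision:", vision)
	fmt.Println("text:", text)
}
```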

Implementation Notes

The conversion is implemented in convert/convert_mllama.go via the mllamaModel struct which embeds llamaModel in its TextModel field. The repack method returns a closure that dispatches between the head permutation logic (for Q/K weights) and the tanh transformation logic (for gates) based on tensor name suffix matching.
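The closure-based dispatch might be shaped roughly as follows. The transform type, the newRepack constructor, and the exact suffix strings are assumptions for illustration; the real method's signature and matching rules may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// transform is a hypothetical stand-in for a per-tensor data rewrite.
type transform func([]float32) []float32

// newRepack returns a closure that picks a transform from the tensor name
// suffix: Q/K weights go to permute, gate tensors go to gate, and every
// other tensor is passed through unchanged. The suffixes are assumed.
func newRepack(permute, gate transform) func(name string, data []float32) []float32 {
	return func(name string, data []float32) []float32 {
		switch {
		case strings.HasSuffix(name, "attn_q.weight"), strings.HasSuffix(name, "attn_k.weight"):
			return permute(data)
		case strings.HasSuffix(name, "attn_gate"), strings.HasSuffix(name, "ffn_gate"), strings.HasSuffix(name, ".gate"):
			return gate(data)
		default:
			return data
		}
	}
}

func main() {
	// Stub transforms that only report which branch was taken.
	repack := newRepack(
		func(d []float32) []float32 { fmt.Println("permute"); return d },
		func(d []float32) []float32 { fmt.Println("gate"); return d },
	)
	repack("v.blk.0.attn_q.weight", nil)
	repack("v.blk.0.attn_gate", nil)
}
```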
