Principle:Ollama Ollama GGUF Model Conversion MLLama
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, Multimodal |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Meta Llama multimodal (MLLama) conversion handles a vision-language architecture that combines a Llama text backbone, cross-attention layers, and a multi-scale vision encoder with local and global transformer layers. The converter transforms the complete pipeline from HuggingFace SafeTensors to GGUF format, applying tanh transformations to gate tensors and Q/K repacking to the vision encoder.
Core Concepts
Tensor Name Mapping
The converter inherits the base Llama replacements and adds:
- language_model. -> (stripped)
- gate_attn -> attn_gate
- gate_ffn -> ffn_gate
- cross_attn. -> cross_attn_
- vision_model -> v
- class_embedding -> class_embd
- patch_embedding -> patch_embd
- gated_positional_embedding.tile_embedding -> tile_position_embd
- gated_positional_embedding.embedding -> position_embd.weight
- pre_tile_positional_embedding -> pre_tile_position_embd
- post_tile_positional_embedding -> post_tile_position_embd
- global_transformer.layers -> global.blk
- transformer.layers -> blk
- mlp.fc1 -> ffn_up
- mlp.fc2 -> ffn_down
- multi_modal_projector -> mm.0
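As an illustration of how such a replacement table behaves, the sketch below feeds the MLLama-specific pairs listed above into Go's strings.NewReplacer. The helper name renameTensor and the sample tensor names are assumptions made for this example, and the inherited base Llama replacements are left out, so text-side names are only partly rewritten here.

```go
package main

import (
	"fmt"
	"strings"
)

// mllamaReplacer holds only the MLLama-specific pairs from the list above.
// The inherited base Llama replacements are deliberately omitted in this sketch.
var mllamaReplacer = strings.NewReplacer(
	"language_model.", "",
	"gate_attn", "attn_gate",
	"gate_ffn", "ffn_gate",
	"cross_attn.", "cross_attn_",
	"vision_model", "v",
	"class_embedding", "class_embd",
	"patch_embedding", "patch_embd",
	"gated_positional_embedding.tile_embedding", "tile_position_embd",
	"gated_positional_embedding.embedding", "position_embd.weight",
	"pre_tile_positional_embedding", "pre_tile_position_embd",
	"post_tile_positional_embedding", "post_tile_position_embd",
	"global_transformer.layers", "global.blk",
	"transformer.layers", "blk",
	"mlp.fc1", "ffn_up",
	"mlp.fc2", "ffn_down",
	"multi_modal_projector", "mm.0",
)

// renameTensor is a hypothetical helper that applies all replacements in a
// single left-to-right pass over the tensor name.
func renameTensor(name string) string {
	return mllamaReplacer.Replace(name)
}

func main() {
	// Sample names chosen for illustration.
	for _, name := range []string{
		"vision_model.global_transformer.layers.0.mlp.fc1.weight",
		"vision_model.transformer.layers.5.mlp.fc2.bias",
		"multi_modal_projector.weight",
	} {
		fmt.Printf("%s -> %s\n", name, renameTensor(name))
	}
}
```

Because the replacer scans each name from left to right without overlapping matches, global_transformer.layers is consumed as a whole before the shorter transformer.layers substring inside it can match.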
Architecture-Specific Hyperparameters
The GGUF metadata is written under the mllama.* namespace, inheriting Llama text parameters (rewritten from llama. to mllama.) plus:
- mllama.attention.cross_attention_layers: indices of layers with cross-attention
- mllama.vision.block_count: local vision transformer layers
- mllama.vision.global.block_count: global vision transformer layers
- mllama.vision.intermediate_layers_indices: indices of intermediate feature layers
- mllama.vision.embedding_length, feed_forward_length
- mllama.vision.attention.head_count, layer_norm_epsilon
- mllama.vision.image_size, patch_size, max_num_tiles, num_channels
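The namespace rewrite can be pictured as a prefix substitution over a key-value map, followed by the addition of the vision keys. The sketch below is only an illustration: renameNamespace is a hypothetical helper, the map type is an assumption, and all values are placeholders rather than real model parameters.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renameNamespace is a hypothetical helper. It copies kv and rewrites every key
// that starts with oldPrefix so that it starts with newPrefix instead.
// Keys without the prefix are copied unchanged.
func renameNamespace(kv map[string]any, oldPrefix, newPrefix string) map[string]any {
	out := make(map[string]any, len(kv))
	for k, v := range kv {
		if strings.HasPrefix(k, oldPrefix) {
			k = newPrefix + strings.TrimPrefix(k, oldPrefix)
		}
		out[k] = v
	}
	return out
}

func main() {
	// Placeholder values for illustration only.
	text := map[string]any{
		"llama.block_count":          uint32(4),
		"llama.attention.head_count": uint32(8),
	}

	kv := renameNamespace(text, "llama.", "mllama.")
	kv["mllama.attention.cross_attention_layers"] = []int32{1, 3}
	kv["mllama.vision.block_count"] = uint32(2)
	kv["mllama.vision.global.block_count"] = uint32(1)

	// Print in sorted order so the output is stable.
	keys := make([]string, 0, len(kv))
	for k := range kv {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s = %v\n", k, kv[k])
	}
}
```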
Special Handling
Gate Tensor Tanh Transformation
Gate tensors (attn_gate, ffn_gate, and positional embedding gates) are passed through a tanh activation during conversion. For the position_embd.gate tensor specifically, the value is further transformed to 1 - tanh(x). This pre-computation avoids runtime tanh operations.
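The two element-wise transformations described above can be sketched as follows. The function names are hypothetical, and the tensors are assumed to be flat float32 slices for the purposes of this example.

```go
package main

import (
	"fmt"
	"math"
)

// tanhGate returns tanh(x) for every element of a gate tensor.
func tanhGate(data []float32) []float32 {
	out := make([]float32, len(data))
	for i, x := range data {
		out[i] = float32(math.Tanh(float64(x)))
	}
	return out
}

// oneMinusTanhGate returns 1 - tanh(x) for every element, the form described
// for the position_embd.gate tensor.
func oneMinusTanhGate(data []float32) []float32 {
	out := make([]float32, len(data))
	for i, x := range data {
		out[i] = 1 - float32(math.Tanh(float64(x)))
	}
	return out
}

func main() {
	gate := []float32{0, 0.5, -1} // sample values for illustration
	fmt.Println(tanhGate(gate))
	fmt.Println(oneMinusTanhGate(gate))
}
```

Since the stored value already includes the activation, a runtime that reads these tensors can multiply by them directly.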
Position Embedding Gate Duplication
The v.position_embd.gate tensor is duplicated to also create v.tile_position_embd.gate, with the original getting the 1 - tanh(x) transformation and the duplicate getting the standard tanh(x) transformation.
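A minimal sketch of this duplication step is shown below. The gateTensor type and the splitPositionGate function are hypothetical names introduced for illustration; only the two tensor names and the two formulas come from the description above.

```go
package main

import (
	"fmt"
	"math"
)

// gateTensor is a hypothetical minimal tensor: a name plus flat float32 data.
type gateTensor struct {
	Name string
	Data []float32
}

// splitPositionGate takes the raw position embedding gate values and returns
// two tensors: the original name with 1 - tanh(x), and the duplicate with tanh(x).
func splitPositionGate(raw []float32) (pos, tile gateTensor) {
	pos = gateTensor{Name: "v.position_embd.gate", Data: make([]float32, len(raw))}
	tile = gateTensor{Name: "v.tile_position_embd.gate", Data: make([]float32, len(raw))}
	for i, x := range raw {
		t := float32(math.Tanh(float64(x)))
		pos.Data[i] = 1 - t
		tile.Data[i] = t
	}
	return pos, tile
}

func main() {
	pos, tile := splitPositionGate([]float32{0.25}) // sample value for illustration
	fmt.Println(pos.Name, pos.Data)
	fmt.Println(tile.Name, tile.Data)
}
```

One consequence of this pairing is that the two stored values sum to 1 for each element, up to floating-point rounding.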
Vision Q/K Repacking
Vision encoder Q and K weight tensors undergo the same interleaved-to-contiguous head permutation as the text model, using the vision attention head count.
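A per-head row permutation of this kind can be sketched on a flat row-major matrix. The exact layout is an assumption here: the sketch views each head's rows as two halves of size headDim/2 and swaps the half axis with the within-half axis (a reshape to [heads, 2, headDim/2, cols] followed by a transpose of the two middle axes). The function name permuteHeads is hypothetical, and the converter's actual direction and layout may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// permuteHeads reorders the rows of a row-major [rows x cols] matrix.
// Within each head, the row at position j*half+i (j in {0,1}) moves to
// position i*2+j, where half is headDim/2.
func permuteHeads(data []float32, rows, cols, heads int) ([]float32, error) {
	if heads <= 0 || cols <= 0 || rows%(2*heads) != 0 {
		return nil, errors.New("rows must be a positive multiple of 2*heads")
	}
	if len(data) != rows*cols {
		return nil, errors.New("data length does not match rows*cols")
	}

	headDim := rows / heads
	half := headDim / 2
	out := make([]float32, len(data))
	for h := 0; h < heads; h++ {
		for j := 0; j < 2; j++ {
			for i := 0; i < half; i++ {
				src := h*headDim + j*half + i
				dst := h*headDim + i*2 + j
				copy(out[dst*cols:(dst+1)*cols], data[src*cols:(src+1)*cols])
			}
		}
	}
	return out, nil
}

func main() {
	// 6 rows, 1 column, 1 head: each value doubles as its original row index.
	in := []float32{0, 1, 2, 3, 4, 5}
	out, err := permuteHeads(in, 6, 1, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // prints [0 3 1 4 2 5]
}
```

For the vision encoder, the heads argument would be the vision attention head count rather than the text model's head count.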
Text/Vision Tensor Separation
Tensors are separated into vision (v. and mm. prefixed) and text categories. Text tensors are processed through the inherited Llama tensor handler including Q/K repacking. Vision tensors receive gate transformations and Q/K repacking as needed.
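The prefix-based split can be sketched as a simple partition over converted tensor names. The function names and the sample names below are assumptions for illustration; only the v. and mm. prefixes come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// isVisionTensor reports whether a converted tensor name carries one of the
// vision-side prefixes.
func isVisionTensor(name string) bool {
	return strings.HasPrefix(name, "v.") || strings.HasPrefix(name, "mm.")
}

// splitTensors partitions names into text and vision groups, preserving order.
func splitTensors(names []string) (text, vision []string) {
	for _, n := range names {
		if isVisionTensor(n) {
			vision = append(vision, n)
		} else {
			text = append(text, n)
		}
	}
	return text, vision
}

func main() {
	// Sample names for illustration.
	text, vision := splitTensors([]string{
		"blk.0.attn_q.weight",
		"v.blk.0.attn_q.weight",
		"mm.0.weight",
		"v.position_embd.gate",
	})
	fmt.Println("text:", text)
	fmt.Println("vision:", vision)
}
```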
Implementation Notes
The conversion is implemented in convert/convert_mllama.go via the mllamaModel struct, whose TextModel field embeds llamaModel. The repack method returns a closure that dispatches on the tensor name suffix: Q/K weights go to the head permutation logic, and gate tensors go to the tanh transformation logic.
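The closure-based dispatch can be sketched as below. This is not the code from convert_mllama.go: the repacker type, the newRepacker constructor, the injected handlers, and the exact suffix strings are all assumptions chosen to show the pattern.

```go
package main

import (
	"fmt"
	"strings"
)

// repacker is a hypothetical per-tensor transform: name, flat data, and shape
// in; transformed data out.
type repacker func(name string, data []float32, shape []uint64) ([]float32, error)

// newRepacker returns a closure that picks a handler by tensor name suffix.
// The suffixes below are assumptions for this sketch.
func newRepacker(permute, gate repacker) repacker {
	return func(name string, data []float32, shape []uint64) ([]float32, error) {
		switch {
		case strings.HasSuffix(name, "attn_q.weight"),
			strings.HasSuffix(name, "attn_k.weight"):
			return permute(name, data, shape)
		case strings.HasSuffix(name, "attn_gate"),
			strings.HasSuffix(name, "ffn_gate"),
			strings.HasSuffix(name, ".gate"):
			return gate(name, data, shape)
		default:
			// Tensors that match neither group pass through unchanged.
			return data, nil
		}
	}
}

func main() {
	// Stub handlers that only report which branch ran.
	permute := func(name string, data []float32, _ []uint64) ([]float32, error) {
		fmt.Println("permute:", name)
		return data, nil
	}
	gate := func(name string, data []float32, _ []uint64) ([]float32, error) {
		fmt.Println("gate:", name)
		return data, nil
	}

	repack := newRepacker(permute, gate)
	for _, name := range []string{"v.blk.0.attn_q.weight", "v.blk.0.attn_gate", "v.blk.0.ffn_up.weight"} {
		if _, err := repack(name, nil, nil); err != nil {
			fmt.Println("error:", err)
		}
	}
}
```

Injecting the two handlers keeps the dispatch logic separate from the numeric work, which makes each piece easy to test on its own.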