Principle:Ollama Ollama GGUF Model Conversion MLLama

Knowledge Sources	Ollama
Domains	Model Conversion, Multimodal
Last Updated	2025-02-15 00:00 GMT

Overview

MLLama (Meta Llama multimodal) conversion handles a vision-language architecture that combines a Llama text backbone, cross-attention layers, and a multi-scale vision encoder with local and global transformer layers. The converter transforms the complete pipeline from HuggingFace SafeTensors to GGUF format, applying tanh transformations to gate tensors and repacking the Q/K weights of the vision encoder.

Core Concepts

Tensor Name Mapping

The converter inherits the base Llama replacements and adds:

language_model. -> (stripped)
gate_attn -> attn_gate
gate_ffn -> ffn_gate
cross_attn. -> cross_attn_
vision_model -> v
class_embedding -> class_embd
patch_embedding -> patch_embd
gated_positional_embedding.tile_embedding -> tile_position_embd
gated_positional_embedding.embedding -> position_embd.weight
pre_tile_positional_embedding -> pre_tile_position_embd
post_tile_positional_embedding -> post_tile_position_embd
global_transformer.layers -> global.blk
transformer.layers -> blk
mlp.fc1 -> ffn_up
mlp.fc2 -> ffn_down
multi_modal_projector -> mm.0
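
The mapping above can be sketched as a single substring replacer. This is a minimal illustration, assuming the substitutions are applied as plain substring replacements in one pass; the inherited base Llama replacements are deliberately left out, and the names mllamaReplacer and renameTensor are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// mllamaReplacer holds only the MLLama-specific substitutions listed above.
// The inherited base Llama replacements are omitted from this sketch.
var mllamaReplacer = strings.NewReplacer(
	"language_model.", "",
	"gate_attn", "attn_gate",
	"gate_ffn", "ffn_gate",
	"cross_attn.", "cross_attn_",
	"vision_model", "v",
	"class_embedding", "class_embd",
	"patch_embedding", "patch_embd",
	"gated_positional_embedding.tile_embedding", "tile_position_embd",
	"gated_positional_embedding.embedding", "position_embd.weight",
	"pre_tile_positional_embedding", "pre_tile_position_embd",
	"post_tile_positional_embedding", "post_tile_position_embd",
	"global_transformer.layers", "global.blk",
	"transformer.layers", "blk",
	"mlp.fc1", "ffn_up",
	"mlp.fc2", "ffn_down",
	"multi_modal_projector", "mm.0",
)

// renameTensor maps a HuggingFace-style tensor name to its GGUF-style name.
func renameTensor(name string) string {
	return mllamaReplacer.Replace(name)
}

func main() {
	for _, name := range []string{
		"vision_model.class_embedding",
		"vision_model.global_transformer.layers.0.mlp.fc1.weight",
		"vision_model.transformer.layers.2.gate_attn",
		"multi_modal_projector.bias",
	} {
		fmt.Printf("%s -> %s\n", name, renameTensor(name))
	}
}
```

Because strings.Replacer scans the input left to right and never re-scans its own output, "global_transformer.layers" and "transformer.layers" do not interfere with each other here.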

Architecture-Specific Hyperparameters

The GGUF metadata is written under the mllama.* namespace, inheriting Llama text parameters (rewritten from llama. to mllama.) plus:

mllama.attention.cross_attention_layers -- indices of layers with cross-attention
mllama.vision.block_count -- local vision transformer layers
mllama.vision.global.block_count -- global vision transformer layers
mllama.vision.intermediate_layers_indices -- indices of intermediate feature layers
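mllama.vision.embedding_length, mllama.vision.feed_forward_length -- vision encoder hidden and feed-forward sizes
mllama.vision.attention.head_count, mllama.vision.attention.layer_norm_epsilon -- vision attention head count and layer-norm epsilon
mllama.vision.image_size, mllama.vision.patch_size, mllama.vision.max_num_tiles, mllama.vision.num_channels -- image input geometry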
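
The namespace rewrite can be illustrated with a small helper. This is a sketch only: the function name renamespace, the map type, and every value below are assumptions chosen for illustration, not values read from a real checkpoint.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renamespace returns a copy of kv in which every key that starts with
// the prefix from is rewritten to start with the prefix to instead.
func renamespace(kv map[string]any, from, to string) map[string]any {
	out := make(map[string]any, len(kv))
	for k, v := range kv {
		if strings.HasPrefix(k, from) {
			k = to + strings.TrimPrefix(k, from)
		}
		out[k] = v
	}
	return out
}

func main() {
	// Hypothetical text-model metadata as a Llama converter might emit it.
	text := map[string]any{
		"llama.block_count":          uint32(40),
		"llama.attention.head_count": uint32(32),
	}

	kv := renamespace(text, "llama.", "mllama.")

	// MLLama-specific keys from the list above, with illustrative values.
	kv["mllama.attention.cross_attention_layers"] = []int32{3, 8, 13}
	kv["mllama.vision.block_count"] = uint32(32)
	kv["mllama.vision.global.block_count"] = uint32(8)

	keys := make([]string, 0, len(kv))
	for k := range kv {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, "=", kv[k])
	}
}
```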

Special Handling

Gate Tensor Tanh Transformation

Gate tensors (attn_gate, ffn_gate, and positional embedding gates) are passed through a tanh activation during conversion. For the position_embd.gate tensor specifically, the value is further transformed to 1 - tanh(x). This pre-computation avoids runtime tanh operations.

Position Embedding Gate Duplication

The v.position_embd.gate tensor is duplicated to also create v.tile_position_embd.gate, with the original getting the 1 - tanh(x) transformation and the duplicate getting the standard tanh(x) transformation.
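
The gate handling described in the two subsections above can be sketched as follows. The tensor type and the function names are hypothetical, and the sketch assumes the caller passes in gate tensors only.

```go
package main

import (
	"fmt"
	"math"
)

// tensor is a hypothetical, simplified stand-in for a converter tensor.
type tensor struct {
	Name string
	Data []float32
}

// gateValues applies tanh(x) element-wise; for the tensor named
// v.position_embd.gate it applies 1 - tanh(x) instead.
func gateValues(name string, data []float32) []float32 {
	out := make([]float32, len(data))
	for i, x := range data {
		t := float32(math.Tanh(float64(x)))
		if name == "v.position_embd.gate" {
			t = 1 - t
		}
		out[i] = t
	}
	return out
}

// convertGate returns the converted gate tensor(s). The position embedding
// gate yields two outputs: itself and a v.tile_position_embd.gate duplicate.
func convertGate(t tensor) []tensor {
	names := []string{t.Name}
	if t.Name == "v.position_embd.gate" {
		names = append(names, "v.tile_position_embd.gate")
	}
	out := make([]tensor, 0, len(names))
	for _, name := range names {
		out = append(out, tensor{Name: name, Data: gateValues(name, t.Data)})
	}
	return out
}

func main() {
	for _, t := range convertGate(tensor{Name: "v.position_embd.gate", Data: []float32{0, 0.5}}) {
		fmt.Println(t.Name, t.Data)
	}
}
```

For any input x the two outputs sum to 1, which matches the complementary weighting the two transformations imply.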

Vision Q/K Repacking

Vision encoder Q and K weight tensors undergo the same interleaved-to-contiguous head permutation as the text model, using the vision attention head count.
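
A per-head row permutation of this kind can be sketched as below. It assumes the pattern common to Llama converters (view each head's rows as [2, head_dim/2], swap the two axes, flatten); the function name, the flat row-major layout, and the exact direction used in convert_mllama.go are assumptions that should be checked against the source.

```go
package main

import "fmt"

// permuteHeads reorders the rows of a row-major [rows x cols] weight matrix.
// Each head's rows are viewed as [2, headDim/2]; swapping those two axes moves
// source row (p*half + i) to destination row (i*2 + p) within the head.
func permuteHeads(data []float32, rows, cols, heads int) ([]float32, error) {
	if heads <= 0 || cols <= 0 || rows%(2*heads) != 0 || len(data) != rows*cols {
		return nil, fmt.Errorf("invalid shape: rows=%d cols=%d heads=%d len=%d",
			rows, cols, heads, len(data))
	}
	headDim := rows / heads
	half := headDim / 2
	out := make([]float32, len(data))
	for h := 0; h < heads; h++ {
		for p := 0; p < 2; p++ {
			for i := 0; i < half; i++ {
				src := h*headDim + p*half + i
				dst := h*headDim + i*2 + p
				copy(out[dst*cols:(dst+1)*cols], data[src*cols:(src+1)*cols])
			}
		}
	}
	return out, nil
}

func main() {
	// 8 rows, 1 column, 2 heads; each row stores its own index.
	data := []float32{0, 1, 2, 3, 4, 5, 6, 7}
	out, err := permuteHeads(data, 8, 1, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints [0 2 1 3 4 6 5 7]
}
```

For the vision encoder, the heads argument would be the vision attention head count rather than the text model's.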

Text/Vision Tensor Separation

Tensors are separated into vision (v. and mm. prefixed) and text categories. Text tensors are processed through the inherited Llama tensor handler including Q/K repacking. Vision tensors receive gate transformations and Q/K repacking as needed.
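
The separation step amounts to a prefix check on the already-renamed tensor names. A minimal sketch, with a hypothetical function name and plain strings standing in for tensor objects:

```go
package main

import (
	"fmt"
	"strings"
)

// splitTensors partitions GGUF-style tensor names into text and vision groups.
// Names prefixed with "v." or "mm." are treated as vision tensors.
func splitTensors(names []string) (text, vision []string) {
	for _, name := range names {
		if strings.HasPrefix(name, "v.") || strings.HasPrefix(name, "mm.") {
			vision = append(vision, name)
		} else {
			text = append(text, name)
		}
	}
	return text, vision
}

func main() {
	// Illustrative names only; the text-side names are assumptions.
	text, vision := splitTensors([]string{
		"blk.0.attn_q.weight",
		"v.blk.0.attn_q.weight",
		"mm.0.weight",
		"v.position_embd.gate",
	})
	fmt.Println("text:", text)
	fmt.Println("vision:", vision)
}
```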

Implementation Notes

The conversion is implemented in convert/convert_mllama.go by the mllamaModel struct, whose TextModel field embeds llamaModel. The repack method returns a closure that, based on tensor name suffix matching, dispatches to either the head permutation logic (for Q/K weights) or the tanh transformation logic (for gates).
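
The closure-based dispatch can be sketched as below. The Repacker signature, the function name newRepacker, and the exact suffix strings are assumptions made for illustration; the two transformations are passed in as parameters so the sketch shows only the dispatch itself.

```go
package main

import (
	"fmt"
	"strings"
)

// Repacker is a hypothetical signature for a per-tensor data transformation.
type Repacker func(data []float32) ([]float32, error)

// newRepacker captures the tensor name and returns a closure that picks a
// transformation by suffix: Q/K weights go to permuteHeads, all other
// tensors handed to it (the gates) go to tanhGate.
func newRepacker(name string, permuteHeads, tanhGate Repacker) Repacker {
	return func(data []float32) ([]float32, error) {
		// Suffix strings are assumed, not taken from the source file.
		if strings.HasSuffix(name, "attn_q.weight") || strings.HasSuffix(name, "attn_k.weight") {
			return permuteHeads(data)
		}
		return tanhGate(data)
	}
}

func main() {
	// Stub transformations that only report which branch ran.
	permute := func(d []float32) ([]float32, error) {
		fmt.Println("head permutation")
		return d, nil
	}
	gate := func(d []float32) ([]float32, error) {
		fmt.Println("tanh transformation")
		return d, nil
	}

	for _, name := range []string{"v.blk.0.attn_q.weight", "v.blk.0.attn_gate"} {
		fmt.Print(name, ": ")
		if _, err := newRepacker(name, permute, gate)(nil); err != nil {
			panic(err)
		}
	}
}
```

Capturing the name in the closure lets each tensor carry its own transformation until its data is actually written out.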

Related Pages

Implementation:Ollama_Ollama_Convert_MLLama
