Implementation:Ollama Ollama Convert MLLama
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, GGUF Format |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the GGUF model converter for the Meta MLlama (Llama 3.2 Vision) multimodal architecture, handling cross-attention layers, gated positional embeddings, and tanh-based gate tensor repacking.
Description
The mllamaModel struct embeds llamaModel for the text model and adds the cross-attention layer indices and a full vision encoder configuration.
The KV method reuses the KV generation of the embedded llamaModel, remapping llama.* keys to mllama.*, and adds vision-specific metadata: block counts, global layer counts, intermediate layer indices, image, patch, and tile sizes, and the norm epsilon.
The Tensors method handles three special cases:
- v.position_embd.gate is split into a position gate (transformed with 1 - tanh) and a tile gate (transformed with tanh).
- Vision Q/K weights are repacked with interleaved head reordering.
- The pre- and post-tile position embedding gates are transformed with tanh.
Text tensors are delegated to the embedded llamaModel.Tensors.
Usage
Invoked automatically when the model's architecture matches MllamaForConditionalGeneration.
Code Reference
Source Location
- Repository: Ollama
- File: convert/convert_mllama.go
- Lines: 1-179
Signature
type mllamaModel struct {
	ModelParameters
	TextModel struct {
		llamaModel
		CrossAttentionLayers []int32 `json:"cross_attention_layers"`
	} `json:"text_config"`
	VisionModel struct {
		NumHiddenLayers           uint32  `json:"num_hidden_layers"`
		NumGlobalLayers           uint32  `json:"num_global_layers"`
		IntermediateLayersIndices []int32 `json:"intermediate_layers_indices"`
		ImageSize                 uint32  `json:"image_size"`
		MaxNumTiles               uint32  `json:"max_num_tiles"`
	} `json:"vision_config"`
}
func (m *mllamaModel) KV(t *Tokenizer) KV
func (m *mllamaModel) Replacements() []string
func (m *mllamaModel) Tensors(ts []Tensor) []*ggml.Tensor
func (m *mllamaModel) repack(name string) Repacker
Import
import "github.com/ollama/ollama/convert"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| t | *Tokenizer | Yes | Tokenizer data for GGUF metadata |
| ts | []Tensor | Yes | Source tensors from text model, vision encoder, and cross-attention layers |
Outputs
| Name | Type | Description |
|---|---|---|
| KV | KV | GGUF metadata with mllama.* keys for text, vision, and cross-attention config |
| Tensors | []*ggml.Tensor | Converted tensors with tanh-transformed gates and repacked Q/K weights |
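
The mllama.* keys in the KV output come from renaming keys produced by the embedded text model. A minimal sketch of such a prefix rename follows. The helper remapPrefix, the map type, and the sample keys are assumptions made for illustration; the converter's actual KV type and renaming logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// remapPrefix returns a copy of kv in which every key starting with from
// has that prefix replaced by to. Other keys are copied unchanged.
// Hypothetical helper, not part of the convert package.
func remapPrefix(kv map[string]any, from, to string) map[string]any {
	out := make(map[string]any, len(kv))
	for k, v := range kv {
		if strings.HasPrefix(k, from) {
			k = to + strings.TrimPrefix(k, from)
		}
		out[k] = v
	}
	return out
}

func main() {
	// Sample keys are illustrative only.
	textKV := map[string]any{
		"llama.block_count": 40,
		"general.name":      "example",
	}
	out := remapPrefix(textKV, "llama.", "mllama.")
	fmt.Println(out["mllama.block_count"], out["general.name"])
}
```

Building a new map rather than renaming in place avoids mutating a map while iterating over it.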
Usage Examples
// Converter registered for MllamaForConditionalGeneration
// v.position_embd.gate is split into position_embd.gate (1-tanh) and tile_position_embd.gate (tanh)
// Vision Q/K weights get interleaved head reordering
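
One way to picture interleaved head reordering is as a reshape of the weight rows to [heads, 2, headDim/2, cols], followed by swapping the two middle axes. The sketch below implements that permutation directly on a row-major float32 slice. The axis order is an assumption made for illustration, and interleaveHeads is a hypothetical name; the converter's repack method may use a different layout.

```go
package main

import "fmt"

// interleaveHeads reorders the rows of a row-major [rows x cols] matrix.
// Within each head, rows from the first and second halves are interleaved:
// output row 2*i+j of a head is source row j*half+i, where half = headDim/2.
// Hypothetical helper illustrating the permutation only.
func interleaveHeads(data []float32, rows, cols, heads int) ([]float32, error) {
	if heads <= 0 || rows%heads != 0 || (rows/heads)%2 != 0 {
		return nil, fmt.Errorf("cannot split %d rows into %d heads of even size", rows, heads)
	}
	if len(data) != rows*cols {
		return nil, fmt.Errorf("data length %d does not match %dx%d", len(data), rows, cols)
	}
	headDim := rows / heads
	half := headDim / 2
	out := make([]float32, len(data))
	for h := 0; h < heads; h++ {
		for i := 0; i < half; i++ {
			for j := 0; j < 2; j++ {
				src := h*headDim + j*half + i
				dst := h*headDim + i*2 + j
				copy(out[dst*cols:(dst+1)*cols], data[src*cols:(src+1)*cols])
			}
		}
	}
	return out, nil
}

func main() {
	// One head, four rows, one column: row r holds the value r.
	out, err := interleaveHeads([]float32{0, 1, 2, 3}, 4, 1, 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints [0 2 1 3]
}
```

Rows 0 and 1 form the first half and rows 2 and 3 the second, so the interleaved order is 0, 2, 1, 3.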