# Ollama GGUF Model Conversion: Mistral
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, Mistral |
| Last Updated | 2025-02-15 00:00 GMT |
## Overview
The Mistral conversion path (for Mistral 3 multimodal models) handles the Mistral architecture, including sliding window attention, vision encoder integration, the multimodal projector, and advanced RoPE scaling configurations (YaRN, mscale, llama4_scaling_beta), transforming the complete vision-language model from HuggingFace SafeTensors to GGUF format.
## Core Concepts

### Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
- `language_model.model.norm` -> `output_norm`
- `language_model.model.` / `language_model.` -> (stripped)
- `layers` -> `blk`
- `vision_tower` -> `v`
- `ln_pre` -> `encoder_norm`
- `embed_tokens` -> `token_embd`
- `self_attn.{q,k,v}_proj` -> `attn_{q,k,v}`
- `self_attn.o_proj` -> `attn_output`
- `attention.{q,k,v}_proj` -> `attn_{q,k,v}` (alternate naming)
- `feed_forward.{gate,down,up}_proj` -> `ffn_{gate,down,up}` (alternate naming)
- `multi_modal_projector` -> `mm`
- `lm_head` -> `output`
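A minimal sketch of how such an ordered replacement table could be applied; this is illustrative only (the real mapping lives in `convert/convert_mistral.go`), and only a subset of the pairs above is shown:

```go
package main

import (
	"fmt"
	"strings"
)

// renameTensor applies the HuggingFace-to-GGUF replacements from the list
// above, in order. Prefix-stripping rules run before the shorter
// substitutions so the longest match wins.
func renameTensor(name string) string {
	replacements := []struct{ from, to string }{
		{"language_model.model.norm", "output_norm"},
		{"language_model.model.", ""},
		{"language_model.", ""},
		{"layers", "blk"},
		{"vision_tower", "v"},
		{"ln_pre", "encoder_norm"},
		{"embed_tokens", "token_embd"},
		{"self_attn.q_proj", "attn_q"},
		{"self_attn.k_proj", "attn_k"},
		{"self_attn.v_proj", "attn_v"},
		{"self_attn.o_proj", "attn_output"},
		{"attention.q_proj", "attn_q"},
		{"attention.k_proj", "attn_k"},
		{"attention.v_proj", "attn_v"},
		{"feed_forward.gate_proj", "ffn_gate"},
		{"feed_forward.down_proj", "ffn_down"},
		{"feed_forward.up_proj", "ffn_up"},
		{"multi_modal_projector", "mm"},
		{"lm_head", "output"},
	}
	for _, r := range replacements {
		name = strings.ReplaceAll(name, r.from, r.to)
	}
	return name
}

func main() {
	fmt.Println(renameTensor("language_model.model.layers.0.self_attn.q_proj.weight"))
}
```

For example, `language_model.model.layers.0.self_attn.q_proj.weight` maps to `blk.0.attn_q.weight`.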
### Architecture-Specific Hyperparameters

The GGUF metadata is written under the `mistral3.*` namespace:
Text:
- `mistral3.vocab_size` -- vocabulary size
- `mistral3.block_count`, `context_length`, `embedding_length`, `feed_forward_length`
- `mistral3.attention.head_count`, `head_count_kv`, `key_length`, `value_length`
- `mistral3.rope.dimension_count` -- head dimension (or hidden_size / num_heads)
- `mistral3.rope.freq_base` -- RoPE theta
- `mistral3.rope.scaling.*` -- factor, type, beta_fast, beta_slow, mscale, mscale_all_dim, original_context_length
- `mistral3.rope.scaling_beta` -- Llama 4-style scaling beta
Vision:
- `mistral3.vision.block_count`, `embedding_length`, `feed_forward_length`
- `mistral3.vision.attention.head_count`, `key_length`
- `mistral3.vision.image_size`, `patch_size`, `num_channels`
- `mistral3.vision.rope.freq_base` -- separate RoPE theta for vision
Multimodal:
- `mistral3.image_token_index`, `spatial_merge_size`
- `mistral3.mm.projector_bias`, `projector_hidden_act`
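The metadata above amounts to a flat key-value store. A minimal sketch of assembling a few of these keys, using hypothetical config values (the numbers and the `buildKV` helper are assumptions, not the converter's actual API; key names follow the lists above):

```go
package main

import "fmt"

// buildKV assembles a small subset of the mistral3.* GGUF metadata keys
// from config values. Illustrative only; the real converter writes many
// more keys and reads them from the HuggingFace config.
func buildKV(vocab, blocks, ctxLen, embd uint32, ropeBase float32) map[string]any {
	return map[string]any{
		"general.architecture":      "mistral3",
		"mistral3.vocab_size":       vocab,
		"mistral3.block_count":      blocks,
		"mistral3.context_length":   ctxLen,
		"mistral3.embedding_length": embd,
		"mistral3.rope.freq_base":   ropeBase,
	}
}

func main() {
	// Hypothetical values for illustration.
	kv := buildKV(131072, 40, 131072, 5120, 1e6)
	fmt.Println(kv["mistral3.block_count"])
}
```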
## Special Handling

### Q/K Weight Repacking
Text-model Q and K weight tensors (vision tensors are left unchanged) are repacked from interleaved to contiguous head layout using the standard Llama-style permutation: reshape to `[heads, 2, head_dim/2, hidden]`, transpose the middle two axes to `[heads, head_dim/2, 2, hidden]`, then flatten back to the original shape.
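The permutation can be sketched as an index remap over a flat weight slice; this is a simplified illustration (element type and I/O plumbing omitted), not the converter's actual code:

```go
package main

import "fmt"

// permuteQK repacks a Q or K weight from interleaved to contiguous head
// layout: view the flat slice as [heads, 2, headDim/2, hidden], swap the
// middle two axes to [heads, headDim/2, 2, hidden], then flatten.
func permuteQK(w []float32, heads, headDim, hidden int) []float32 {
	half := headDim / 2
	out := make([]float32, len(w))
	for h := 0; h < heads; h++ {
		for i := 0; i < 2; i++ {
			for j := 0; j < half; j++ {
				for k := 0; k < hidden; k++ {
					// source index in [heads, 2, half, hidden] layout
					src := ((h*2+i)*half+j)*hidden + k
					// destination index in [heads, half, 2, hidden] layout
					dst := ((h*half+j)*2+i)*hidden + k
					out[dst] = w[src]
				}
			}
		}
	}
	return out
}

func main() {
	// Tiny example: 1 head, head_dim=4, hidden=1.
	// Interleaved rows [0 1 2 3] become [0 2 1 3].
	fmt.Println(permuteQK([]float32{0, 1, 2, 3}, 1, 4, 1))
}
```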
### Dual RoPE Configuration
The model supports separate RoPE parameters for the text and vision encoders. The text model's RoPE can use several scaling types, including YaRN; the optional `mscale` and `mscale_all_dim` parameters are represented as pointers, so a missing value can be distinguished from zero.
### Nested Config Structure
Parameters are organized under `text_config` and `vision_config`, with a separate `rope_parameters` sub-structure containing the RoPE scaling configuration.
## Implementation Notes
The conversion is implemented in `convert/convert_mistral.go` via the `mistral3Model` struct. The struct handles multiple naming conventions for tensor replacements (both `self_attn` and `attention` prefixes, both `mlp` and `feed_forward` prefixes) to support different checkpoint formats from Mistral AI.