# Ollama GGUF Model Conversion: Mistral
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, Mistral |
| Last Updated | 2025-02-15 00:00 GMT |
## Overview
The Mistral conversion path (for Mistral 3 multimodal models) handles the Mistral architecture, including sliding window attention, vision encoder integration, the multimodal projector, and advanced RoPE scaling configurations (YaRN, mscale, llama4_scaling_beta), transforming the complete vision-language model from HuggingFace SafeTensors to GGUF format.
## Core Concepts

### Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
- `language_model.model.norm` -> `output_norm`
- `language_model.model.` / `language_model.` -> (stripped)
- `layers` -> `blk`
- `vision_tower` -> `v`
- `ln_pre` -> `encoder_norm`
- `embed_tokens` -> `token_embd`
- `self_attn.{q,k,v}_proj` -> `attn_{q,k,v}`
- `self_attn.o_proj` -> `attn_output`
- `attention.{q,k,v}_proj` -> `attn_{q,k,v}` (alternate naming)
- `feed_forward.{gate,down,up}_proj` -> `ffn_{gate,down,up}` (alternate naming)
- `multi_modal_projector` -> `mm`
- `lm_head` -> `output`
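A minimal sketch of how such an ordered replacement table could be applied; this is illustrative only (the real mapping lives in `convert/convert_mistral.go`), and only a subset of the pairs above is shown:

```go
package main

import (
	"fmt"
	"strings"
)

// renameTensor applies the HuggingFace-to-GGUF replacements from the list
// above, in order. Prefix-stripping rules run before the shorter
// substitutions so the longest match wins.
func renameTensor(name string) string {
	replacements := []struct{ from, to string }{
		{"language_model.model.norm", "output_norm"},
		{"language_model.model.", ""},
		{"language_model.", ""},
		{"layers", "blk"},
		{"vision_tower", "v"},
		{"ln_pre", "encoder_norm"},
		{"embed_tokens", "token_embd"},
		{"self_attn.q_proj", "attn_q"},
		{"self_attn.k_proj", "attn_k"},
		{"self_attn.v_proj", "attn_v"},
		{"self_attn.o_proj", "attn_output"},
		{"attention.q_proj", "attn_q"},
		{"attention.k_proj", "attn_k"},
		{"attention.v_proj", "attn_v"},
		{"feed_forward.gate_proj", "ffn_gate"},
		{"feed_forward.down_proj", "ffn_down"},
		{"feed_forward.up_proj", "ffn_up"},
		{"multi_modal_projector", "mm"},
		{"lm_head", "output"},
	}
	for _, r := range replacements {
		name = strings.ReplaceAll(name, r.from, r.to)
	}
	return name
}

func main() {
	fmt.Println(renameTensor("language_model.model.layers.0.self_attn.q_proj.weight"))
}
```

For example, `language_model.model.layers.0.self_attn.q_proj.weight` maps to `blk.0.attn_q.weight`.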
### Architecture-Specific Hyperparameters

The GGUF metadata is written under the `mistral3.*` namespace:
Text:
- `mistral3.vocab_size` -- vocabulary size
- `mistral3.block_count`, `context_length`, `embedding_length`, `feed_forward_length`
- `mistral3.attention.head_count`, `head_count_kv`, `key_length`, `value_length`
- `mistral3.rope.dimension_count` -- head dimension (or hidden_size / num_heads)
- `mistral3.rope.freq_base` -- RoPE theta
- `mistral3.rope.scaling.*` -- factor, type, beta_fast, beta_slow, mscale, mscale_all_dim, original_context_length
- `mistral3.rope.scaling_beta` -- Llama 4-style scaling beta
Vision:
- `mistral3.vision.block_count`, `embedding_length`, `feed_forward_length`
- `mistral3.vision.attention.head_count`, `key_length`
- `mistral3.vision.image_size`, `patch_size`, `num_channels`
- `mistral3.vision.rope.freq_base` -- separate RoPE theta for vision
Multimodal:
- `mistral3.image_token_index`, `spatial_merge_size`
- `mistral3.mm.projector_bias`, `projector_hidden_act`
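The metadata above amounts to a flat key-value store. A minimal sketch of assembling a few of these keys, using hypothetical config values (the numbers and the `buildKV` helper are assumptions, not the converter's actual API; key names follow the lists above):

```go
package main

import "fmt"

// buildKV assembles a small subset of the mistral3.* GGUF metadata keys
// from config values. Illustrative only; the real converter writes many
// more keys and reads them from the HuggingFace config.
func buildKV(vocab, blocks, ctxLen, embd uint32, ropeBase float32) map[string]any {
	return map[string]any{
		"general.architecture":      "mistral3",
		"mistral3.vocab_size":       vocab,
		"mistral3.block_count":      blocks,
		"mistral3.context_length":   ctxLen,
		"mistral3.embedding_length": embd,
		"mistral3.rope.freq_base":   ropeBase,
	}
}

func main() {
	// Hypothetical values for illustration.
	kv := buildKV(131072, 40, 131072, 5120, 1e6)
	fmt.Println(kv["mistral3.block_count"])
}
```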
## Special Handling

### Q/K Weight Repacking
Text-model Q and K weight tensors (vision tensors are left unchanged) are repacked from interleaved to contiguous head layout using the standard Llama-style permutation: reshape to `[heads, 2, head_dim/2, hidden]`, transpose the middle two axes to `[heads, head_dim/2, 2, hidden]`, then flatten back to the original shape.
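The permutation can be sketched as an index remap over a flat weight slice; this is a simplified illustration (element type and I/O plumbing omitted), not the converter's actual code:

```go
package main

import "fmt"

// permuteQK repacks a Q or K weight from interleaved to contiguous head
// layout: view the flat slice as [heads, 2, headDim/2, hidden], swap the
// middle two axes to [heads, headDim/2, 2, hidden], then flatten.
func permuteQK(w []float32, heads, headDim, hidden int) []float32 {
	half := headDim / 2
	out := make([]float32, len(w))
	for h := 0; h < heads; h++ {
		for i := 0; i < 2; i++ {
			for j := 0; j < half; j++ {
				for k := 0; k < hidden; k++ {
					// source index in [heads, 2, half, hidden] layout
					src := ((h*2+i)*half+j)*hidden + k
					// destination index in [heads, half, 2, hidden] layout
					dst := ((h*half+j)*2+i)*hidden + k
					out[dst] = w[src]
				}
			}
		}
	}
	return out
}

func main() {
	// Tiny example: 1 head, head_dim=4, hidden=1.
	// Interleaved rows [0 1 2 3] become [0 2 1 3].
	fmt.Println(permuteQK([]float32{0, 1, 2, 3}, 1, 4, 1))
}
```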
### Dual RoPE Configuration
The model supports separate RoPE parameters for the text and vision encoders. The text model's RoPE can use several scaling types, including YaRN; the optional `mscale` and `mscale_all_dim` parameters are represented as pointers, so a missing value can be distinguished from zero.
### Nested Config Structure
Parameters are organized under `text_config` and `vision_config`, with a separate `rope_parameters` sub-structure containing the RoPE scaling configuration.
## Implementation Notes
The conversion is implemented in `convert/convert_mistral.go` via the `mistral3Model` struct. The struct handles multiple naming conventions for tensor replacements (both `self_attn` and `attention` prefixes, both `mlp` and `feed_forward` prefixes) to support different checkpoint formats from Mistral AI.