Principle:Ollama Ollama GGUF Model Conversion Qwen25Vl
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, Multimodal |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Qwen 2.5 VL (Vision-Language) conversion handles the Qwen 2.5 architecture extended with a ViT-based vision encoder. It transforms the complete model from HuggingFace SafeTensors to GGUF format, splitting the fused QKV vision tensors and the temporal patch embedding along the way, and inherits the Qwen 2 text model conversion logic.
Core Concepts
Tensor Name Mapping
The converter inherits Qwen 2 replacements and adds:
- visual -> v
- blocks -> blk
- attn.proj -> attn_out
- norm1 -> ln1
- norm2 -> ln2
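The vision-specific replacements can be illustrated with the standard library's strings.NewReplacer. This is a minimal sketch, not the converter's actual code: the function name renameVisionTensor and the sample tensor names are hypothetical, and the inherited Qwen 2 pairs are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// visionReplacer holds only the vision-specific pairs from this section.
// The inherited Qwen 2 pairs are omitted from this sketch.
var visionReplacer = strings.NewReplacer(
	"visual", "v",
	"blocks", "blk",
	"attn.proj", "attn_out",
	"norm1", "ln1",
	"norm2", "ln2",
)

// renameVisionTensor is a hypothetical helper that applies the pairs above
// to a source tensor name.
func renameVisionTensor(name string) string {
	return visionReplacer.Replace(name)
}

func main() {
	// Sample names are illustrative, not taken from a real checkpoint listing.
	for _, name := range []string{
		"visual.blocks.0.attn.proj.weight",
		"visual.blocks.3.norm1.weight",
	} {
		fmt.Printf("%s -> %s\n", name, renameVisionTensor(name))
	}
}
```

A single replacer scans the name once and never rewrites text it has already produced, so the short output prefix v cannot be matched again by a later pair.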
Architecture-Specific Hyperparameters
The GGUF metadata is written under the qwen25vl.* namespace, inheriting Qwen 2 text parameters (rewritten from qwen2. to qwen25vl.) plus:
Vision:
- qwen25vl.vision.block_count -- vision transformer depth (default 32)
- qwen25vl.vision.embedding_length -- vision hidden size
- qwen25vl.vision.attention.head_count -- vision attention heads (default 16)
- qwen25vl.vision.num_channels -- input channels
- qwen25vl.vision.patch_size -- patch size (default 14)
- qwen25vl.vision.spatial_merge_size -- spatial merge factor (default 2)
- qwen25vl.vision.spatial_patch_size -- spatial patch size
- qwen25vl.vision.window_size -- vision window attention size (default 112)
- qwen25vl.vision.attention.layer_norm_epsilon -- vision LayerNorm epsilon (default 1e-6)
- qwen25vl.vision.rope.freq_base -- vision RoPE theta (default 10000)
- qwen25vl.vision.fullatt_block_indexes -- indices of full attention blocks (default [7, 15, 23, 31])
- qwen25vl.vision.temporal_patch_size -- temporal patch size (default 2)
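One way to picture how the defaults apply is a helper that falls back to the documented value whenever the source config leaves a field unset. The sketch below assumes that a zero value means "unset"; the visionConfig struct, its field names, and the orDefault helper are hypothetical and cover only a few of the integer keys.

```go
package main

import "fmt"

// visionConfig is a hypothetical subset of the vision section of a model config.
type visionConfig struct {
	Depth             uint32
	NumHeads          uint32
	PatchSize         uint32
	SpatialMergeSize  uint32
	WindowSize        uint32
	TemporalPatchSize uint32
}

// orDefault returns def when v is the zero value of its type.
// Treating zero as "unset" is an assumption of this sketch.
func orDefault[T comparable](v, def T) T {
	var zero T
	if v == zero {
		return def
	}
	return v
}

// visionKV builds a few qwen25vl.vision.* entries using the defaults listed above.
func visionKV(c visionConfig) map[string]any {
	return map[string]any{
		"qwen25vl.vision.block_count":          orDefault(c.Depth, 32),
		"qwen25vl.vision.attention.head_count": orDefault(c.NumHeads, 16),
		"qwen25vl.vision.patch_size":           orDefault(c.PatchSize, 14),
		"qwen25vl.vision.spatial_merge_size":   orDefault(c.SpatialMergeSize, 2),
		"qwen25vl.vision.window_size":          orDefault(c.WindowSize, 112),
		"qwen25vl.vision.temporal_patch_size":  orDefault(c.TemporalPatchSize, 2),
	}
}

func main() {
	// Only the patch size is set; every other key falls back to its default.
	kv := visionKV(visionConfig{PatchSize: 16})
	fmt.Println(kv["qwen25vl.vision.patch_size"], kv["qwen25vl.vision.block_count"])
}
```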
Special Handling
Temporal Patch Embedding Splitting
The patch_embed.proj weight tensor is split along dimension 2 (temporal) into two separate tensors (patch_embd_0 and patch_embd_1), with the singleton temporal dimension removed from each resulting shape.
Fused QKV Splitting
Vision encoder attn.qkv tensors are split along dimension 0 into three equal parts for Q, K, and V using the splitDim utility.
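For a row-major tensor, splitting along dimension 0 amounts to taking three contiguous, equally sized slices of the underlying buffer. The sketch below shows that idea only; splitQKV is a hypothetical function and does not reproduce the signature or behavior of the splitDim utility.

```go
package main

import (
	"errors"
	"fmt"
)

// splitQKV divides a fused row-major tensor into Q, K and V along dimension 0.
// It is a hypothetical stand-in for the converter's splitDim utility.
func splitQKV(data []float32, shape []uint64) (parts [3][]float32, partShape []uint64, err error) {
	if len(shape) == 0 || shape[0]%3 != 0 || len(data)%3 != 0 {
		return parts, nil, errors.New("dimension 0 is not divisible by 3")
	}
	// Each part keeps the trailing dimensions and one third of dimension 0.
	partShape = append([]uint64{shape[0] / 3}, shape[1:]...)
	n := len(data) / 3
	for i := range parts {
		parts[i] = data[i*n : (i+1)*n] // shares memory with data, no copy
	}
	return parts, partShape, nil
}

func main() {
	// A toy fused tensor of shape [6, 2]: rows 0-1 are Q, 2-3 are K, 4-5 are V.
	data := make([]float32, 12)
	for i := range data {
		data[i] = float32(i)
	}
	parts, shape, err := splitQKV(data, []uint64{6, 2})
	if err != nil {
		panic(err)
	}
	fmt.Println(shape, parts[0], parts[1], parts[2])
}
```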
Inherited Qwen 2 Conversion
The text model conversion (including Q/K repacking, RoPE parameters, and standard tensor mapping) is inherited from the qwen2Model base struct. KV generation calls the Qwen 2 KV generator and then rewrites the namespace prefix of the resulting keys from qwen2. to qwen25vl.
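The prefix rewrite can be pictured as a pass over the key-value map produced by the base model. This is a minimal sketch under assumptions: rewriteNamespace is a hypothetical helper, the map type is a plain map[string]any, and the sample keys are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteNamespace returns a copy of kv in which every key starting with
// from has that prefix replaced by to. Other keys are copied unchanged.
func rewriteNamespace(kv map[string]any, from, to string) map[string]any {
	out := make(map[string]any, len(kv))
	for k, v := range kv {
		if strings.HasPrefix(k, from) {
			k = to + strings.TrimPrefix(k, from)
		}
		out[k] = v
	}
	return out
}

func main() {
	// Illustrative keys standing in for the output of the Qwen 2 KV generator.
	base := map[string]any{
		"qwen2.block_count":    uint32(28),
		"tokenizer.ggml.model": "gpt2",
	}
	kv := rewriteNamespace(base, "qwen2.", "qwen25vl.")
	fmt.Println(kv["qwen25vl.block_count"], kv["tokenizer.ggml.model"])
}
```

Matching on the prefix including its trailing dot keeps keys from unrelated namespaces untouched.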
Full Attention Block Indexes
The vision encoder uses windowed attention for most blocks but switches to full attention at specific layer indices. The default pattern is layers 7, 15, 23, and 31.
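A consumer of this metadata could decide the attention mode per block with a simple membership test. The sketch below is illustrative only; usesFullAttention and defaultFullAttn are hypothetical names, and the index list is the default pattern stated above.

```go
package main

import "fmt"

// defaultFullAttn is the default pattern of full attention block indices.
var defaultFullAttn = []int32{7, 15, 23, 31}

// usesFullAttention reports whether block appears in fullAttn; all other
// blocks are assumed to use windowed attention.
func usesFullAttention(block int32, fullAttn []int32) bool {
	for _, i := range fullAttn {
		if i == block {
			return true
		}
	}
	return false
}

func main() {
	for _, b := range []int32{6, 7, 8} {
		mode := "windowed"
		if usesFullAttention(b, defaultFullAttn) {
			mode = "full"
		}
		fmt.Printf("block %d: %s attention\n", b, mode)
	}
}
```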
Implementation Notes
The conversion is implemented in convert/convert_qwen25vl.go via the qwen25VLModel struct, which embeds qwen2Model. The splitDim and string replacer utilities are used extensively in the tensor splitting operations. The temporal patch embedding split removes the temporal dimension from the shape array using slices.DeleteFunc.