Principle:Ollama Ollama GGUF Model Conversion Qwen25Vl

Knowledge Sources	Ollama
Domains	Model Conversion, Multimodal
Last Updated	2025-02-15 00:00 GMT

Overview

Qwen 2.5 VL (Vision-Language) conversion handles the Qwen 2.5 architecture extended with a ViT-based vision encoder. It transforms the complete model from HuggingFace SafeTensors to GGUF format, splitting the fused QKV vision tensors and the temporal patch embedding along the way, and it inherits the Qwen 2 text model conversion logic.

Core Concepts

Tensor Name Mapping

The converter inherits Qwen 2 replacements and adds:

visual -> v
blocks -> blk
attn.proj -> attn_out
norm1 -> ln1
norm2 -> ln2
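
The renames above can be sketched with a standard-library string replacer. This is a minimal illustration rather than the converter's actual code: it covers only the vision-side pairs listed here (the inherited Qwen 2 replacements are left out), and the sample tensor names assume the usual HuggingFace naming.

```go
package main

import (
	"fmt"
	"strings"
)

// visionReplacer applies only the vision-side renames listed above.
// The inherited Qwen 2 replacements are deliberately left out of this sketch.
var visionReplacer = strings.NewReplacer(
	"visual", "v",
	"blocks", "blk",
	"attn.proj", "attn_out",
	"norm1", "ln1",
	"norm2", "ln2",
)

// rename maps a source tensor name to its converted form.
func rename(name string) string {
	return visionReplacer.Replace(name)
}

func main() {
	// Sample names are assumptions based on the usual HuggingFace layout.
	fmt.Println(rename("visual.blocks.0.attn.proj.weight")) // v.blk.0.attn_out.weight
	fmt.Println(rename("visual.blocks.0.norm1.weight"))     // v.blk.0.ln1.weight
}
```

Because the pairs are applied in a single pass, a renamed fragment is never rewritten a second time by a later pair.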

Architecture-Specific Hyperparameters

The GGUF metadata is written under the qwen25vl.* namespace, inheriting Qwen 2 text parameters (rewritten from qwen2. to qwen25vl.) plus:

Vision:

qwen25vl.vision.block_count -- vision transformer depth (default 32)
qwen25vl.vision.embedding_length -- vision hidden size
qwen25vl.vision.attention.head_count -- vision attention heads (default 16)
qwen25vl.vision.num_channels -- input channels
qwen25vl.vision.patch_size -- patch size (default 14)
qwen25vl.vision.spatial_merge_size -- spatial merge factor (default 2)
qwen25vl.vision.spatial_patch_size -- spatial patch size
qwen25vl.vision.window_size -- vision window attention size (default 112)
qwen25vl.vision.attention.layer_norm_epsilon -- vision LayerNorm epsilon (default 1e-6)
qwen25vl.vision.rope.freq_base -- vision RoPE theta (default 10000)
qwen25vl.vision.fullatt_block_indexes -- indices of full attention blocks (default [7, 15, 23, 31])
qwen25vl.vision.temporal_patch_size -- temporal patch size (default 2)
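
One way to picture the defaulting behaviour is a helper that falls back to the documented default when a config field is absent. The sketch below is an assumption about shape only: visionConfig, orDefault, and visionKV are hypothetical names, a zero value is taken to mean "not set", and the float-valued keys are omitted for brevity.

```go
package main

import "fmt"

// visionConfig is a hypothetical stand-in for the vision section of the
// source config. A zero value means the field was absent.
type visionConfig struct {
	Depth             uint32
	NumHeads          uint32
	PatchSize         uint32
	SpatialMergeSize  uint32
	WindowSize        uint32
	TemporalPatchSize uint32
	FullAttBlocks     []int32
}

// orDefault returns v unless it is zero, in which case it returns def.
func orDefault(v, def uint32) uint32 {
	if v == 0 {
		return def
	}
	return v
}

// visionKV builds the integer-valued vision keys, applying the defaults
// listed above. Float-valued keys are omitted to keep the sketch short.
func visionKV(c visionConfig) map[string]any {
	kv := map[string]any{
		"qwen25vl.vision.block_count":          orDefault(c.Depth, 32),
		"qwen25vl.vision.attention.head_count": orDefault(c.NumHeads, 16),
		"qwen25vl.vision.patch_size":           orDefault(c.PatchSize, 14),
		"qwen25vl.vision.spatial_merge_size":   orDefault(c.SpatialMergeSize, 2),
		"qwen25vl.vision.window_size":          orDefault(c.WindowSize, 112),
		"qwen25vl.vision.temporal_patch_size":  orDefault(c.TemporalPatchSize, 2),
	}
	fullAtt := c.FullAttBlocks
	if len(fullAtt) == 0 {
		fullAtt = []int32{7, 15, 23, 31}
	}
	kv["qwen25vl.vision.fullatt_block_indexes"] = fullAtt
	return kv
}

func main() {
	// Only Depth is set; every other key falls back to its default.
	kv := visionKV(visionConfig{Depth: 24})
	fmt.Println(kv["qwen25vl.vision.block_count"], kv["qwen25vl.vision.patch_size"])
}
```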

Special Handling

Temporal Patch Embedding Splitting

The patch_embed.proj weight tensor is split along dimension 2 (temporal) into two separate tensors (patch_embd_0 and patch_embd_1), with the singleton temporal dimension removed from each resulting shape.
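
The following toy example illustrates the split and the shape clean-up. It is a sketch under stated assumptions, not the converter's splitDim: splitAxis and squeeze are hypothetical helpers, the data is assumed to be row-major, the shape is a tiny stand-in for [out, channels, temporal, height, width], and the squeeze predicate is assumed to drop every dimension equal to 1.

```go
package main

import (
	"fmt"
	"slices"
)

// splitAxis splits row-major data of the given shape into n equal parts
// along axis. It assumes shape[axis] is divisible by n.
func splitAxis(data []float32, shape []uint64, axis, n int) ([][]float32, []uint64) {
	outer, inner := 1, 1
	for _, d := range shape[:axis] {
		outer *= int(d)
	}
	for _, d := range shape[axis+1:] {
		inner *= int(d)
	}
	part := int(shape[axis]) / n
	chunk := part * inner

	parts := make([][]float32, n)
	for o := 0; o < outer; o++ {
		base := o * int(shape[axis]) * inner
		for p := 0; p < n; p++ {
			parts[p] = append(parts[p], data[base+p*chunk:base+(p+1)*chunk]...)
		}
	}

	partShape := slices.Clone(shape)
	partShape[axis] = uint64(part)
	return parts, partShape
}

// squeeze drops every dimension of size 1 from a copy of shape.
func squeeze(shape []uint64) []uint64 {
	return slices.DeleteFunc(slices.Clone(shape), func(d uint64) bool { return d == 1 })
}

func main() {
	shape := []uint64{2, 3, 2, 2, 2} // toy [out, channels, temporal, height, width]
	data := make([]float32, 48)
	for i := range data {
		data[i] = float32(i)
	}

	parts, partShape := splitAxis(data, shape, 2, 2)
	fmt.Println(len(parts), partShape, squeeze(partShape))
}
```

Note that a predicate of this kind removes any singleton dimension, not only the temporal one, so it relies on the remaining dimensions being larger than 1.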

Fused QKV Splitting

Vision encoder attn.qkv tensors are split along dimension 0 into three equal parts for Q, K, and V using the splitDim utility.
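
For a row-major tensor, splitting along dimension 0 into three equal parts amounts to taking three contiguous slices. The sketch below shows that idea on a flat buffer; splitQKV is a hypothetical helper, and the row-major layout is an assumption rather than something taken from the splitDim source.

```go
package main

import "fmt"

// splitQKV splits a fused row-major [rows, cols] weight, where rows is three
// times the per-projection row count, into Q, K, and V views of equal size.
func splitQKV(fused []float32, rows, cols int) (q, k, v []float32, err error) {
	if rows%3 != 0 || len(fused) != rows*cols {
		return nil, nil, nil, fmt.Errorf("unexpected fused QKV shape [%d, %d] for %d values", rows, cols, len(fused))
	}
	n := rows / 3 * cols // number of values in each projection
	return fused[:n], fused[n : 2*n], fused[2*n:], nil
}

func main() {
	// Toy fused tensor with shape [6, 2]: two rows each for Q, K, and V.
	fused := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
	q, k, v, err := splitQKV(fused, 6, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(q, k, v)
}
```

The returned slices share memory with the fused buffer, which is fine for a read-only illustration; a converter that writes each part out separately would not need a copy either.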

Inherited Qwen 2 Conversion

The text model conversion (including Q/K repacking, RoPE parameters, and standard tensor mapping) is inherited from the qwen2Model base struct. The KV generation calls the Qwen 2 KV generator and rewrites the namespace prefix.
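
The embedding and namespace rewrite can be illustrated as follows. The types here (KV, textModel, vlModel) and the sample keys are hypothetical stand-ins chosen for the sketch; only the qwen2. to qwen25vl. prefix rewrite comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// KV is a simplified metadata map for this sketch.
type KV map[string]any

// textModel is a hypothetical stand-in for the Qwen 2 base converter.
type textModel struct{ blockCount uint32 }

// KV returns text-side metadata under the qwen2. namespace.
// The keys shown are illustrative only.
func (m textModel) KV() KV {
	return KV{
		"qwen2.block_count":    m.blockCount,
		"qwen2.rope.freq_base": float32(1e6),
	}
}

// vlModel embeds textModel, so it inherits the text-side behaviour and can
// still call the embedded KV generator explicitly.
type vlModel struct{ textModel }

// KV calls the embedded generator and rewrites the namespace prefix.
func (m vlModel) KV() KV {
	kv := KV{}
	for k, v := range m.textModel.KV() {
		if rest, ok := strings.CutPrefix(k, "qwen2."); ok {
			kv["qwen25vl."+rest] = v
		} else {
			kv[k] = v // keys outside the qwen2. namespace pass through
		}
	}
	return kv
}

func main() {
	kv := vlModel{textModel{blockCount: 28}}.KV()
	fmt.Println(kv["qwen25vl.block_count"], len(kv))
}
```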

Full Attention Block Indexes

The vision encoder uses windowed attention for most blocks but switches to full attention at specific layer indices. The default pattern is layers 7, 15, 23, and 31.
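
As an illustration of how a consumer of this metadata might use the list, the sketch below checks whether a given block index appears in it and falls back to the default pattern when no list is supplied. usesFullAttention is a hypothetical helper, not part of the converter.

```go
package main

import (
	"fmt"
	"slices"
)

// defaultFullAttention is the default pattern described above.
var defaultFullAttention = []int32{7, 15, 23, 31}

// usesFullAttention reports whether the block at the given index uses full
// attention. An empty list falls back to the default pattern.
func usesFullAttention(block int32, fullAtt []int32) bool {
	if len(fullAtt) == 0 {
		fullAtt = defaultFullAttention
	}
	return slices.Contains(fullAtt, block)
}

func main() {
	// Walk a 32-block encoder and report the attention type per block.
	for i := int32(0); i < 32; i++ {
		kind := "windowed"
		if usesFullAttention(i, nil) {
			kind = "full"
		}
		fmt.Printf("block %d: %s\n", i, kind)
	}
}
```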

Implementation Notes

The conversion is implemented in convert/convert_qwen25vl.go via the qwen25VLModel struct, which embeds qwen2Model. Tensor splitting relies on the splitDim utility together with string replacers. After the temporal patch embedding split, the temporal dimension is removed from each shape array using slices.DeleteFunc.

Related Pages

Implementation:Ollama_Ollama_Convert_Qwen25Vl
