Principle:Ollama Ollama GGUF Model Conversion Qwen25Vl

From Leeroopedia
Knowledge Sources
Domains: Model Conversion, Multimodal
Last Updated: 2025-02-15 00:00 GMT

Overview

Qwen 2.5 VL (Vision-Language) conversion handles the Qwen 2.5 architecture extended with a ViT-based vision encoder. It transforms the complete model from Hugging Face SafeTensors to GGUF format. Along the way it splits the fused QKV vision tensors, splits the temporal patch embedding, and reuses the Qwen 2 text model conversion logic.

Core Concepts

Tensor Name Mapping

The converter inherits Qwen 2 replacements and adds:

  • visual -> v
  • blocks -> blk
  • attn.proj -> attn_out
  • norm1 -> ln1
  • norm2 -> ln2
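
The vision-specific replacements listed above can be illustrated with Go's strings.NewReplacer. This is a minimal sketch, not the converter's own code: renameTensor is a hypothetical helper, the sample tensor names are illustrative, and the inherited Qwen 2 replacements are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// visionReplacer applies only the five vision-specific substitutions from the
// list above. The inherited Qwen 2 replacements are omitted in this sketch.
var visionReplacer = strings.NewReplacer(
	"visual", "v",
	"blocks", "blk",
	"attn.proj", "attn_out",
	"norm1", "ln1",
	"norm2", "ln2",
)

// renameTensor is a hypothetical helper that maps a source tensor name to its
// GGUF-style name using the replacer above.
func renameTensor(name string) string {
	return visionReplacer.Replace(name)
}

func main() {
	// Illustrative tensor names, not read from a real checkpoint.
	for _, name := range []string{
		"visual.blocks.0.attn.proj.weight",
		"visual.blocks.0.norm1.weight",
		"visual.blocks.31.norm2.bias",
	} {
		fmt.Printf("%s -> %s\n", name, renameTensor(name))
	}
}
```

A single replacer scans each name once and never rewrites its own output, so the order of the pairs only matters when two patterns could match at the same position.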

Architecture-Specific Hyperparameters

The GGUF metadata is written under the qwen25vl.* namespace, inheriting Qwen 2 text parameters (rewritten from qwen2. to qwen25vl.) plus:

Vision:

  • qwen25vl.vision.block_count -- vision transformer depth (default 32)
  • qwen25vl.vision.embedding_length -- vision hidden size
  • qwen25vl.vision.attention.head_count -- vision attention heads (default 16)
  • qwen25vl.vision.num_channels -- input channels
  • qwen25vl.vision.patch_size -- patch size (default 14)
  • qwen25vl.vision.spatial_merge_size -- spatial merge factor (default 2)
  • qwen25vl.vision.spatial_patch_size -- spatial patch size
  • qwen25vl.vision.window_size -- vision window attention size (default 112)
  • qwen25vl.vision.attention.layer_norm_epsilon -- vision LayerNorm epsilon (default 1e-6)
  • qwen25vl.vision.rope.freq_base -- vision RoPE theta (default 10000)
  • qwen25vl.vision.fullatt_block_indexes -- indices of full attention blocks (default [7, 15, 23, 31])
  • qwen25vl.vision.temporal_patch_size -- temporal patch size (default 2)
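
One way to apply the defaults listed above is to fall back to them whenever the source configuration leaves a field at its zero value. The sketch below assumes Go 1.22 or later for cmp.Or. The visionConfig struct, its field names, and the visionKV helper are hypothetical and cover only the keys that have a documented default.

```go
package main

import (
	"cmp"
	"fmt"
)

// visionConfig is a hypothetical, trimmed-down stand-in for the vision part of
// the source model configuration. The field names are assumptions.
type visionConfig struct {
	Depth               int
	NumHeads            int
	PatchSize           int
	SpatialMergeSize    int
	TemporalPatchSize   int
	WindowSize          int
	FullAttBlockIndexes []int
}

// visionKV builds a subset of the qwen25vl.vision.* metadata, substituting the
// documented default for every field that is unset (zero or empty).
func visionKV(c visionConfig) map[string]any {
	fullAtt := c.FullAttBlockIndexes
	if len(fullAtt) == 0 {
		fullAtt = []int{7, 15, 23, 31}
	}
	return map[string]any{
		"qwen25vl.vision.block_count":           cmp.Or(c.Depth, 32),
		"qwen25vl.vision.attention.head_count":  cmp.Or(c.NumHeads, 16),
		"qwen25vl.vision.patch_size":            cmp.Or(c.PatchSize, 14),
		"qwen25vl.vision.spatial_merge_size":    cmp.Or(c.SpatialMergeSize, 2),
		"qwen25vl.vision.temporal_patch_size":   cmp.Or(c.TemporalPatchSize, 2),
		"qwen25vl.vision.window_size":           cmp.Or(c.WindowSize, 112),
		"qwen25vl.vision.fullatt_block_indexes": fullAtt,
	}
}

func main() {
	// Only Depth is set here, so every other key takes its default.
	kv := visionKV(visionConfig{Depth: 24})
	fmt.Println(kv["qwen25vl.vision.block_count"])
	fmt.Println(kv["qwen25vl.vision.patch_size"])
	fmt.Println(kv["qwen25vl.vision.fullatt_block_indexes"])
}
```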

Special Handling

Temporal Patch Embedding Splitting

The patch_embed.proj weight tensor is split along dimension 2 (temporal) into two separate tensors (patch_embd_0 and patch_embd_1), with the singleton temporal dimension removed from each resulting shape.
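
The split can be sketched on a flat, row-major buffer. This is an illustration under stated assumptions, not the converter's code. splitTemporal is a hypothetical helper, the demo tensor is tiny, and the slices.DeleteFunc predicate that drops dimensions equal to 1 is an assumption about how the singleton dimension is removed. That predicate would also drop any other dimension of extent 1.

```go
package main

import (
	"fmt"
	"slices"
)

// splitTemporal cuts a row-major tensor into shape[dim] slabs of extent 1
// along dim. It returns the slabs and their shared shape with singleton
// dimensions removed.
func splitTemporal(data []float32, shape []int, dim int) ([][]float32, []int) {
	outer, inner := 1, 1
	for _, s := range shape[:dim] {
		outer *= s
	}
	for _, s := range shape[dim+1:] {
		inner *= s
	}
	n := shape[dim]

	parts := make([][]float32, n)
	for t := range parts {
		parts[t] = make([]float32, 0, outer*inner)
		for o := 0; o < outer; o++ {
			// Element block for outer index o and temporal index t.
			start := (o*n + t) * inner
			parts[t] = append(parts[t], data[start:start+inner]...)
		}
	}

	// After the split each slab has extent 1 along dim; drop that dimension.
	partShape := slices.Clone(shape)
	partShape[dim] = 1
	partShape = slices.DeleteFunc(partShape, func(d int) bool { return d == 1 })
	return parts, partShape
}

func main() {
	// Tiny stand-in for a 5-D weight with a temporal dimension of size 2.
	shape := []int{2, 3, 2, 2, 2}
	data := make([]float32, 48)
	for i := range data {
		data[i] = float32(i)
	}

	parts, partShape := splitTemporal(data, shape, 2)
	fmt.Println("slabs:", len(parts), "shape per slab:", partShape)
	fmt.Println("slab 0 starts with:", parts[0][:8])
	fmt.Println("slab 1 starts with:", parts[1][:8])
}
```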

Fused QKV Splitting

Vision encoder attn.qkv tensors are split along dimension 0 into three equal parts for Q, K, and V using the splitDim utility.
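
In a row-major layout, splitting along dimension 0 amounts to cutting the buffer into three contiguous chunks. The sketch below shows that idea with a hypothetical splitQKV helper. It is a stand-alone illustration and does not reproduce the splitDim utility or its signature.

```go
package main

import (
	"fmt"
	"slices"
)

// splitQKV cuts a row-major fused tensor into three equal parts along
// dimension 0 and returns them in Q, K, V order with the per-part shape.
func splitQKV(data []float32, shape []int) (parts [3][]float32, partShape []int, err error) {
	if len(shape) == 0 || shape[0]%3 != 0 {
		return parts, nil, fmt.Errorf("dim 0 of shape %v is not divisible by 3", shape)
	}
	total := 1
	for _, s := range shape {
		total *= s
	}
	if total != len(data) {
		return parts, nil, fmt.Errorf("shape %v needs %d elements, got %d", shape, total, len(data))
	}

	// Dimension 0 is the slowest-varying one, so each part is contiguous.
	n := len(data) / 3
	for i := range parts {
		parts[i] = data[i*n : (i+1)*n]
	}
	partShape = slices.Clone(shape)
	partShape[0] /= 3
	return parts, partShape, nil
}

func main() {
	// Tiny stand-in for a fused weight of shape [3*hidden, hidden], hidden = 2.
	data := make([]float32, 12)
	for i := range data {
		data[i] = float32(i)
	}
	parts, partShape, err := splitQKV(data, []int{6, 2})
	if err != nil {
		panic(err)
	}
	fmt.Println("per-part shape:", partShape)
	fmt.Println("Q:", parts[0])
	fmt.Println("K:", parts[1])
	fmt.Println("V:", parts[2])
}
```

The same helper works for a one-dimensional fused bias, since only the leading dimension is divided.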

Inherited Qwen 2 Conversion

The text model conversion (including Q/K repacking, RoPE parameters, and standard tensor mapping) is inherited from the qwen2Model base struct. KV generation first calls the Qwen 2 KV generator and then rewrites the namespace prefix of the resulting keys from qwen2. to qwen25vl.
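
The prefix rewrite can be sketched as a pass over a key-value map. rewritePrefix is a hypothetical helper, and the keys and values in main are placeholders rather than real model metadata. The sketch assumes Go 1.20 or later for strings.CutPrefix.

```go
package main

import (
	"fmt"
	"strings"
)

// rewritePrefix returns a copy of kv in which every key that starts with
// oldPrefix has that prefix replaced by newPrefix. Other keys pass through.
func rewritePrefix(kv map[string]any, oldPrefix, newPrefix string) map[string]any {
	out := make(map[string]any, len(kv))
	for k, v := range kv {
		if rest, ok := strings.CutPrefix(k, oldPrefix); ok {
			k = newPrefix + rest
		}
		out[k] = v
	}
	return out
}

func main() {
	// Placeholder metadata as a text-model KV generator might produce it.
	textKV := map[string]any{
		"qwen2.block_count":          28,
		"qwen2.attention.head_count": 28,
		"general.name":               "demo",
	}

	kv := rewritePrefix(textKV, "qwen2.", "qwen25vl.")
	fmt.Println(kv["qwen25vl.block_count"])
	fmt.Println(kv["qwen25vl.attention.head_count"])
	fmt.Println(kv["general.name"])
}
```

Building a new map avoids inserting and deleting keys while ranging over the same map.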

Full Attention Block Indexes

The vision encoder uses windowed attention for most blocks but switches to full attention at specific layer indices. The default pattern is layers 7, 15, 23, and 31.
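
A per-block lookup against the index list is enough to choose the attention type. The sketch below assumes zero-based block indices, which the default depth of 32 and the last default index of 31 suggest. attentionKind is a hypothetical helper, not part of the converter.

```go
package main

import (
	"fmt"
	"slices"
)

// attentionKind reports which attention variant a vision block would use,
// given the list of full attention block indexes.
func attentionKind(layer int, fullAtt []int) string {
	if slices.Contains(fullAtt, layer) {
		return "full"
	}
	return "windowed"
}

func main() {
	fullAtt := []int{7, 15, 23, 31} // documented default pattern
	for layer := 0; layer < 32; layer++ {
		fmt.Printf("block %2d: %s\n", layer, attentionKind(layer, fullAtt))
	}
}
```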

Implementation Notes

The conversion is implemented in convert/convert_qwen25vl.go via the qwen25VLModel struct, which embeds qwen2Model. The splitDim and string replacer utilities are used extensively in the tensor splitting operations. The temporal patch embedding split removes the temporal dimension from the shape array using slices.DeleteFunc.
