Principle:Ollama Ollama GGUF Model Conversion Qwen3Vl
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, Multimodal |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Qwen 3 VL (Vision-Language) conversion handles the multimodal extension of the Qwen 3 architecture, which combines a ViT-based vision encoder with the Qwen 3 text model and adds deepstack visual merging. The converter transforms the complete model from HuggingFace SafeTensors to GGUF format, splitting fused QKV vision tensors and reshaping patch embeddings along the way.
Core Concepts
Tensor Name Mapping
The converter inherits Qwen 3 replacements and adds:
- model.language_ -> (stripped, removing the language_ prefix)
- model.visual -> v
- patch_embed.proj -> patch_embed
- blocks -> blk
- attn.qkv -> attn_qkv
- attn.proj -> attn_out
- deepstack_merger_list -> deepstack_merger
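The replacements listed above can be sketched with Go's standard strings.NewReplacer. This is a minimal illustration rather than the converter's actual code: renameTensor and visionReplacer are hypothetical names, the sample tensor names are illustrative, and the inherited Qwen 3 text replacements are deliberately omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// visionReplacer holds only the Qwen 3 VL additions from the list above.
// The inherited Qwen 3 text replacements are intentionally left out.
var visionReplacer = strings.NewReplacer(
	"model.language_", "",
	"model.visual", "v",
	"patch_embed.proj", "patch_embed",
	"blocks", "blk",
	"attn.qkv", "attn_qkv",
	"attn.proj", "attn_out",
	"deepstack_merger_list", "deepstack_merger",
)

// renameTensor is a hypothetical helper that applies the replacements
// to a single HuggingFace tensor name in one pass.
func renameTensor(name string) string {
	return visionReplacer.Replace(name)
}

func main() {
	names := []string{
		"model.visual.blocks.0.attn.qkv.weight",
		"model.visual.patch_embed.proj.weight",
		"model.language_model.layers.0.mlp.down_proj.weight",
	}
	for _, name := range names {
		fmt.Printf("%s -> %s\n", name, renameTensor(name))
	}
}
```

Because the replacer scans the input once and never rescans its own output, stripping model.language_ leaves the following model. prefix in place for the inherited text replacements to handle.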
Architecture-Specific Hyperparameters
The GGUF metadata uses a dynamic architecture prefix (qwen3vl for dense, qwen3vlmoe for MoE), inheriting all Qwen 3 text parameters plus:
Vision:
- vision.block_count -- vision transformer depth (default 32)
- vision.embedding_length -- vision hidden size
- vision.attention.head_count -- vision attention heads (default 16)
- vision.num_channels -- input channels
- vision.patch_size -- patch size (default 14)
- vision.spatial_merge_size -- spatial merge factor (default 2)
- vision.attention.layer_norm_epsilon -- vision RMSNorm epsilon (default 1e-6)
- vision.rope.freq_base -- vision RoPE theta (default 10000)
- vision.temporal_patch_size -- temporal patch size (default 2)
- vision.deepstack_visual_indexes -- indices for deepstack visual feature merging
Preprocessor:
- vision.shortest_edge / longest_edge -- image size constraints
- vision.image_mean / image_std -- normalization parameters
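As a rough sketch of how the prefixed keys and the defaults listed above fit together, the following builds a key-value map for a subset of the vision parameters. The visionParams struct, its field names, and the visionKV and orDefault helpers are assumptions made for illustration; only the key suffixes and default values come from the lists above.

```go
package main

import "fmt"

// visionParams is a hypothetical subset of the vision configuration.
// A zero value stands for "not set in the source config".
type visionParams struct {
	Depth             uint32
	NumHeads          uint32
	PatchSize         uint32
	SpatialMergeSize  uint32
	TemporalPatchSize uint32
	RMSNormEps        float32
	RopeTheta         float32
}

// orDefault returns def when v is the zero value of its type.
func orDefault[T comparable](v, def T) T {
	var zero T
	if v == zero {
		return def
	}
	return v
}

// visionKV builds metadata entries under "<arch>.vision.", filling in
// the documented defaults for any parameter that was left unset.
func visionKV(arch string, p visionParams) map[string]any {
	prefix := arch + ".vision."
	return map[string]any{
		prefix + "block_count":                  orDefault(p.Depth, 32),
		prefix + "attention.head_count":         orDefault(p.NumHeads, 16),
		prefix + "patch_size":                   orDefault(p.PatchSize, 14),
		prefix + "spatial_merge_size":           orDefault(p.SpatialMergeSize, 2),
		prefix + "temporal_patch_size":          orDefault(p.TemporalPatchSize, 2),
		prefix + "attention.layer_norm_epsilon": orDefault(p.RMSNormEps, 1e-6),
		prefix + "rope.freq_base":               orDefault(p.RopeTheta, 10000),
	}
}

func main() {
	// Only the patch size is set here; every other key falls back to its default.
	kv := visionKV("qwen3vl", visionParams{PatchSize: 16})
	for key, value := range kv {
		fmt.Printf("%s = %v\n", key, value)
	}
}
```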
Special Handling
Fused QKV Splitting
Vision encoder attn_qkv tensors are split along dimension 0 into three equal parts for Q, K, and V using the splitDim utility with string replacers.
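The split along dimension 0 can be illustrated on a row-major tensor of shape [rows, cols], where the first third of the rows is Q, the second K, and the third V. The sketch below is not the splitDim utility itself: splitDim0 is a hypothetical stand-in, and the output names attn_q, attn_k, and attn_v are assumed for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// splitDim0 splits a row-major tensor of shape [rows, cols] into `parts`
// equal chunks along dimension 0. The chunks alias the input slice.
func splitDim0(data []float32, rows, cols, parts int) ([][]float32, error) {
	if parts <= 0 || rows%parts != 0 {
		return nil, fmt.Errorf("%d rows cannot be split into %d equal parts", rows, parts)
	}
	if len(data) != rows*cols {
		return nil, fmt.Errorf("data length %d does not match shape [%d, %d]", len(data), rows, cols)
	}
	chunk := rows / parts * cols
	out := make([][]float32, parts)
	for i := range out {
		out[i] = data[i*chunk : (i+1)*chunk]
	}
	return out, nil
}

func main() {
	// A toy fused tensor of shape [6, 2]: two rows each for Q, K, and V.
	fused := []float32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
	parts, err := splitDim0(fused, 6, 2, 3)
	if err != nil {
		panic(err)
	}

	// Assumed output names; one string replacer per output tensor.
	name := "v.blk.0.attn_qkv.weight"
	for i, suffix := range []string{"attn_q", "attn_k", "attn_v"} {
		renamed := strings.NewReplacer("attn_qkv", suffix).Replace(name)
		fmt.Println(renamed, parts[i])
	}
}
```

Splitting along the leading dimension of a row-major tensor needs no data movement, which is why each part can simply be a sub-slice of the fused buffer.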
Patch Embedding Reshaping
Patch embedding weight tensors have their first two dimensions merged: shape [out_ch, in_ch, H, W] becomes [out_ch * in_ch, H, W]. This flattening is applied to tensors matching the patch_embed pattern with a weight suffix.
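The shape change can be sketched as follows. The helper names isPatchEmbedWeight and mergeLeadingDims are hypothetical, and the example shape is illustrative rather than taken from a specific checkpoint.

```go
package main

import (
	"fmt"
	"strings"
)

// isPatchEmbedWeight reports whether a tensor name matches the patch_embed
// pattern with a weight suffix, as described above.
func isPatchEmbedWeight(name string) bool {
	return strings.Contains(name, "patch_embed") && strings.HasSuffix(name, "weight")
}

// mergeLeadingDims folds the first two dimensions into one:
// [out_ch, in_ch, H, W] becomes [out_ch * in_ch, H, W].
func mergeLeadingDims(shape []uint64) ([]uint64, error) {
	if len(shape) < 2 {
		return nil, fmt.Errorf("need at least 2 dimensions, got %d", len(shape))
	}
	merged := make([]uint64, 0, len(shape)-1)
	merged = append(merged, shape[0]*shape[1])
	merged = append(merged, shape[2:]...)
	return merged, nil
}

func main() {
	name := "v.patch_embed.weight"
	shape := []uint64{1152, 3, 14, 14} // illustrative [out_ch, in_ch, H, W]

	if isPatchEmbedWeight(name) {
		merged, err := mergeLeadingDims(shape)
		if err != nil {
			panic(err)
		}
		fmt.Println(name, shape, "->", merged)
	}
}
```

For a row-major buffer, merging the two leading dimensions changes only the recorded shape; the element order stays the same.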
Dynamic MoE Architecture
Like the base Qwen 3 converter, the architecture identifier dynamically switches between qwen3vl and qwen3vlmoe based on whether MoE parameters are configured.
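A minimal sketch of the switch, assuming that "MoE parameters are configured" means a non-zero expert count. The numExperts parameter and both helper names are hypothetical; the source does not state which fields the converter actually inspects.

```go
package main

import "fmt"

// architecture picks the GGUF architecture identifier.
// Assumption: a non-zero expert count marks a MoE checkpoint.
func architecture(numExperts uint32) string {
	if numExperts > 0 {
		return "qwen3vlmoe"
	}
	return "qwen3vl"
}

// metadataKey prefixes a key suffix with the architecture identifier.
func metadataKey(arch, suffix string) string {
	return arch + "." + suffix
}

func main() {
	for _, experts := range []uint32{0, 128} {
		arch := architecture(experts)
		fmt.Println(experts, "experts:", metadataKey(arch, "vision.block_count"))
	}
}
```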
Preprocessor Config
The converter reads preprocessor_config.json to extract image normalization parameters (mean, std), size constraints, and deepstack visual indexes, which are stored in the GGUF metadata.
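Reading the normalization and size fields might look like the sketch below, which uses Go's encoding/json. The JSON key names (size.shortest_edge, size.longest_edge, image_mean, image_std) are assumptions modelled on common HuggingFace preprocessor files, the sample values are illustrative, and the deepstack indexes are omitted because their key is not given here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// preprocessorConfig is a hypothetical subset of preprocessor_config.json.
type preprocessorConfig struct {
	Size struct {
		ShortestEdge uint32 `json:"shortest_edge"`
		LongestEdge  uint32 `json:"longest_edge"`
	} `json:"size"`
	ImageMean []float32 `json:"image_mean"`
	ImageStd  []float32 `json:"image_std"`
}

// parsePreprocessor decodes the raw file contents into the struct above.
func parsePreprocessor(raw []byte) (preprocessorConfig, error) {
	var cfg preprocessorConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cfg, fmt.Errorf("parse preprocessor config: %w", err)
	}
	return cfg, nil
}

func main() {
	// Illustrative contents; a real converter would read the file from disk.
	raw := []byte(`{
		"size": {"shortest_edge": 65536, "longest_edge": 16777216},
		"image_mean": [0.5, 0.5, 0.5],
		"image_std": [0.5, 0.5, 0.5]
	}`)

	cfg, err := parsePreprocessor(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("shortest_edge:", cfg.Size.ShortestEdge)
	fmt.Println("longest_edge:", cfg.Size.LongestEdge)
	fmt.Println("image_mean:", cfg.ImageMean)
	fmt.Println("image_std:", cfg.ImageStd)
}
```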
Inherited Qwen 3 Processing
Non-vision tensors that are not QKV or patch embed are passed through to the base qwen3Model.Tensors() method, inheriting all MoE expert tensor handling (gate-up splitting, transposition) from the Qwen 3 converter.
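The routing described above can be sketched as a simple classification by tensor name. The action type and the classify helper are hypothetical, and the matching rules are an assumption based on the renamed patterns (attn_qkv, and patch_embed with a weight suffix); the real method may test names differently.

```go
package main

import (
	"fmt"
	"strings"
)

// action names the processing path chosen for a tensor.
type action int

const (
	passToBase action = iota // delegated to the base Qwen 3 handling
	splitQKV                 // fused vision QKV, split into three tensors
	reshapePatchEmbed        // patch embedding weight, leading dims merged
)

// classify picks a processing path from the converted tensor name.
func classify(name string) action {
	switch {
	case strings.Contains(name, "attn_qkv"):
		return splitQKV
	case strings.Contains(name, "patch_embed") && strings.HasSuffix(name, "weight"):
		return reshapePatchEmbed
	default:
		return passToBase
	}
}

func main() {
	labels := map[action]string{
		passToBase:        "pass to base",
		splitQKV:          "split QKV",
		reshapePatchEmbed: "reshape patch embed",
	}
	for _, name := range []string{
		"v.blk.0.attn_qkv.weight",
		"v.patch_embed.weight",
		"blk.0.ffn_gate_exps.weight",
	} {
		fmt.Printf("%-28s %s\n", name, labels[classify(name)])
	}
}
```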
Deepstack Visual Merging
The deepstack_visual_indexes parameter specifies which intermediate vision transformer layers contribute features. These features are merged for the multimodal projector, enabling multi-scale visual representation.
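To make the role of the index list concrete, the sketch below only selects the outputs of the listed layers. It is a conceptual illustration, not runtime or converter code: selectLayers is a hypothetical helper, the per-layer outputs are toy values, the indexes are illustrative, and the merge step itself is not shown.

```go
package main

import "fmt"

// selectLayers returns the outputs of the layers named by indexes,
// in the order the indexes are given.
func selectLayers(outputs [][]float32, indexes []int) ([][]float32, error) {
	picked := make([][]float32, 0, len(indexes))
	for _, idx := range indexes {
		if idx < 0 || idx >= len(outputs) {
			return nil, fmt.Errorf("deepstack index %d out of range for %d layers", idx, len(outputs))
		}
		picked = append(picked, outputs[idx])
	}
	return picked, nil
}

func main() {
	// Toy per-layer outputs: layer i produces the single feature value i.
	outputs := make([][]float32, 32)
	for i := range outputs {
		outputs[i] = []float32{float32(i)}
	}

	// Illustrative indexes; real values come from the model configuration.
	picked, err := selectLayers(outputs, []int{8, 16, 24})
	if err != nil {
		panic(err)
	}
	fmt.Println(picked)
}
```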
Implementation Notes
The conversion is implemented in convert/convert_qwen3vl.go by the qwen3VLModel struct, which embeds qwen3Model under the text_config JSON tag. The parseMore method reads preprocessor_config.json and unmarshals it into the VisionModel struct. Tensor processing separates vision-specific operations (QKV split, patch embed reshape) from text operations, which are delegated to the parent.