Principle:Ollama Ollama GGUF Model Conversion Qwen3Vl

Knowledge Sources	Ollama
Domains	Model Conversion, Multimodal
Last Updated	2025-02-15 00:00 GMT

Overview

Qwen 3 VL (Vision-Language) conversion handles the multimodal extension of the Qwen 3 architecture, which combines a ViT-based vision encoder with the Qwen 3 text model and adds deepstack visual merging. The converter transforms the complete model from HuggingFace SafeTensors to GGUF format, splitting fused QKV vision tensors and reshaping patch embeddings along the way.

Core Concepts

Tensor Name Mapping

The converter inherits Qwen 3 replacements and adds:

model.language_ -> (stripped, removing the language_ prefix)
model.visual -> v
patch_embed.proj -> patch_embed
blocks -> blk
attn.qkv -> attn_qkv
attn.proj -> attn_out
deepstack_merger_list -> deepstack_merger
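
As an illustration of the mapping above, the following minimal sketch applies only the additional replacements with the standard library's strings.NewReplacer. The helper name renameTensor and the sample tensor names are hypothetical, the inherited Qwen 3 replacements are left out, and reading the first rule as "replace model.language_ with the empty string" is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// visionReplacer holds only the additional mappings listed above.
// The inherited Qwen 3 replacements are omitted from this sketch.
var visionReplacer = strings.NewReplacer(
	"model.language_", "", // assumption: the prefix is replaced by an empty string
	"model.visual", "v",
	"patch_embed.proj", "patch_embed",
	"blocks", "blk",
	"attn.qkv", "attn_qkv",
	"attn.proj", "attn_out",
	"deepstack_merger_list", "deepstack_merger",
)

// renameTensor is a hypothetical helper that applies every mapping in one pass.
func renameTensor(name string) string {
	return visionReplacer.Replace(name)
}

func main() {
	for _, name := range []string{
		"model.visual.blocks.0.attn.qkv.weight",
		"model.visual.patch_embed.proj.weight",
		"model.visual.deepstack_merger_list.0.norm.weight",
	} {
		fmt.Printf("%s -> %s\n", name, renameTensor(name))
	}
}
```

Because a Replacer scans the input once and never rescans its own output, the order of the pairs only matters when two patterns could match at the same position.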

Architecture-Specific Hyperparameters

The GGUF metadata uses a dynamic architecture prefix (qwen3vl for dense, qwen3vlmoe for MoE), inheriting all Qwen 3 text parameters plus:

Vision:

vision.block_count -- vision transformer depth (default 32)
vision.embedding_length -- vision hidden size
vision.attention.head_count -- vision attention heads (default 16)
vision.num_channels -- input channels
vision.patch_size -- patch size (default 14)
vision.spatial_merge_size -- spatial merge factor (default 2)
vision.attention.layer_norm_epsilon -- vision RMSNorm epsilon (default 1e-6)
vision.rope.freq_base -- vision RoPE theta (default 10000)
vision.temporal_patch_size -- temporal patch size (default 2)
vision.deepstack_visual_indexes -- indices for deepstack visual feature merging

Preprocessor:

vision.shortest_edge / vision.longest_edge -- image size constraints
vision.image_mean / vision.image_std -- normalization parameters
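
The defaults listed for the vision parameters can be pictured as "use the configured value unless it is zero". The sketch below shows that pattern with cmp.Or (Go 1.22 or later). The visionConfig struct, the visionKV function, and the explicit concatenation of the architecture prefix onto each key are assumptions made for illustration, not the converter's actual types.

```go
package main

import (
	"cmp"
	"fmt"
)

// visionConfig is a hypothetical stand-in for the parsed vision settings.
// A zero value means the setting was absent from the source config.
type visionConfig struct {
	Depth             uint32
	NumHeads          uint32
	PatchSize         uint32
	SpatialMergeSize  uint32
	TemporalPatchSize uint32
	RMSNormEps        float32
	RopeTheta         float32
}

// visionKV builds the vision metadata entries, falling back to the defaults
// documented above. Prefixing keys with arch is an assumption of this sketch.
func visionKV(arch string, c visionConfig) map[string]any {
	return map[string]any{
		arch + ".vision.block_count":                  cmp.Or(c.Depth, 32),
		arch + ".vision.attention.head_count":         cmp.Or(c.NumHeads, 16),
		arch + ".vision.patch_size":                   cmp.Or(c.PatchSize, 14),
		arch + ".vision.spatial_merge_size":           cmp.Or(c.SpatialMergeSize, 2),
		arch + ".vision.temporal_patch_size":          cmp.Or(c.TemporalPatchSize, 2),
		arch + ".vision.attention.layer_norm_epsilon": cmp.Or(c.RMSNormEps, 1e-6),
		arch + ".vision.rope.freq_base":               cmp.Or(c.RopeTheta, 10000),
	}
}

func main() {
	// Only the depth is configured; every other entry takes its default.
	kv := visionKV("qwen3vl", visionConfig{Depth: 27})
	fmt.Println(kv["qwen3vl.vision.block_count"], kv["qwen3vl.vision.patch_size"])
}
```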

Special Handling

Fused QKV Splitting

Vision encoder attn_qkv tensors are split along dimension 0 into three equal parts for Q, K, and V using the splitDim utility with string replacers.
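
To make the split concrete, here is a minimal sketch that does not use the splitDim utility itself. It assumes a row-major float32 buffer, so cutting along dimension 0 yields three contiguous slices. The namedTensor type, the splitQKV function, and the attn_q / attn_k / attn_v output names are assumptions of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// namedTensor is a hypothetical, simplified tensor: a name, a shape,
// and row-major data.
type namedTensor struct {
	Name  string
	Shape []uint64
	Data  []float32
}

// splitQKV splits a fused tensor along dimension 0 into three equal parts
// and renames each part with a string replacer.
func splitQKV(t namedTensor) ([]namedTensor, error) {
	if len(t.Shape) == 0 || t.Shape[0]%3 != 0 {
		return nil, fmt.Errorf("%s: dim 0 of shape %v is not divisible by 3", t.Name, t.Shape)
	}
	total := uint64(1)
	for _, d := range t.Shape {
		total *= d
	}
	if total != uint64(len(t.Data)) {
		return nil, fmt.Errorf("%s: shape %v does not match %d elements", t.Name, t.Shape, len(t.Data))
	}

	rows := t.Shape[0] / 3 // rows per part
	chunk := total / 3     // elements per part
	out := make([]namedTensor, 0, 3)
	for i, part := range []string{"attn_q", "attn_k", "attn_v"} {
		shape := append([]uint64{rows}, t.Shape[1:]...)
		out = append(out, namedTensor{
			Name:  strings.NewReplacer("attn_qkv", part).Replace(t.Name),
			Shape: shape,
			Data:  t.Data[uint64(i)*chunk : uint64(i+1)*chunk],
		})
	}
	return out, nil
}

func main() {
	data := make([]float32, 12)
	for i := range data {
		data[i] = float32(i)
	}
	parts, err := splitQKV(namedTensor{Name: "v.blk.0.attn_qkv.weight", Shape: []uint64{6, 2}, Data: data})
	if err != nil {
		panic(err)
	}
	for _, p := range parts {
		fmt.Println(p.Name, p.Shape, p.Data)
	}
}
```

The same function handles one-dimensional bias tensors, since the remaining dimensions are simply carried over unchanged.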

Patch Embedding Reshaping

Patch embedding weight tensors have their first two dimensions merged: shape [out_ch, in_ch, H, W] becomes [out_ch * in_ch, H, W]. This flattening is applied to tensors matching the patch_embed pattern with a weight suffix.
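
The reshape only changes the shape metadata: the first two dimensions are multiplied together and the rest are kept. A minimal sketch, with mergeLeadingDims as a hypothetical helper name and an illustrative shape:

```go
package main

import "fmt"

// mergeLeadingDims folds the first two dimensions of a shape into one,
// for example [out_ch, in_ch, H, W] -> [out_ch*in_ch, H, W].
// The trailing dimensions are copied unchanged.
func mergeLeadingDims(shape []uint64) ([]uint64, error) {
	if len(shape) < 2 {
		return nil, fmt.Errorf("need at least two dimensions, got shape %v", shape)
	}
	return append([]uint64{shape[0] * shape[1]}, shape[2:]...), nil
}

func main() {
	merged, err := mergeLeadingDims([]uint64{4, 3, 14, 14})
	if err != nil {
		panic(err)
	}
	fmt.Println(merged)
}
```

Since the element order in memory is unaffected, no data needs to be copied or transposed for this step.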

Dynamic MoE Architecture

Like the base Qwen 3 converter, the architecture identifier dynamically switches between qwen3vl and qwen3vlmoe based on whether MoE parameters are configured.
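
A minimal sketch of the switch follows. Treating a non-zero expert count as the signal that MoE parameters are configured is an assumption of this example, as is the function name architecture.

```go
package main

import "fmt"

// architecture returns the GGUF architecture identifier.
// Assumption: a non-zero expert count marks the model as MoE.
func architecture(numExperts uint32) string {
	arch := "qwen3vl"
	if numExperts > 0 {
		arch += "moe"
	}
	return arch
}

func main() {
	fmt.Println(architecture(0))   // dense model
	fmt.Println(architecture(128)) // MoE model
}
```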

Preprocessor Config

The converter reads preprocessor_config.json to extract image normalization parameters (mean, std), size constraints, and deepstack visual indexes, which are stored in the GGUF metadata.
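
The sketch below shows one way such a file could be decoded with encoding/json. The struct layout (edge sizes nested under a size object, as in typical HuggingFace preprocessor configs), the function name parsePreprocessor, and the sample values are assumptions; reading the file from disk is left out so the example stays self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// preprocessorConfig is a hypothetical subset of preprocessor_config.json.
// Nesting the edge sizes under "size" is an assumption of this sketch.
type preprocessorConfig struct {
	Size struct {
		ShortestEdge uint32 `json:"shortest_edge"`
		LongestEdge  uint32 `json:"longest_edge"`
	} `json:"size"`
	ImageMean []float32 `json:"image_mean"`
	ImageStd  []float32 `json:"image_std"`
}

// parsePreprocessor decodes the raw file contents. Unknown keys are ignored.
func parsePreprocessor(data []byte) (preprocessorConfig, error) {
	var c preprocessorConfig
	err := json.Unmarshal(data, &c)
	return c, err
}

func main() {
	// Illustrative values only, not taken from a real checkpoint.
	sample := []byte(`{
		"size": {"shortest_edge": 65536, "longest_edge": 16777216},
		"image_mean": [0.5, 0.5, 0.5],
		"image_std": [0.5, 0.5, 0.5]
	}`)
	c, err := parsePreprocessor(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Size.ShortestEdge, c.Size.LongestEdge, c.ImageMean, c.ImageStd)
}
```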

Inherited Qwen 3 Processing

Tensors that match neither the fused QKV pattern nor the patch embedding pattern are passed through to the base qwen3Model.Tensors() method, so all MoE expert tensor handling (gate-up splitting, transposition) is inherited from the Qwen 3 converter.
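
The routing described in this section can be sketched as a simple name-based dispatch. The action type, the route function, and the assumption that matching runs on already-renamed tensor names are all illustrative choices, not the converter's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// action is a hypothetical label for what happens to a tensor.
type action int

const (
	delegateToBase  action = iota // handled by the base Qwen 3 converter
	splitFusedQKV                 // split along dimension 0 into Q, K, V
	mergePatchEmbed               // merge the first two dimensions
)

// route decides how a tensor is processed, based on its (renamed) name.
func route(name string) action {
	switch {
	case strings.Contains(name, "attn_qkv"):
		return splitFusedQKV
	case strings.Contains(name, "patch_embed") && strings.HasSuffix(name, "weight"):
		return mergePatchEmbed
	default:
		return delegateToBase
	}
}

func main() {
	for _, name := range []string{
		"v.blk.0.attn_qkv.weight",
		"v.patch_embed.weight",
		"v.patch_embed.bias",
		"blk.0.attn_q.weight",
	} {
		fmt.Println(name, route(name))
	}
}
```

Note that a patch embedding bias falls through to the default branch, because only names ending in weight match the reshape rule.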

Deepstack Visual Merging

The deepstack_visual_indexes parameter specifies which intermediate vision transformer layers contribute features that are merged for the multimodal projector, enabling multi-scale visual representation.

Implementation Notes

The conversion is implemented in convert/convert_qwen3vl.go via the qwen3VLModel struct, which embeds qwen3Model under the text_config JSON tag. The parseMore method reads preprocessor_config.json and unmarshals it into the VisionModel struct. Tensor processing separates vision-specific operations (QKV split, patch embed reshape) from text operations, which are delegated to the embedded qwen3Model.
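
The embedding pattern relies on a documented encoding/json rule: an embedded struct field that carries a name in its JSON tag is decoded from that key, while its fields remain promoted on the outer struct. The sketch below demonstrates the rule with hypothetical types (textModel, vlModel) and hypothetical field and key names. It is not the converter's actual definition.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// textModel is a hypothetical stand-in for the base text model config.
type textModel struct {
	NumHiddenLayers uint32 `json:"num_hidden_layers"`
	NumExperts      uint32 `json:"num_experts"`
}

// vlModel embeds textModel. Because the embedded field has a JSON tag,
// it is populated from the "text_config" object instead of the top level.
type vlModel struct {
	textModel `json:"text_config"`

	VisionModel struct {
		Depth uint32 `json:"depth"`
	} `json:"vision_config"` // key name assumed for this sketch
}

// parseConfig decodes a combined text and vision config.
func parseConfig(data []byte) (vlModel, error) {
	var m vlModel
	err := json.Unmarshal(data, &m)
	return m, err
}

func main() {
	m, err := parseConfig([]byte(`{
		"text_config": {"num_hidden_layers": 36},
		"vision_config": {"depth": 27}
	}`))
	if err != nil {
		panic(err)
	}
	// Promoted field access: m.NumHiddenLayers reads m.textModel.NumHiddenLayers.
	fmt.Println(m.NumHiddenLayers, m.NumExperts, m.VisionModel.Depth)
}
```

Embedding keeps the base model's methods available on the outer type, which is what lets the vision converter override some of them and fall back to the rest.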

Related Pages

Implementation:Ollama_Ollama_Convert_Qwen3Vl
