

Implementation:Ollama Ollama Convert Qwen3Vl

From Leeroopedia
Knowledge Sources
Domains: Model Conversion, GGUF Format
Last Updated: 2025-02-15 00:00 GMT

Overview

Implements the GGUF model converter for the Qwen3 VL (Vision-Language) multimodal architecture, handling combined QKV decomposition, patch embedding reshaping, deepstack visual indexes, and preprocessor configuration parsing.

Description

The qwen3VLModel struct embeds qwen3Model, which handles the text model, and adds a vision configuration holding the deepstack visual indexes, spatial merge size, temporal patch size, and image normalization parameters. The parseMore method reads preprocessor_config.json to obtain the vision preprocessing settings (shortest and longest image size edges, mean, and std). The KV method calls the embedded qwen3Model.KV, overrides the architecture to qwen3vl (or qwen3vlmoe for MoE models), and adds vision-specific metadata. The Tensors method splits each combined attn_qkv tensor into separate Q, K, and V tensors, reshapes the patch embedding weights by merging their first two dimensions, and delegates the remaining tensors to the embedded qwen3Model.Tensors.
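To make the QKV decomposition concrete, here is a minimal sketch of splitting a combined tensor along dim 0. It is not the converter's own code: splitQKV is a hypothetical helper, the data is assumed to be a row-major [rows, cols] matrix, and the three parts are assumed to be equal in size.

```go
package main

import "fmt"

// splitQKV is a hypothetical helper, not part of the Ollama converter.
// It treats data as a row-major [rows, cols] matrix and splits it into
// three equal blocks along dim 0 (assumption: Q, K, and V have the same size).
// The returned slices share memory with data; nothing is copied.
func splitQKV(data []float32, rows, cols int) (q, k, v []float32, err error) {
	if rows%3 != 0 {
		return nil, nil, nil, fmt.Errorf("rows %d not divisible by 3", rows)
	}
	if len(data) != rows*cols {
		return nil, nil, nil, fmt.Errorf("data length %d does not match shape [%d, %d]", len(data), rows, cols)
	}
	// In row-major layout, a block of rows along dim 0 is contiguous,
	// so each part is a simple sub-slice of the flat buffer.
	n := rows / 3 * cols
	return data[:n], data[n : 2*n], data[2*n:], nil
}
```

Because the split runs along the outermost dimension, each part is a contiguous range of the original buffer, which is why plain sub-slicing suffices in this sketch.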

Usage

Invoked automatically when the model's architecture matches Qwen3VLForConditionalGeneration.

Code Reference

Source Location

  • Repository: Ollama
  • File: convert/convert_qwen3vl.go
  • Lines: 1-116

Signature

type qwen3VLModel struct {
    qwen3Model `json:"text_config"`
    VisionModel struct {
        Depth                  uint32    `json:"depth"`
        HiddenSize             uint32    `json:"hidden_size"`
        DeepstackVisualIndexes []int32   `json:"deepstack_visual_indexes"`
        Size struct {
            ShortestEdge uint32 `json:"shortest_edge"`
            LongestEdge  uint32 `json:"longest_edge"`
        } `json:"size"`
        ImageMean []float32 `json:"image_mean"`
        ImageStd  []float32 `json:"image_std"`
    } `json:"vision_config"`
}

func (m *qwen3VLModel) parseMore(fsys fs.FS) error
func (m *qwen3VLModel) KV(t *Tokenizer) KV
func (m *qwen3VLModel) Tensors(ts []Tensor) []*ggml.Tensor
func (m *qwen3VLModel) Replacements() []string

Import

import "github.com/ollama/ollama/convert"

I/O Contract

Inputs

Name    Type          Required    Description
t       *Tokenizer    Yes         Tokenizer data for GGUF metadata
ts      []Tensor      Yes         Source tensors, including combined QKV and patch embedding tensors
fsys    fs.FS         Yes         Filesystem for reading preprocessor_config.json
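The fsys input can be illustrated with a short sketch of reading preprocessor_config.json from an fs.FS. The helper names (preprocessorConfig, readPreprocessorConfig, sampleFS) are hypothetical, the JSON field layout follows the struct tags shown in the Signature section, and the numeric values in the sample file are made up for illustration.

```go
package main

import (
	"encoding/json"
	"io/fs"
	"testing/fstest"
)

// preprocessorConfig mirrors the preprocessing fields named on this page.
// The type itself is illustrative, not the converter's own.
type preprocessorConfig struct {
	Size struct {
		ShortestEdge uint32 `json:"shortest_edge"`
		LongestEdge  uint32 `json:"longest_edge"`
	} `json:"size"`
	ImageMean []float32 `json:"image_mean"`
	ImageStd  []float32 `json:"image_std"`
}

// readPreprocessorConfig is a hypothetical helper that loads and decodes
// preprocessor_config.json from the root of fsys.
func readPreprocessorConfig(fsys fs.FS) (*preprocessorConfig, error) {
	bts, err := fs.ReadFile(fsys, "preprocessor_config.json")
	if err != nil {
		return nil, err
	}
	var cfg preprocessorConfig
	if err := json.Unmarshal(bts, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

// sampleFS builds an in-memory filesystem for demonstration.
// All values in the JSON below are placeholders, not real model settings.
func sampleFS() fs.FS {
	return fstest.MapFS{
		"preprocessor_config.json": &fstest.MapFile{Data: []byte(`{
			"size": {"shortest_edge": 65536, "longest_edge": 16777216},
			"image_mean": [0.5, 0.5, 0.5],
			"image_std": [0.5, 0.5, 0.5]
		}`)},
	}
}
```

Calling readPreprocessorConfig(sampleFS()) decodes the placeholder file. Accepting an fs.FS rather than a path lets the same code run against a directory (via os.DirFS) or an in-memory fixture.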

Outputs

Name              Type     Description
KV                KV       GGUF metadata with qwen3vl.* keys for text, vision, and image preprocessing
[]*ggml.Tensor    slice    Converted tensors with decomposed QKV and reshaped patch embeddings
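The architecture override in the KV output might look roughly like the sketch below. The KV type here is a local stand-in for the converter's metadata map, overrideArchitecture is a hypothetical helper, and the moe flag is an assumption about how the MoE variant is detected. The general.architecture key is the standard GGUF metadata key for the architecture name.

```go
package main

// KV is a local stand-in for the converter's metadata map type.
type KV map[string]any

// overrideArchitecture is a hypothetical helper. It copies the metadata
// produced for the text model and replaces the architecture name with
// "qwen3vl", or "qwen3vlmoe" when moe is true (assumption: the caller
// already knows whether the model uses mixture-of-experts).
func overrideArchitecture(base KV, moe bool) KV {
	out := make(KV, len(base)+1)
	for k, v := range base {
		out[k] = v
	}
	arch := "qwen3vl"
	if moe {
		arch += "moe"
	}
	out["general.architecture"] = arch
	return out
}
```

The sketch copies the map instead of mutating it, which keeps the text-model metadata reusable. The actual converter may well modify the map in place.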

Usage Examples

// Converter registered for Qwen3VLForConditionalGeneration
// Architecture is "qwen3vl" or "qwen3vlmoe" (if MoE enabled)
// attn_qkv -> attn_q + attn_k + attn_v (split along dim 0)
// patch_embed weight shape [C_out, C_in, H, W] -> [C_out*C_in, H, W]
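
The patch embedding reshape noted above only changes the shape metadata, since merging the two leading dimensions of a contiguous tensor leaves the element order untouched. A minimal sketch, assuming shapes are stored as []uint64 and using a hypothetical helper name:

```go
package main

import "fmt"

// mergeLeadingDims is a hypothetical helper, not the converter's own code.
// It returns a new shape whose first dimension is the product of the first
// two dimensions of the input; the remaining dimensions are kept unchanged.
func mergeLeadingDims(shape []uint64) ([]uint64, error) {
	if len(shape) < 2 {
		return nil, fmt.Errorf("need at least 2 dimensions, got %d", len(shape))
	}
	merged := make([]uint64, 0, len(shape)-1)
	merged = append(merged, shape[0]*shape[1])
	return append(merged, shape[2:]...), nil
}
```

For an illustrative shape of [1152, 3, 16, 16], this yields [3456, 16, 16]. The input slice is not modified.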
