Implementation:Ollama Ollama Mtmd Qwen3VL

Knowledge Sources	Ollama
Domains	Multimodal, VisionEncoder
Last Updated	2025-02-15 00:00 GMT

Overview

Multimodal graph builder for the Qwen3-VL vision model, extending the Qwen2-VL architecture with learned position embeddings, patch bias, and deepstack feature merging.

Description

Implements clip_graph_qwen3vl::build(), which constructs a ggml computation graph for the Qwen3-VL vision encoder. The graph follows the Qwen2-VL pattern (dual conv2d patch embedding followed by pixel unshuffling) and adds a patch bias, learned position embeddings that are resized by bilinear interpolation, LayerNorm throughout, and M-RoPE position encoding. The key architectural addition is "deepstack" feature merging: certain layers carry their own deepstack weights (norm, fc1, fc2). At each such layer the intermediate representation is reshaped, normalized, and passed through an FFN, and the resulting features are accumulated by concatenation along the feature dimension. The final projection merges the spatial dimensions and concatenates the accumulated deepstack features with the standard output.
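To make the position-embedding resize concrete, here is a minimal standalone sketch of bilinear interpolation over a learned embedding grid. It does not use ggml and is not the code in qwen3vl.cpp: the helper name resize_pos_embd, the row-major [height][width][dim] layout, and the corner-aligned sampling convention are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ggml or Ollama): bilinearly resize a learned
// position-embedding grid stored row-major as [src_h][src_w][dim].
// Assumption: corner-aligned sampling; the real graph may use another convention.
std::vector<float> resize_pos_embd(const std::vector<float> & src,
                                   int src_h, int src_w, int dim,
                                   int dst_h, int dst_w) {
    assert(src_h > 0 && src_w > 0 && dim > 0 && dst_h > 0 && dst_w > 0);
    assert(src.size() == static_cast<size_t>(src_h) * src_w * dim);

    std::vector<float> dst(static_cast<size_t>(dst_h) * dst_w * dim);
    for (int y = 0; y < dst_h; ++y) {
        // Map the destination row onto the source grid.
        const float fy = dst_h > 1 ? y * float(src_h - 1) / float(dst_h - 1) : 0.0f;
        const int   y0 = static_cast<int>(std::floor(fy));
        const int   y1 = std::min(y0 + 1, src_h - 1);
        const float wy = fy - float(y0);
        for (int x = 0; x < dst_w; ++x) {
            const float fx = dst_w > 1 ? x * float(src_w - 1) / float(dst_w - 1) : 0.0f;
            const int   x0 = static_cast<int>(std::floor(fx));
            const int   x1 = std::min(x0 + 1, src_w - 1);
            const float wx = fx - float(x0);
            for (int d = 0; d < dim; ++d) {
                // Fetch the four neighbouring embeddings for this channel.
                const float v00 = src[(static_cast<size_t>(y0) * src_w + x0) * dim + d];
                const float v01 = src[(static_cast<size_t>(y0) * src_w + x1) * dim + d];
                const float v10 = src[(static_cast<size_t>(y1) * src_w + x0) * dim + d];
                const float v11 = src[(static_cast<size_t>(y1) * src_w + x1) * dim + d];
                // Interpolate along x first, then along y.
                const float top = v00 * (1.0f - wx) + v01 * wx;
                const float bot = v10 * (1.0f - wx) + v11 * wx;
                dst[(static_cast<size_t>(y) * dst_w + x) * dim + d] =
                    top * (1.0f - wy) + bot * wy;
            }
        }
    }
    return dst;
}
```

In the actual graph this step would be expressed as ggml tensor operations rather than explicit loops; the sketch only shows the arithmetic that lets one learned grid serve images with different patch counts.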

Usage

Automatically selected when the loaded CLIP model uses the PROJECTOR_TYPE_QWEN3VL projector type.

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/tools/mtmd/models/qwen3vl.cpp
Lines: 1-191

Signature

struct clip_graph_qwen3vl : public clip_graph {
    ggml_cgraph * build() override;
};

Import

#include "models.h"

I/O Contract

Inputs

Name	Type	Required	Description
model	clip_model &	Yes	Loaded Qwen3-VL CLIP model with ViT and deepstack weights
img	clip_image_f32 &	Yes	Preprocessed float image (dims divisible by patch_size * 2)
positions	ggml_tensor *	Yes	M-RoPE position tensor (n_patches * 4 elements)
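The size constraints in the table can be checked with a few lines of arithmetic. The sketch below is illustrative only: qwen3vl_shapes, dims_ok, and compute_shapes are hypothetical names, not Ollama or llama.cpp types, and the n_patches / 4 token count assumes that the spatial merge combines 2x2 neighbouring patches, which is consistent with the patch_size * 2 divisibility requirement.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of the sizes implied by the I/O contract.
struct qwen3vl_shapes {
    int    n_patches_x;      // patches along the image width
    int    n_patches_y;      // patches along the image height
    size_t n_patches;        // total patches before spatial merging
    size_t n_positions;      // M-RoPE position elements: n_patches * 4
    size_t n_output_tokens;  // assumed 2x2 merge: n_patches / 4
};

// True when both image dimensions are positive multiples of patch_size * 2.
inline bool dims_ok(int width, int height, int patch_size) {
    if (width <= 0 || height <= 0 || patch_size <= 0) {
        return false;
    }
    const int align = patch_size * 2;
    return width % align == 0 && height % align == 0;
}

// Derive patch, position, and output-token counts for a valid image size.
inline qwen3vl_shapes compute_shapes(int width, int height, int patch_size) {
    assert(dims_ok(width, height, patch_size));
    qwen3vl_shapes s{};
    s.n_patches_x     = width / patch_size;
    s.n_patches_y     = height / patch_size;
    s.n_patches       = static_cast<size_t>(s.n_patches_x) * s.n_patches_y;
    s.n_positions     = s.n_patches * 4;
    s.n_output_tokens = s.n_patches / 4;
    return s;
}
```

For example, with an assumed patch_size of 16, a 64x32 image yields 4 x 2 = 8 patches, 32 position elements, and 2 merged output tokens.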

Outputs

Name	Type	Description
(return value)	ggml_cgraph *	Computation graph producing embeddings with deepstack features concatenated along the feature dimension
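The concatenation that produces the output can be pictured as joining two per-token feature matrices row by row. The following is a plain C++ sketch, not the ggml graph code: concat_features is a hypothetical helper, the [n_tokens][dim] row-major layout is an assumption, and placing the standard projection before the deepstack features is also an assumption about ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: concatenate two row-major matrices laid out as
// [n_tokens][dim] along the feature dimension, giving [n_tokens][dim_a + dim_b].
std::vector<float> concat_features(const std::vector<float> & a, size_t dim_a,
                                   const std::vector<float> & b, size_t dim_b,
                                   size_t n_tokens) {
    assert(a.size() == n_tokens * dim_a);
    assert(b.size() == n_tokens * dim_b);

    std::vector<float> out;
    out.reserve(n_tokens * (dim_a + dim_b));
    for (size_t t = 0; t < n_tokens; ++t) {
        const auto a_row = a.begin() + static_cast<std::ptrdiff_t>(t * dim_a);
        const auto b_row = b.begin() + static_cast<std::ptrdiff_t>(t * dim_b);
        // Standard projection features first (assumed order) ...
        out.insert(out.end(), a_row, a_row + static_cast<std::ptrdiff_t>(dim_a));
        // ... followed by the accumulated deepstack features for the same token.
        out.insert(out.end(), b_row, b_row + static_cast<std::ptrdiff_t>(dim_b));
    }
    return out;
}
```

The token count is unchanged by this step; only the per-token feature width grows, which is why the output width is the sum of the projection and deepstack dimensions.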

Usage Examples

// Instantiated internally by clip.cpp for Qwen3-VL models
clip_graph_qwen3vl graph(ctx, img);
ggml_cgraph * gf = graph.build();
// Output: [projection_dim + deepstack_dim, n_patches / 4] embeddings

Related Pages

Principle:Ollama_Ollama_MultimodalPipeline
