Implementation:Ollama Ollama Mtmd Qwen3VL
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, VisionEncoder |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Multimodal graph builder for the Qwen3-VL vision model, extending the Qwen2-VL architecture with learned position embeddings, patch bias, and deepstack feature merging.
Description
Implements clip_graph_qwen3vl::build(), which constructs a ggml computation graph for the Qwen3-VL vision encoder. The graph follows the Qwen2-VL pattern (dual conv2d patch embedding followed by pixel unshuffling) and adds patch bias, learned position embeddings that can be resized via bilinear interpolation, LayerNorm throughout, and M-RoPE position encoding. The key architectural addition is "deepstack" feature merging: certain layers have associated deepstack weights (norm, fc1, fc2). At each such layer, features are extracted from the intermediate representation, reshaped, normalized, passed through an FFN, and concatenated along the feature dimension. The final projection merges the spatial dimensions and concatenates the accumulated deepstack features with the standard output.
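The bilinear resizing of the learned position embeddings can be illustrated outside of ggml. The following is a minimal sketch, not the code in qwen3vl.cpp: the PosEmbedGrid struct and resize_bilinear function are hypothetical names, and the half-pixel sampling convention is an assumption that may differ from what the real graph uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container (not a ggml or llama.cpp type): a learned
// position-embedding grid stored row-major as [height][width][n_embd].
struct PosEmbedGrid {
    int height;
    int width;
    int n_embd;
    std::vector<float> data;

    float at(int y, int x, int c) const {
        return data[(static_cast<std::size_t>(y) * width + x) * n_embd + c];
    }
};

// Bilinearly resample `src` to a dst_h x dst_w grid.
// ASSUMPTION: half-pixel sample centers with edge clamping.
PosEmbedGrid resize_bilinear(const PosEmbedGrid & src, int dst_h, int dst_w) {
    assert(src.height > 0 && src.width > 0 && src.n_embd > 0);
    assert(dst_h > 0 && dst_w > 0);
    assert(src.data.size() ==
           static_cast<std::size_t>(src.height) * src.width * src.n_embd);

    PosEmbedGrid dst{dst_h, dst_w, src.n_embd,
                     std::vector<float>(static_cast<std::size_t>(dst_h) * dst_w * src.n_embd)};

    const float scale_y = static_cast<float>(src.height) / dst_h;
    const float scale_x = static_cast<float>(src.width) / dst_w;

    std::size_t out = 0;
    for (int y = 0; y < dst_h; ++y) {
        // Map the destination row center back into source coordinates.
        const float fy = std::max(0.0f, (y + 0.5f) * scale_y - 0.5f);
        const int   y0 = std::min(static_cast<int>(fy), src.height - 1);
        const int   y1 = std::min(y0 + 1, src.height - 1);
        const float wy = fy - y0;
        for (int x = 0; x < dst_w; ++x) {
            const float fx = std::max(0.0f, (x + 0.5f) * scale_x - 0.5f);
            const int   x0 = std::min(static_cast<int>(fx), src.width - 1);
            const int   x1 = std::min(x0 + 1, src.width - 1);
            const float wx = fx - x0;
            // Interpolate every embedding channel independently.
            for (int c = 0; c < src.n_embd; ++c) {
                const float top = src.at(y0, x0, c) * (1.0f - wx) + src.at(y0, x1, c) * wx;
                const float bot = src.at(y1, x0, c) * (1.0f - wx) + src.at(y1, x1, c) * wx;
                dst.data[out++] = top * (1.0f - wy) + bot * wy;
            }
        }
    }
    return dst;
}
```

Resizing to the same dimensions returns the grid unchanged, which is a convenient sanity check when comparing against another implementation.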
Usage
Automatically selected when the loaded CLIP model uses the PROJECTOR_TYPE_QWEN3VL projector type.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/models/qwen3vl.cpp
- Lines: 1-191
Signature
struct clip_graph_qwen3vl : public clip_graph {
ggml_cgraph * build() override;
};
Import
#include "models.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | clip_model & | Yes | Loaded Qwen3-VL CLIP model with ViT and deepstack weights |
| img | clip_image_f32 & | Yes | Preprocessed float image (dims divisible by patch_size * 2) |
| positions | ggml_tensor * | Yes | M-RoPE position tensor (n_patches * 4 elements) |
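To make the "n_patches * 4 elements" size of the positions input concrete, here is a minimal sketch of how such a buffer could be filled on the host. The layout is an assumption that this page does not confirm: four consecutive planes of n_patches entries holding (y, x, y, x), with patches visited in 2x2 blocks so that patches merged together stay adjacent. The function name build_mrope_positions is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill an M-RoPE position buffer of n_patches * 4 ints for a grid of
// patches_h x patches_w patches.
// ASSUMED layout: plane 0 = y, plane 1 = x, plane 2 = y, plane 3 = x,
// each plane n_patches long, patches ordered block by block (merge x merge).
std::vector<int> build_mrope_positions(int patches_h, int patches_w, int merge = 2) {
    assert(merge > 0);
    assert(patches_h > 0 && patches_w > 0);
    assert(patches_h % merge == 0 && patches_w % merge == 0);

    const int n_patches = patches_h * patches_w;
    std::vector<int> pos(static_cast<std::size_t>(n_patches) * 4);

    int ptr = 0;
    for (int y = 0; y < patches_h; y += merge) {
        for (int x = 0; x < patches_w; x += merge) {
            // Walk the patches inside one merge block in row-major order.
            for (int dy = 0; dy < merge; ++dy) {
                for (int dx = 0; dx < merge; ++dx) {
                    pos[0 * n_patches + ptr] = y + dy;
                    pos[1 * n_patches + ptr] = x + dx;
                    pos[2 * n_patches + ptr] = y + dy;
                    pos[3 * n_patches + ptr] = x + dx;
                    ++ptr;
                }
            }
        }
    }
    return pos;
}
```

For a 2 x 4 patch grid this yields 32 values: the first plane holds the row indices and the second plane the column indices, in block order.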
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | ggml_cgraph * | Computation graph producing embeddings with deepstack features concatenated |
Usage Examples
// Instantiated internally by clip.cpp for Qwen3-VL models
clip_graph_qwen3vl graph(ctx, img);
ggml_cgraph * gf = graph.build();
// Output: [projection_dim + deepstack_dim, n_patches / 4] embeddings
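The size relationships stated on this page (image dimensions divisible by patch_size * 2, a positions tensor of n_patches * 4 elements, and n_patches / 4 output embeddings) can be checked with a small helper. This is a sketch under those stated relationships only; PatchLayout and compute_patch_layout are hypothetical names, and the patch size used in the usage note is an illustrative value, not one taken from the model.

```cpp
#include <cassert>

// Hypothetical summary of the sizes implied by the I/O contract.
struct PatchLayout {
    int patches_w;        // patches per row
    int patches_h;        // patches per column
    int n_patches;        // patches_w * patches_h
    int n_positions;      // elements in the M-RoPE positions tensor
    int n_output_tokens;  // embeddings after 2x2 spatial merging
};

// Returns false when the image does not satisfy the divisibility
// requirement (both dimensions divisible by patch_size * 2).
bool compute_patch_layout(int img_w, int img_h, int patch_size, PatchLayout & out) {
    assert(patch_size > 0);
    if (img_w <= 0 || img_h <= 0) {
        return false;
    }
    const int block = patch_size * 2;
    if (img_w % block != 0 || img_h % block != 0) {
        return false;
    }
    out.patches_w       = img_w / patch_size;
    out.patches_h       = img_h / patch_size;
    out.n_patches       = out.patches_w * out.patches_h;
    out.n_positions     = out.n_patches * 4;
    out.n_output_tokens = out.n_patches / 4;
    return true;
}
```

With an illustrative patch size of 16, a 448 x 448 image gives 28 x 28 = 784 patches, a positions tensor of 3136 elements, and 196 output embeddings; a 450 x 448 image is rejected because 450 is not divisible by 32.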