Implementation:Ollama Ollama Mtmd Qwen3VL

From Leeroopedia
Knowledge Sources
Domains Multimodal, VisionEncoder
Last Updated 2025-02-15 00:00 GMT

Overview

Multimodal graph builder for the Qwen3-VL vision model, extending the Qwen2-VL architecture with learned position embeddings, patch bias, and deepstack feature merging.

Description

Implements clip_graph_qwen3vl::build(), which constructs a ggml computation graph for the Qwen3-VL vision encoder. It follows the Qwen2-VL pattern of a dual conv2d patch embedding combined with pixel unshuffling, and adds a patch bias, learned position embeddings that are resized by bilinear interpolation, LayerNorm throughout, and M-RoPE position encoding. The key architectural addition is "deepstack" feature merging: certain layers have associated deepstack weights (norm, fc1, fc2) that extract features from that layer's intermediate representation, reshape and normalize them, and apply an FFN. The resulting features are concatenated along the feature dimension. The final projection merges the spatial dimensions and concatenates the accumulated deepstack features with the standard output.
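To make the resizing of learned position embeddings concrete, the following is a minimal standalone sketch of bilinear interpolation over an embedding grid. It does not use ggml and is not taken from the Ollama sources: the function name resize_pos_embd_bilinear, the row-major [height][width][n_embd] layout, and the half-pixel sampling convention are all assumptions made for illustration, and the real graph may sample differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Ollama or llama.cpp.
// Resizes a position-embedding grid stored row-major as
// [src_h][src_w][n_embd] to [dst_h][dst_w][n_embd].
// Assumption: half-pixel sample centers, edges clamped.
std::vector<float> resize_pos_embd_bilinear(const std::vector<float> & src,
                                            int src_h, int src_w,
                                            int dst_h, int dst_w,
                                            int n_embd) {
    std::vector<float> dst(static_cast<std::size_t>(dst_h) * dst_w * n_embd);
    const float scale_y = static_cast<float>(src_h) / dst_h;
    const float scale_x = static_cast<float>(src_w) / dst_w;

    // Read one channel of one source grid cell.
    auto at = [&](int y, int x, int c) {
        return src[(static_cast<std::size_t>(y) * src_w + x) * n_embd + c];
    };

    for (int y = 0; y < dst_h; ++y) {
        // Map the destination cell center back into source coordinates.
        const float fy = std::max(0.0f, (y + 0.5f) * scale_y - 0.5f);
        const int   y0 = std::min(static_cast<int>(fy), src_h - 1);
        const int   y1 = std::min(y0 + 1, src_h - 1);
        const float wy = fy - y0;
        for (int x = 0; x < dst_w; ++x) {
            const float fx = std::max(0.0f, (x + 0.5f) * scale_x - 0.5f);
            const int   x0 = std::min(static_cast<int>(fx), src_w - 1);
            const int   x1 = std::min(x0 + 1, src_w - 1);
            const float wx = fx - x0;
            for (int c = 0; c < n_embd; ++c) {
                // Blend horizontally on both rows, then vertically.
                const float top = at(y0, x0, c) * (1.0f - wx) + at(y0, x1, c) * wx;
                const float bot = at(y1, x0, c) * (1.0f - wx) + at(y1, x1, c) * wx;
                dst[(static_cast<std::size_t>(y) * dst_w + x) * n_embd + c] =
                    top * (1.0f - wy) + bot * wy;
            }
        }
    }
    return dst;
}
```

Under these assumptions, resizing a grid to its own size returns it unchanged, and a constant grid stays constant at any target size, which is a quick sanity check for any interpolation routine of this kind.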

Usage

Automatically selected when the loaded CLIP model uses the PROJECTOR_TYPE_QWEN3VL projector type.
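The selection by projector type can be pictured as a factory keyed on an enum. The sketch below is a self-contained mock: projector_type, graph_builder, and make_builder are hypothetical stand-ins invented for illustration, not the real types or functions in the llama.cpp mtmd sources, which differ in detail.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Everything here is a hypothetical stand-in for illustration only.
enum class projector_type { qwen2vl, qwen3vl };

struct graph_builder {
    virtual ~graph_builder() = default;
    virtual std::string name() const = 0;
};

struct qwen2vl_builder : graph_builder {
    std::string name() const override { return "qwen2vl"; }
};

struct qwen3vl_builder : graph_builder {
    std::string name() const override { return "qwen3vl"; }
};

// Pick the graph builder that matches the projector type of the loaded model.
std::unique_ptr<graph_builder> make_builder(projector_type type) {
    switch (type) {
        case projector_type::qwen2vl: return std::make_unique<qwen2vl_builder>();
        case projector_type::qwen3vl: return std::make_unique<qwen3vl_builder>();
    }
    throw std::invalid_argument("unsupported projector type");
}
```

The point of the pattern is that callers never name the Qwen3-VL builder directly; the projector type stored with the model decides which builder runs.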

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/tools/mtmd/models/qwen3vl.cpp
  • Lines: 1-191

Signature

struct clip_graph_qwen3vl : public clip_graph {
    ggml_cgraph * build() override;
};

Import

#include "models.h"

I/O Contract

Inputs

Name      | Type             | Required | Description
model     | clip_model &     | Yes      | Loaded Qwen3-VL CLIP model with ViT and deepstack weights
img       | clip_image_f32 & | Yes      | Preprocessed float image (dims divisible by patch_size * 2)
positions | ggml_tensor *    | Yes      | M-RoPE position tensor (n_patches * 4 elements)
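The size constraints in the input table can be checked with simple arithmetic before a graph is built. This is a minimal sketch under stated assumptions: check_vision_input and vision_input_dims are hypothetical names, not functions from the Ollama sources, and the patch count is assumed to be (width / patch_size) * (height / patch_size).

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical summary of the input contract; names are illustrative.
struct vision_input_dims {
    int         n_patches_x;
    int         n_patches_y;
    int         n_patches;
    std::size_t n_positions; // elements expected in the M-RoPE position tensor
};

vision_input_dims check_vision_input(int img_w, int img_h, int patch_size) {
    if (img_w <= 0 || img_h <= 0 || patch_size <= 0) {
        throw std::invalid_argument("image dimensions and patch_size must be positive");
    }
    // Both image dimensions must be divisible by patch_size * 2.
    const int align = patch_size * 2;
    if (img_w % align != 0 || img_h % align != 0) {
        throw std::invalid_argument("image dimensions must be divisible by patch_size * 2");
    }
    vision_input_dims d{};
    d.n_patches_x = img_w / patch_size;
    d.n_patches_y = img_h / patch_size;
    d.n_patches   = d.n_patches_x * d.n_patches_y;
    // The position tensor holds four values per patch.
    d.n_positions = static_cast<std::size_t>(d.n_patches) * 4;
    return d;
}
```

For example, a 448 x 448 image with a patch size of 16 gives a 28 x 28 grid, 784 patches, and a position tensor of 3136 elements, while a 450-pixel-wide image would be rejected.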

Outputs

Name          | Type    | Description
ggml_cgraph * | pointer | Computation graph producing embeddings with deepstack features concatenated

Usage Examples

// Instantiated internally by clip.cpp for Qwen3-VL models
clip_graph_qwen3vl graph(ctx, img);
ggml_cgraph * gf = graph.build();
// Output: [projection_dim + deepstack_dim, n_patches / 4] embeddings
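The output shape in the comment above can be read as a per-token concatenation along the feature dimension. The sketch below illustrates that idea with plain vectors rather than ggml tensors; token_features and concat_feature_dim are hypothetical names, and the assumption is that the projected output and the accumulated deepstack features have one row per merged token.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical illustration: one inner vector per merged token.
using token_features = std::vector<std::vector<float>>;

// Append each token's deepstack features after its projected features,
// so every output row has projection_dim + deepstack_dim entries.
token_features concat_feature_dim(const token_features & projected,
                                  const token_features & deepstack) {
    if (projected.size() != deepstack.size()) {
        throw std::invalid_argument("token counts must match");
    }
    token_features out;
    out.reserve(projected.size());
    for (std::size_t i = 0; i < projected.size(); ++i) {
        std::vector<float> row = projected[i];
        row.insert(row.end(), deepstack[i].begin(), deepstack[i].end());
        out.push_back(std::move(row));
    }
    return out;
}
```

The token count stays the same and only the row width grows. The n_patches / 4 factor in the comment is consistent with merging each 2 x 2 block of patches into one token, so 784 patches would yield 196 output tokens.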
