Implementation:Ollama Ollama Mtmd Qwen2VL
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, VisionEncoder |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Multimodal graph builder for Qwen2-VL and Qwen2.5-VL vision models, implementing dual convolution, M-RoPE, optional window attention, and spatial merge projection.
Description
Implements clip_graph_qwen2vl::build(), which constructs a ggml computation graph for the Qwen2-VL vision encoder. The graph combines the following stages:
- Dual conv2d patch embedding: two convolution layers whose outputs are summed, followed by pixel unshuffling.
- M-RoPE position encoding: multi-section rotary encoding with four position values per patch, applied with ggml_rope_multi.
- Optional window attention: uses masking and index-based reordering for efficiency. The inverse window index reorders patches before the transformer, and the window index restores the original order after projection.
- Adaptive normalization: RMS norm for Qwen 2.5 VL, LayerNorm for Qwen 2 VL.
- Spatial merge FFN projector: reshapes patches into groups of 4 and projects each group to the language model dimension.
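The window reordering described above relies on a permutation and its inverse cancelling each other out. The following is a minimal sketch of that idea in plain C++, not the ggml graph code itself: `gather_rows` and `invert_permutation` are hypothetical helpers introduced only for illustration, and the real implementation operates on tensors rather than `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a ggml function): out[i] = in[idx[i]].
template <typename T>
std::vector<T> gather_rows(const std::vector<T> & in, const std::vector<int> & idx) {
    assert(in.size() == idx.size());
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] >= 0 && static_cast<std::size_t>(idx[i]) < in.size());
        out[i] = in[static_cast<std::size_t>(idx[i])];
    }
    return out;
}

// Hypothetical helper: builds inv such that inv[idx[i]] == i.
std::vector<int> invert_permutation(const std::vector<int> & idx) {
    std::vector<int> inv(idx.size(), -1);
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] >= 0 && static_cast<std::size_t>(idx[i]) < idx.size());
        inv[static_cast<std::size_t>(idx[i])] = static_cast<int>(i);
    }
    return inv;
}
```

Gathering with the inverse index first and with the window index afterwards returns every element to its original slot, which is why the encoder can process patches in window order and still hand the language model embeddings in the original order.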
Usage
Automatically selected when the projector type of the loaded CLIP model is PROJECTOR_TYPE_QWEN2VL or PROJECTOR_TYPE_QWEN25VL.
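The projector type also determines which normalization the graph uses (RMS norm for Qwen 2.5 VL, LayerNorm for Qwen 2 VL). The sketch below illustrates the difference between the two on a single vector. The enums `projector_kind` and `norm_kind` and both functions are hypothetical stand-ins, not the identifiers used in the source, and the learned weight and bias terms are omitted for brevity.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the two projector types named above.
enum class projector_kind { qwen2vl, qwen25vl };
enum class norm_kind { layer_norm, rms_norm };

// Qwen 2.5 VL uses RMS norm; Qwen 2 VL uses LayerNorm.
norm_kind select_norm(projector_kind p) {
    return p == projector_kind::qwen25vl ? norm_kind::rms_norm : norm_kind::layer_norm;
}

// Normalizes one vector. LayerNorm subtracts the mean before scaling;
// RMS norm skips the mean subtraction and scales by the root mean square.
std::vector<float> normalize(const std::vector<float> & x, norm_kind kind, float eps) {
    assert(!x.empty());
    const float n = static_cast<float>(x.size());
    float mean = 0.0f;
    if (kind == norm_kind::layer_norm) {
        for (float v : x) mean += v;
        mean /= n;
    }
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += (v - mean) * (v - mean);
    const float scale = 1.0f / std::sqrt(sum_sq / n + eps);
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = (x[i] - mean) * scale;
    return out;
}
```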
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/models/qwen2vl.cpp
- Lines: 1-183
Signature
struct clip_graph_qwen2vl : public clip_graph {
    ggml_cgraph * build() override;
};
Import
#include "models.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | clip_model & | Yes | Loaded Qwen2-VL CLIP model with weights |
| img | clip_image_f32 & | Yes | Preprocessed float image (width and height divisible by patch_size * 2) |
| positions | ggml_tensor * | Yes | M-RoPE position tensor (n_patches * 4 elements) |
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | ggml_cgraph * | Computation graph producing spatially-merged LLM embeddings |
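To make the size of the positions input concrete, the sketch below fills a buffer of n_patches * 4 integers for a grid of patches. The layout is an assumption made for illustration and is not stated on this page: four contiguous sections of n_patches entries each (row, column, row, column), with patches visited in 2x2 merge-group order. The function name `make_mrope_positions` is hypothetical, and the layout should be checked against the code that fills the tensor before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout (not confirmed by this page): four contiguous sections of
// n_patches entries each, holding row, column, row, column indices, with
// patches visited one 2x2 merge group at a time.
std::vector<int32_t> make_mrope_positions(int patches_w, int patches_h) {
    assert(patches_w > 0 && patches_h > 0);
    assert(patches_w % 2 == 0 && patches_h % 2 == 0);
    const int n_patches = patches_w * patches_h;
    std::vector<int32_t> pos(static_cast<std::size_t>(n_patches) * 4);
    int p = 0;
    for (int y = 0; y < patches_h; y += 2) {
        for (int x = 0; x < patches_w; x += 2) {
            for (int dy = 0; dy < 2; ++dy) {
                for (int dx = 0; dx < 2; ++dx) {
                    pos[static_cast<std::size_t>(0 * n_patches + p)] = y + dy;
                    pos[static_cast<std::size_t>(1 * n_patches + p)] = x + dx;
                    pos[static_cast<std::size_t>(2 * n_patches + p)] = y + dy;
                    pos[static_cast<std::size_t>(3 * n_patches + p)] = x + dx;
                    ++p;
                }
            }
        }
    }
    assert(p == n_patches);
    return pos;
}
```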
Usage Examples
// Instantiated internally by clip.cpp for Qwen2-VL models
clip_graph_qwen2vl graph(ctx, img);
ggml_cgraph * gf = graph.build();
// Output: [projection_dim, n_patches / 4] embeddings
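The n_patches / 4 figure in the output shape follows from the 2x2 spatial merge. As a rough sketch of the arithmetic, the hypothetical helper below (not part of the source) derives the patch and token counts from the image dimensions. The patch size of 14 used in the example call is an assumed value for illustration only; the real value comes from the loaded model.

```cpp
#include <cassert>

// Hypothetical result type for illustration.
struct merged_shape {
    int n_patches; // patches before the spatial merge
    int n_tokens;  // embeddings after merging groups of 4 patches
};

// Computes how many embeddings the projector emits for an image whose
// width and height are divisible by patch_size * 2.
merged_shape merged_token_count(int width, int height, int patch_size) {
    assert(patch_size > 0);
    assert(width % (patch_size * 2) == 0);
    assert(height % (patch_size * 2) == 0);
    const int patches_w = width / patch_size;
    const int patches_h = height / patch_size;
    const int n_patches = patches_w * patches_h;
    return { n_patches, n_patches / 4 };
}
```

For example, `merged_token_count(224, 224, 14)` yields 256 patches and 64 merged embeddings, so the output would have shape [projection_dim, 64].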