Implementation:Ollama Ollama Mtmd Qwen2VL

Knowledge Sources	Ollama
Domains	Multimodal, VisionEncoder
Last Updated	2025-02-15 00:00 GMT

Overview

Multimodal graph builder for Qwen2-VL and Qwen2.5-VL vision models, implementing dual convolution, M-RoPE, optional window attention, and spatial merge projection.

Description

Implements clip_graph_qwen2vl::build(), which constructs a ggml computation graph for the Qwen2-VL vision encoder. The graph uses dual conv2d patch embedding (the outputs of two convolution layers are summed, then pixel-unshuffled) and M-RoPE position encoding (a multi-section rotary embedding with four position components per patch) applied through ggml_rope_multi. Normalization depends on the model version: RMSNorm for Qwen2.5-VL, LayerNorm for Qwen2-VL. A spatial merge FFN projector reshapes patches into groups of 4 and projects each group to the language model dimension. Window attention is optional and combines an attention mask with index-based reordering for efficiency: the inverse window index reorders patches before the transformer, and the window index restores the original order after projection.
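The reorder-and-restore step of window attention can be illustrated without ggml. The sketch below is a minimal standalone example, not the code in qwen2vl.cpp: invert_permutation and gather_rows are hypothetical helpers, rows are represented by plain integers, and the only property it demonstrates is that gathering with one index and then with its inverse returns the original order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: build the inverse of a permutation,
// so that inv[perm[i]] == i for every i.
std::vector<int> invert_permutation(const std::vector<int> & perm) {
    std::vector<int> inv(perm.size());
    for (std::size_t i = 0; i < perm.size(); ++i) {
        assert(perm[i] >= 0 && static_cast<std::size_t>(perm[i]) < perm.size());
        inv[static_cast<std::size_t>(perm[i])] = static_cast<int>(i);
    }
    return inv;
}

// Hypothetical helper: gather rows by index, out[i] = rows[idx[i]].
// Each int stands in for one row (one group of merged patches).
std::vector<int> gather_rows(const std::vector<int> & rows,
                             const std::vector<int> & idx) {
    assert(rows.size() == idx.size());
    std::vector<int> out(rows.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        out[i] = rows[static_cast<std::size_t>(idx[i])];
    }
    return out;
}
```

Gathering with a permutation groups rows that share a window next to each other; gathering the result with the inverse permutation puts every row back where it started, which is why the graph needs both index tensors.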

Usage

Selected automatically when the loaded CLIP model's projector type is PROJECTOR_TYPE_QWEN2VL or PROJECTOR_TYPE_QWEN25VL.

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/tools/mtmd/models/qwen2vl.cpp
Lines: 1-183

Signature

struct clip_graph_qwen2vl : public clip_graph {
    ggml_cgraph * build() override;
};

Import

#include "models.h"

I/O Contract

Inputs

Name	Type	Required	Description
model	clip_model &	Yes	Loaded Qwen2-VL CLIP model with weights
img	clip_image_f32 &	Yes	Preprocessed float image (width and height divisible by patch_size * 2)
positions	ggml_tensor *	Yes	M-RoPE position tensor (n_patches * 4 elements)
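To make the size of the positions input concrete, here is a minimal sketch of how a tensor of n_patches * 4 position ids could be filled for a patch grid. The function name build_mrope_positions, the section order (row, column, row, column), and the 2x2 block traversal are assumptions modelled on how upstream llama.cpp is understood to prepare this input; this page does not confirm them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch only: fills 4 sections of n_patches ids each.
// ASSUMPTION: sections are ordered (y, x, y, x) and patches are visited
// in merge x merge blocks so that each block is contiguous in memory.
std::vector<int> build_mrope_positions(int pw, int ph, int merge = 2) {
    assert(pw > 0 && ph > 0 && merge > 0);
    assert(pw % merge == 0 && ph % merge == 0);
    const int n_patches = pw * ph;
    std::vector<int> pos(static_cast<std::size_t>(4 * n_patches));
    int ptr = 0;
    for (int y = 0; y < ph; y += merge) {
        for (int x = 0; x < pw; x += merge) {
            for (int dy = 0; dy < merge; ++dy) {
                for (int dx = 0; dx < merge; ++dx) {
                    pos[0 * n_patches + ptr] = y + dy; // section 0: row
                    pos[1 * n_patches + ptr] = x + dx; // section 1: column
                    pos[2 * n_patches + ptr] = y + dy; // section 2: row again
                    pos[3 * n_patches + ptr] = x + dx; // section 3: column again
                    ++ptr;
                }
            }
        }
    }
    return pos;
}
```

Visiting patches block by block keeps the four patches of each 2x2 group adjacent, which matches the later spatial merge into groups of 4.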

Outputs

Name	Type	Description
(return value)	ggml_cgraph *	Computation graph producing spatially merged embeddings for the language model

Usage Examples

// Instantiated internally by clip.cpp for Qwen2-VL models
clip_graph_qwen2vl graph(ctx, img);
ggml_cgraph * gf = graph.build();
// Output: [projection_dim, n_patches / 4] embeddings
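The output shape noted above follows from simple arithmetic on the image size. The sketch below is a standalone illustration under stated assumptions: merged_shape and qwen2vl_output_tokens are hypothetical names, and the patch size passed in is supplied by the caller rather than read from a model.

```cpp
#include <cassert>

// Hypothetical result type: patch count before and token count after the merge.
struct merged_shape {
    int n_patches; // patches produced by the patch embedding
    int n_tokens;  // embeddings left after merging groups of 4 patches
};

// Sketch: derive the number of output embeddings from the image size.
// Width and height must be divisible by patch_size * 2, as the I/O contract states.
merged_shape qwen2vl_output_tokens(int width, int height, int patch_size) {
    assert(patch_size > 0);
    assert(width  % (patch_size * 2) == 0);
    assert(height % (patch_size * 2) == 0);
    const int pw = width  / patch_size; // patches per row
    const int ph = height / patch_size; // patches per column
    const int n_patches = pw * ph;
    return { n_patches, n_patches / 4 }; // each 2x2 group becomes one token
}
```

For example, a 448 x 448 image with an example patch size of 14 yields 32 x 32 = 1024 patches and therefore 256 output embeddings.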

Related Pages

Principle:Ollama_Ollama_MultimodalPipeline
