Implementation:Ollama Ollama Mtmd Qwen2VL

From Leeroopedia
Knowledge Sources
Domains: Multimodal, VisionEncoder
Last Updated: 2025-02-15 00:00 GMT

Overview

Multimodal graph builder for Qwen2-VL and Qwen2.5-VL vision models, implementing dual convolution, M-RoPE, optional window attention, and spatial merge projection.

Description

Implements clip_graph_qwen2vl::build(), which constructs a ggml computation graph for the Qwen2-VL vision encoder. The graph combines the following stages:

  • Dual conv2d patch embedding: the outputs of two convolution layers are summed and then rearranged with pixel unshuffling.
  • M-RoPE position encoding: multi-section rotary embeddings with four position components per patch, applied through ggml_rope_multi.
  • Optional window attention: attention masking plus index-based reordering of patches for efficiency. An inverse window index reorders the patches before the transformer, and the window index restores the original order after projection.
  • Model-dependent normalization: RMS norm for Qwen2.5-VL and LayerNorm for Qwen2-VL.
  • Spatial merge FFN projector: patches are reshaped into groups of 4 and projected to the language model dimension.
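The window reordering described above amounts to applying a permutation before the transformer and its inverse afterwards. The following is a minimal sketch of that round trip on plain vectors rather than ggml tensors. The helper names gather_rows and invert_permutation are hypothetical and do not come from qwen2vl.cpp; they only illustrate why gathering with the inverse window index and later with the window index restores the original order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: row gather, out[i] = rows[idx[i]].
// This mirrors the idea of an index-based row lookup on a tensor.
std::vector<int> gather_rows(const std::vector<int> & rows,
                             const std::vector<std::size_t> & idx) {
    std::vector<int> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < rows.size());  // every index must address a valid row
        out[i] = rows[idx[i]];
    }
    return out;
}

// Hypothetical helper: builds the inverse permutation, inv[perm[i]] = i.
std::vector<std::size_t> invert_permutation(const std::vector<std::size_t> & perm) {
    std::vector<std::size_t> inv(perm.size());
    for (std::size_t i = 0; i < perm.size(); ++i) {
        assert(perm[i] < perm.size());  // perm must be a permutation of 0..n-1
        inv[perm[i]] = i;
    }
    return inv;
}
```

Gathering with the inverse index first and the forward index second yields the input unchanged, because the second lookup reads position inv[win[i]] = i of the original data.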

Usage

This graph builder is selected automatically when the loaded CLIP model's projector type is PROJECTOR_TYPE_QWEN2VL or PROJECTOR_TYPE_QWEN25VL.
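Both projector types share one graph builder, and the Description notes that they differ in normalization (RMS norm for Qwen2.5-VL, LayerNorm for Qwen2-VL). The sketch below illustrates that branch on a plain float vector. The enum projector_kind and the functions layer_norm, rms_norm, and normalize are hypothetical stand-ins rather than the actual ggml or clip.cpp API, and learned scale and bias weights are omitted for brevity.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the two projector types handled by this builder.
enum class projector_kind { qwen2vl, qwen25vl };

// LayerNorm without learned weights: subtract the mean, divide by the std deviation.
std::vector<float> layer_norm(const std::vector<float> & x, float eps) {
    const float n = static_cast<float>(x.size());
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= n;
    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= n;
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = (x[i] - mean) / std::sqrt(var + eps);
    }
    return out;
}

// RMS norm without learned weights: divide by the root mean square, no mean subtraction.
std::vector<float> rms_norm(const std::vector<float> & x, float eps) {
    const float n = static_cast<float>(x.size());
    float mean_sq = 0.0f;
    for (float v : x) mean_sq += v * v;
    mean_sq /= n;
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] / std::sqrt(mean_sq + eps);
    }
    return out;
}

// Picks the normalization by model family, as described for this graph.
std::vector<float> normalize(projector_kind kind, const std::vector<float> & x, float eps) {
    return kind == projector_kind::qwen25vl ? rms_norm(x, eps) : layer_norm(x, eps);
}
```

The practical difference is that RMS norm keeps the mean of the activations, while LayerNorm centers them first.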

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/tools/mtmd/models/qwen2vl.cpp
  • Lines: 1-183

Signature

struct clip_graph_qwen2vl : public clip_graph {
    ggml_cgraph * build() override;
};

Import

#include "models.h"

I/O Contract

Inputs

Name | Type | Required | Description
model | clip_model & | Yes | Loaded Qwen2-VL CLIP model with weights
img | clip_image_f32 & | Yes | Preprocessed float image (dims divisible by patch_size * 2)
positions | ggml_tensor * | Yes | M-RoPE position tensor (n_patches * 4 elements)

Outputs

Name | Type | Description
ggml_cgraph * | pointer | Computation graph producing spatially-merged LLM embeddings
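The size constraints in the contract above can be checked with simple arithmetic. The following sketch derives the patch count, the expected number of elements in the positions tensor, and the number of merged output tokens from the image size. The struct qwen2vl_shapes and the function compute_shapes are hypothetical names introduced for illustration; the only assumptions taken from this page are that image dimensions must be divisible by patch_size * 2, that positions holds n_patches * 4 elements, and that patches are merged in groups of 4.

```cpp
#include <stdexcept>

// Hypothetical container for the sizes implied by the I/O contract.
struct qwen2vl_shapes {
    int n_patches_x;      // patches along the width
    int n_patches_y;      // patches along the height
    int n_patches;        // total patches before spatial merge
    int n_positions;      // elements expected in the M-RoPE positions tensor
    int n_merged_tokens;  // output tokens after merging groups of 4 patches
};

// Hypothetical helper: validates the image size and derives the sizes above.
qwen2vl_shapes compute_shapes(int width, int height, int patch_size) {
    if (width <= 0 || height <= 0 || patch_size <= 0) {
        throw std::invalid_argument("sizes must be positive");
    }
    const int align = patch_size * 2;  // required so patches form complete 2x2 groups
    if (width % align != 0 || height % align != 0) {
        throw std::invalid_argument("image dims must be divisible by patch_size * 2");
    }
    qwen2vl_shapes s{};
    s.n_patches_x     = width / patch_size;
    s.n_patches_y     = height / patch_size;
    s.n_patches       = s.n_patches_x * s.n_patches_y;
    s.n_positions     = s.n_patches * 4;  // four position components per patch
    s.n_merged_tokens = s.n_patches / 4;  // groups of 4 patches become one token
    return s;
}
```

For example, a 224 x 224 image with patch_size 14 gives 16 x 16 = 256 patches, 1024 position elements, and 64 merged tokens.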

Usage Examples

// Instantiated internally by clip.cpp for Qwen2-VL models
clip_graph_qwen2vl graph(ctx, img);
ggml_cgraph * gf = graph.build();
// Output: [projection_dim, n_patches / 4] embeddings
