Implementation:Ollama Ollama Mtmd SigLIP

From Leeroopedia
Knowledge Sources
Domains Multimodal, VisionEncoder
Last Updated 2025-02-15 00:00 GMT

Overview

Multimodal graph builder for the SigLIP vision model, also serving as the adapter for Gemma 3, Idefics3, LFM2, and Janus Pro models.

Description

Implements clip_graph_siglip::build(), which constructs a ggml computation graph for the SigLIP vision encoder. The encoder is a standard ViT with LayerNorm and learned position embeddings; for LFM2 the position embeddings can be resized via bilinear interpolation. After the encoder, the graph applies one of several projectors, each mapping the ViT output to the language model's embedding space with a different spatial reduction strategy:

  • Gemma 3: avg pool2d downsampling + RMS norm + soft_emb_norm + transposed linear projection
  • Idefics3: pixel shuffle via build_patch_merge_permute + linear projection
  • LFM2: pixel unshuffle + LayerNorm + MLP with GELU
  • Janus Pro: direct MLP using the model's FFN op
  • Any other projector type: triggers an abort
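The two main spatial reduction strategies named above can be sketched on plain arrays, without ggml. This is an illustrative sketch only: PatchGrid, make_grid, avg_pool_2d, and merge_patches are hypothetical names, the row-major layout is an assumption, and the channel ordering produced by the real build_patch_merge_permute may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for a grid of patch embeddings (not a ggml type).
// Assumed layout: data[(y * width + x) * channels + c].
struct PatchGrid {
    int width;
    int height;
    int channels;
    std::vector<float> data;

    float & at(int x, int y, int c) {
        return data[(static_cast<std::size_t>(y) * width + x) * channels + c];
    }
    float at(int x, int y, int c) const {
        return data[(static_cast<std::size_t>(y) * width + x) * channels + c];
    }
};

PatchGrid make_grid(int w, int h, int c) {
    return PatchGrid{w, h, c,
                     std::vector<float>(static_cast<std::size_t>(w) * h * c, 0.0f)};
}

// Average pooling with a k x k window and stride k (the Gemma 3 style of
// reduction): fewer tokens, same channel count.
PatchGrid avg_pool_2d(const PatchGrid & in, int k) {
    assert(k > 0 && in.width % k == 0 && in.height % k == 0);
    PatchGrid out = make_grid(in.width / k, in.height / k, in.channels);
    for (int oy = 0; oy < out.height; ++oy) {
        for (int ox = 0; ox < out.width; ++ox) {
            for (int c = 0; c < in.channels; ++c) {
                float sum = 0.0f;
                for (int dy = 0; dy < k; ++dy) {
                    for (int dx = 0; dx < k; ++dx) {
                        sum += in.at(ox * k + dx, oy * k + dy, c);
                    }
                }
                out.at(ox, oy, c) = sum / static_cast<float>(k * k);
            }
        }
    }
    return out;
}

// Patch merging (space-to-depth): each s x s block of neighbouring patches is
// concatenated along the channel axis. Fewer tokens, wider channels, and no
// values are discarded.
PatchGrid merge_patches(const PatchGrid & in, int s) {
    assert(s > 0 && in.width % s == 0 && in.height % s == 0);
    PatchGrid out = make_grid(in.width / s, in.height / s, in.channels * s * s);
    for (int oy = 0; oy < out.height; ++oy) {
        for (int ox = 0; ox < out.width; ++ox) {
            for (int dy = 0; dy < s; ++dy) {
                for (int dx = 0; dx < s; ++dx) {
                    for (int c = 0; c < in.channels; ++c) {
                        out.at(ox, oy, (dy * s + dx) * in.channels + c) =
                            in.at(ox * s + dx, oy * s + dy, c);
                    }
                }
            }
        }
    }
    return out;
}
```

Both functions reduce the token count by a factor of k² or s². Pooling averages information away while keeping the channel width fixed, whereas merging preserves every value and leaves the following linear layer or MLP to compress the widened channels.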

Usage

Automatically selected when the loaded CLIP model's projector type is one of PROJECTOR_TYPE_GEMMA3, PROJECTOR_TYPE_IDEFICS3, PROJECTOR_TYPE_LFM2, or PROJECTOR_TYPE_JANUS_PRO.
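The selection rule can be pictured as a switch over the projector type. The sketch below is not the code from clip.cpp: ProjectorKind, uses_siglip_builder, and projector_summary are hypothetical stand-ins that mirror the four projector types listed on this page and the abort-on-unknown behaviour mentioned in the Description.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for the projector enum; the real PROJECTOR_TYPE_*
// constants are defined in the llama.cpp mtmd sources.
enum class ProjectorKind { Gemma3, Idefics3, Lfm2, JanusPro, Other };

// True when the SigLIP graph builder would handle this projector kind.
bool uses_siglip_builder(ProjectorKind kind) {
    switch (kind) {
        case ProjectorKind::Gemma3:
        case ProjectorKind::Idefics3:
        case ProjectorKind::Lfm2:
        case ProjectorKind::JanusPro:
            return true;
        case ProjectorKind::Other:
            return false;
    }
    return false;
}

// Short description of the projector stage, as summarised on this page.
// Unsupported kinds abort, mirroring the default branch described above.
std::string projector_summary(ProjectorKind kind) {
    assert(uses_siglip_builder(kind)); // precondition, checked in debug builds
    switch (kind) {
        case ProjectorKind::Gemma3:
            return "avg pool2d + RMS norm + transposed linear projection";
        case ProjectorKind::Idefics3:
            return "pixel shuffle + linear projection";
        case ProjectorKind::Lfm2:
            return "pixel unshuffle + LayerNorm + MLP with GELU";
        case ProjectorKind::JanusPro:
            return "direct MLP";
        default:
            std::abort(); // still fails loudly when assertions are compiled out
    }
}
```

Keeping the abort in the default branch means a newly added projector type fails loudly instead of silently producing unprojected embeddings.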

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/tools/mtmd/models/siglip.cpp
  • Lines: 1-81

Signature

struct clip_graph_siglip : public clip_graph {
    ggml_cgraph * build() override;
};

Import

#include "models.h"

I/O Contract

Inputs

Name | Type | Required | Description
model | clip_model & | Yes | Loaded SigLIP CLIP model with weights
img | clip_image_f32 & | Yes | Preprocessed float image (square for Gemma 3)
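When the input image yields a patch grid that differs from the grid the position embeddings were trained on, the learned embeddings must be resized; the Description notes that this is done by bilinear interpolation for LFM2. The following is a minimal sketch of that idea on a plain array. The function name, the row-major layout, and the align-corners sampling convention are assumptions made for illustration; ggml's interpolation may use a different convention.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: resize a (src_h x src_w) grid of dim-sized position
// embeddings to (dst_h x dst_w) using bilinear interpolation.
// Assumed layout: emb[(y * w + x) * dim + d].
std::vector<float> resize_pos_embd_bilinear(const std::vector<float> & src,
                                            int src_w, int src_h,
                                            int dst_w, int dst_h, int dim) {
    assert(src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0 && dim > 0);
    assert(src.size() == static_cast<std::size_t>(src_w) * src_h * dim);

    std::vector<float> dst(static_cast<std::size_t>(dst_w) * dst_h * dim);

    // Align-corners mapping: the first and last samples of each axis coincide.
    const float sx = dst_w > 1 ? float(src_w - 1) / float(dst_w - 1) : 0.0f;
    const float sy = dst_h > 1 ? float(src_h - 1) / float(dst_h - 1) : 0.0f;

    auto src_at = [&](int x, int y, int d) {
        return src[(static_cast<std::size_t>(y) * src_w + x) * dim + d];
    };

    for (int y = 0; y < dst_h; ++y) {
        const float fy = y * sy;
        const int   y0 = std::min(static_cast<int>(fy), src_h - 1);
        const int   y1 = std::min(y0 + 1, src_h - 1);
        const float wy = fy - static_cast<float>(y0);
        for (int x = 0; x < dst_w; ++x) {
            const float fx = x * sx;
            const int   x0 = std::min(static_cast<int>(fx), src_w - 1);
            const int   x1 = std::min(x0 + 1, src_w - 1);
            const float wx = fx - static_cast<float>(x0);
            for (int d = 0; d < dim; ++d) {
                // Interpolate along x on the two neighbouring rows, then along y.
                const float top = src_at(x0, y0, d) + wx * (src_at(x1, y0, d) - src_at(x0, y0, d));
                const float bot = src_at(x0, y1, d) + wx * (src_at(x1, y1, d) - src_at(x0, y1, d));
                dst[(static_cast<std::size_t>(y) * dst_w + x) * dim + d] = top + wy * (bot - top);
            }
        }
    }
    return dst;
}
```

Under this convention the four corner embeddings are copied unchanged, and interior positions are blends of their nearest trained neighbours.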

Outputs

Name | Type | Description
ggml_cgraph * | pointer | Computation graph producing projected embeddings
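The number of embeddings the graph produces follows from the image size, the patch size, and the projector's spatial reduction factor. The arithmetic can be sketched as below; GridSize, patch_grid, and output_tokens are hypothetical helpers, and the numbers in the usage note are illustrative assumptions rather than values taken from this page.

```cpp
#include <cassert>

// Hypothetical helper type: the patch grid produced by the ViT patch embedding.
struct GridSize {
    int width;
    int height;
};

// Number of patches along each axis for a given image and patch size.
GridSize patch_grid(int img_w, int img_h, int patch) {
    assert(patch > 0 && img_w % patch == 0 && img_h % patch == 0);
    return GridSize{img_w / patch, img_h / patch};
}

// Tokens left after a projector reduces each axis by `reduce`
// (the pooling kernel size or the pixel-shuffle scale factor).
int output_tokens(GridSize grid, int reduce) {
    assert(reduce > 0 && grid.width % reduce == 0 && grid.height % reduce == 0);
    return (grid.width / reduce) * (grid.height / reduce);
}
```

For example, assuming an 896x896 image, 14-pixel patches, and a 4x reduction per axis, patch_grid gives a 64x64 grid and output_tokens gives 256 embeddings.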

Usage Examples

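// Instantiated internally by clip.cpp for Gemma 3 and other SigLIP-based models.
// ctx and img are assumed to be the CLIP context and the preprocessed image
// already in scope at the call site.
clip_graph_siglip graph(ctx, img);
ggml_cgraph * gf = graph.build();
// For Gemma 3: output is pool2d-downsampled and projected
// For Idefics3: output is pixel-shuffled and projected
// For LFM2: output is pixel-unshuffled, normalized, and passed through an MLP
// For Janus Pro: output is passed directly through an MLP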
