Implementation:Ollama Ollama Mtmd SigLIP

Knowledge Sources	Ollama
Domains	Multimodal, VisionEncoder
Last Updated	2025-02-15 00:00 GMT

Overview

Multimodal graph builder for the SigLIP vision model, also serving as the adapter for Gemma 3, Idefics3, LFM2, and Janus Pro models.

Description

Implements clip_graph_siglip::build(), which constructs a ggml computation graph for the SigLIP vision encoder. The encoder is a standard ViT with learned position embeddings and LayerNorm; for LFM2 the position embeddings can be resized by bilinear interpolation. After the ViT, the graph branches on the projector type. Each branch maps the ViT output into the language model's embedding space, using a different spatial reduction strategy:

Gemma 3: avg pool2d downsampling, RMS norm with soft_emb_norm, then a transposed linear projection.
Idefics3: pixel shuffle via build_patch_merge_permute, then a linear projection.
LFM2: pixel unshuffle, LayerNorm, then an MLP with GELU.
Janus Pro: a direct MLP using the model's FFN op.
Any other projector type: the default case triggers an abort.
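
The average-pooling step described for the Gemma 3 branch can be pictured with a small standalone sketch. It does not use ggml and is not the code in siglip.cpp. The function name avg_pool_2d, the row-major layout, and the single float per grid cell are assumptions made for illustration; a real patch embedding carries a full feature vector per cell.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: average-pool a square, row-major grid of values.
// `side` is the grid width/height in patches; `kernel` is the pooling
// window (and stride). Both names are hypothetical, not from siglip.cpp.
std::vector<float> avg_pool_2d(const std::vector<float> & grid,
                               std::size_t side, std::size_t kernel) {
    assert(kernel > 0 && side % kernel == 0);
    assert(grid.size() == side * side);

    const std::size_t out_side = side / kernel;
    std::vector<float> out(out_side * out_side, 0.0f);

    for (std::size_t oy = 0; oy < out_side; ++oy) {
        for (std::size_t ox = 0; ox < out_side; ++ox) {
            float sum = 0.0f;
            // Sum the kernel x kernel window that maps to this output cell.
            for (std::size_t ky = 0; ky < kernel; ++ky) {
                for (std::size_t kx = 0; kx < kernel; ++kx) {
                    sum += grid[(oy * kernel + ky) * side + (ox * kernel + kx)];
                }
            }
            out[oy * out_side + ox] = sum / static_cast<float>(kernel * kernel);
        }
    }
    return out;
}
```

Pooling with window k reduces a side x side patch grid to (side / k) x (side / k) tokens, which is the spatial reduction this branch applies before projecting into the language model's embedding space.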

Usage

Automatically selected when the loaded CLIP model's projector type is PROJECTOR_TYPE_GEMMA3, PROJECTOR_TYPE_IDEFICS3, PROJECTOR_TYPE_LFM2, or PROJECTOR_TYPE_JANUS_PRO.
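
As a rough picture of this selection, the sketch below maps a projector kind to the post-ViT stage described on this page. The enums ProjectorKind and SiglipTail and both functions are hypothetical stand-ins written for this example; they are not the identifiers used in clip.cpp or siglip.cpp. Only the four projector names and the abort on the default case come from the page.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for the real projector-type enum.
enum class ProjectorKind { Gemma3, Idefics3, Lfm2, JanusPro, Other };

// Hypothetical labels for the post-ViT stages described on this page.
enum class SiglipTail {
    PoolAndProject,          // Gemma 3
    PixelShuffleAndProject,  // Idefics3
    PixelUnshuffleAndMlp,    // LFM2
    DirectMlp                // Janus Pro
};

// True when the SigLIP graph builder would be the one chosen.
bool uses_siglip_graph(ProjectorKind kind) {
    return kind == ProjectorKind::Gemma3 || kind == ProjectorKind::Idefics3 ||
           kind == ProjectorKind::Lfm2   || kind == ProjectorKind::JanusPro;
}

// Pick the projector stage; unsupported kinds abort, mirroring the
// default case described above.
SiglipTail select_tail(ProjectorKind kind) {
    switch (kind) {
        case ProjectorKind::Gemma3:   return SiglipTail::PoolAndProject;
        case ProjectorKind::Idefics3: return SiglipTail::PixelShuffleAndProject;
        case ProjectorKind::Lfm2:     return SiglipTail::PixelUnshuffleAndMlp;
        case ProjectorKind::JanusPro: return SiglipTail::DirectMlp;
        case ProjectorKind::Other:    break;
    }
    std::abort();
}
```

Checking uses_siglip_graph before calling select_tail keeps the abort path reserved for genuinely unsupported projector types.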

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/tools/mtmd/models/siglip.cpp
Lines: 1-81

Signature

struct clip_graph_siglip : public clip_graph {
    ggml_cgraph * build() override;
};

Import

#include "models.h"

I/O Contract

Inputs

Name	Type	Required	Description
model	clip_model &	Yes	Loaded SigLIP CLIP model with weights
img	clip_image_f32 &	Yes	Preprocessed float image (square for Gemma 3)

Outputs

Name	Type	Description
(return value)	ggml_cgraph *	Computation graph producing projected embeddings

Usage Examples

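// Instantiated internally by clip.cpp for Gemma 3 and the other SigLIP-based models
clip_graph_siglip graph(ctx, img);
ggml_cgraph * gf = graph.build();
// For Gemma 3: output is pool2d-downsampled and projected
// For Idefics3: output is pixel-shuffled and projected
// For LFM2: output is pixel-unshuffled, normalized, and passed through an MLP
// For Janus Pro: output is passed directly through an MLP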
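
The pixel shuffle and unshuffle steps mentioned for Idefics3 and LFM2 trade spatial resolution for channel width. The sketch below shows that idea on plain vectors. It is not the ggml implementation, and the channel ordering produced by build_patch_merge_permute may differ. The names Token and pixel_unshuffle, and the row-major token layout, are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical type: one patch embedding.
using Token = std::vector<float>;

// Illustrative only: merge each scale x scale block of neighbouring patch
// tokens into one wider token by concatenating their features.
// Input:  side * side tokens (row-major), each of dimension d.
// Output: (side / scale)^2 tokens, each of dimension d * scale * scale.
std::vector<Token> pixel_unshuffle(const std::vector<Token> & tokens,
                                   std::size_t side, std::size_t scale) {
    assert(scale > 0 && side % scale == 0);
    assert(tokens.size() == side * side);

    const std::size_t out_side = side / scale;
    std::vector<Token> out;
    out.reserve(out_side * out_side);

    for (std::size_t oy = 0; oy < out_side; ++oy) {
        for (std::size_t ox = 0; ox < out_side; ++ox) {
            Token merged;
            // Concatenate the block's tokens in row-major order.
            for (std::size_t dy = 0; dy < scale; ++dy) {
                for (std::size_t dx = 0; dx < scale; ++dx) {
                    const Token & src =
                        tokens[(oy * scale + dy) * side + (ox * scale + dx)];
                    merged.insert(merged.end(), src.begin(), src.end());
                }
            }
            out.push_back(std::move(merged));
        }
    }
    return out;
}
```

With scale s, the token count drops by a factor of s * s while the per-token dimension grows by the same factor, so no feature values are discarded before the projection, in contrast to average pooling.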

Related Pages

Principle:Ollama_Ollama_MultimodalPipeline
