Implementation:Ollama Ollama Mtmd SigLIP
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, VisionEncoder |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Multimodal graph builder for the SigLIP vision model, also serving as the adapter for Gemma 3, Idefics3, LFM2, and Janus Pro models.
Description
Implements clip_graph_siglip::build(), which constructs a ggml computation graph for the SigLIP vision encoder. The encoder is a standard ViT with LayerNorm and learned position embeddings; for LFM2 the position embeddings are resized via bilinear interpolation. After the ViT, the graph applies one of several projectors, each mapping the ViT output to the language model's embedding space with a different spatial reduction strategy:
- Gemma 3: average pool2d downsampling, RMS norm, soft_emb_norm, and a transposed linear projection.
- Idefics3: pixel shuffle via build_patch_merge_permute, then a linear projection.
- LFM2: pixel unshuffle, LayerNorm, then an MLP with GELU.
- Janus Pro: a direct MLP using the model's FFN op.
- Any other projector type: the build aborts.
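The average-pool downsampling used on the Gemma 3 path can be illustrated outside ggml. The following is a minimal standalone sketch, not the code in siglip.cpp: the helper name avg_pool_tokens, the row-major [side * side][n_embd] layout, and the choice of kernel equal to stride with no padding are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of siglip.cpp): average-pools a square grid
// of patch tokens. Assumed layout: row-major [side * side][n_embd].
// Assumed pooling: kernel == stride == k, no padding.
std::vector<float> avg_pool_tokens(const std::vector<float> & tokens,
                                   int side, int n_embd, int k) {
    assert(k > 0 && n_embd > 0 && side > 0 && side % k == 0);
    assert(tokens.size() == static_cast<std::size_t>(side) * side * n_embd);

    const int out_side = side / k;
    std::vector<float> out(static_cast<std::size_t>(out_side) * out_side * n_embd, 0.0f);
    const float inv = 1.0f / static_cast<float>(k * k);

    for (int oy = 0; oy < out_side; ++oy) {
        for (int ox = 0; ox < out_side; ++ox) {
            float * dst = &out[(static_cast<std::size_t>(oy) * out_side + ox) * n_embd];
            // Sum the k x k block of input tokens that maps to this output token.
            for (int dy = 0; dy < k; ++dy) {
                for (int dx = 0; dx < k; ++dx) {
                    const std::size_t y = static_cast<std::size_t>(oy) * k + dy;
                    const std::size_t x = static_cast<std::size_t>(ox) * k + dx;
                    const float * src = &tokens[(y * side + x) * n_embd];
                    for (int e = 0; e < n_embd; ++e) {
                        dst[e] += src[e];
                    }
                }
            }
            // Turn the sum into a mean.
            for (int e = 0; e < n_embd; ++e) {
                dst[e] *= inv;
            }
        }
    }
    return out;
}
```

Under these assumptions the token count shrinks by a factor of k * k while the embedding width is unchanged; a 64 x 64 grid pooled with k = 4, for example, leaves 16 x 16 = 256 tokens.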
Usage
Automatically selected when the loaded CLIP model uses one of the following projector types: PROJECTOR_TYPE_GEMMA3, PROJECTOR_TYPE_IDEFICS3, PROJECTOR_TYPE_LFM2, or PROJECTOR_TYPE_JANUS_PRO.
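The selection rule can be written down as a small predicate. This is only a sketch: the enum below is a hypothetical stand-in that reuses the four projector names listed above plus a placeholder PROJECTOR_TYPE_OTHER, and the helpers uses_siglip_graph and require_siglip_projector are invented for illustration. The real enum and dispatch code in the repository may look different.

```cpp
#include <cassert>

// Hypothetical stand-in for the real projector enum; only the four names
// below come from this page. PROJECTOR_TYPE_OTHER is a placeholder for
// every projector that is not handled by the SigLIP graph.
enum projector_type {
    PROJECTOR_TYPE_GEMMA3,
    PROJECTOR_TYPE_IDEFICS3,
    PROJECTOR_TYPE_LFM2,
    PROJECTOR_TYPE_JANUS_PRO,
    PROJECTOR_TYPE_OTHER,
};

// Hypothetical helper: true when the SigLIP graph builder would be selected.
inline bool uses_siglip_graph(projector_type t) {
    return t == PROJECTOR_TYPE_GEMMA3
        || t == PROJECTOR_TYPE_IDEFICS3
        || t == PROJECTOR_TYPE_LFM2
        || t == PROJECTOR_TYPE_JANUS_PRO;
}

// Hypothetical helper: mimics the "unsupported projector aborts" behavior
// described for the default branch, using a plain assert.
inline void require_siglip_projector(projector_type t) {
    assert(uses_siglip_graph(t) && "SigLIP: unsupported projector type");
}
```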
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/models/siglip.cpp
- Lines: 1-81
Signature
struct clip_graph_siglip : public clip_graph {
    ggml_cgraph * build() override;
};
Import
#include "models.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | clip_model & | Yes | Loaded SigLIP CLIP model with weights |
| img | clip_image_f32 & | Yes | Preprocessed float image (square for Gemma 3) |
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | ggml_cgraph * | Pointer to the computation graph producing projected embeddings |
Usage Examples
// Instantiated internally by clip.cpp for Gemma 3 and other SigLIP models
clip_graph_siglip graph(ctx, img);
ggml_cgraph * gf = graph.build();
// For Gemma 3: output is pool2d-downsampled and projected
// For Idefics3: output is pixel-shuffled and projected
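To make the "pixel-shuffled" comment concrete, here is a minimal standalone sketch of merging each s x s block of patch tokens into one wider token. It is not the implementation of build_patch_merge_permute: the helper name merge_patches, the row-major [h * w][c] layout, and the order in which a block's tokens are concatenated are assumptions, and the real ggml permutation may order channels differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of siglip.cpp): merges each s x s block of
// patch tokens into a single token that is s * s times wider.
// Assumed input layout:  row-major [h * w][c].
// Assumed output layout: row-major [(h / s) * (w / s)][c * s * s].
std::vector<float> merge_patches(const std::vector<float> & tokens,
                                 int h, int w, int c, int s) {
    assert(s > 0 && c > 0 && h % s == 0 && w % s == 0);
    assert(tokens.size() == static_cast<std::size_t>(h) * w * c);

    const int oh = h / s;
    const int ow = w / s;
    std::vector<float> out;
    out.reserve(tokens.size());

    for (int oy = 0; oy < oh; ++oy) {
        for (int ox = 0; ox < ow; ++ox) {
            // Concatenate the block's tokens in row-major order
            // (assumed ordering, chosen for readability).
            for (int dy = 0; dy < s; ++dy) {
                for (int dx = 0; dx < s; ++dx) {
                    const std::size_t y = static_cast<std::size_t>(oy) * s + dy;
                    const std::size_t x = static_cast<std::size_t>(ox) * s + dx;
                    const std::size_t base = (y * w + x) * c;
                    for (int e = 0; e < c; ++e) {
                        out.push_back(tokens[base + e]);
                    }
                }
            }
        }
    }
    return out;
}
```

In this sketch the total number of values is preserved: the token count drops by a factor of s * s and the per-token width grows by the same factor, so the linear projection that follows sees fewer, wider tokens.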