Implementation:Ollama Ollama Imagegen Gemma3 Vision
| Knowledge Sources | |
|---|---|
| Domains | Image Generation, Vision |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the SigLIP vision tower for Gemma 3 multimodal models, encoding images into patch embeddings via a transformer encoder.
Description
The vision.go file implements the SigLIP vision encoder that Gemma 3 uses for multimodal inference. The VisionTower struct holds the patch and position embeddings (VisionEmbeddings), a stack of VisionEncoderLayer transformer blocks, and a final layer normalization (PostLayerNorm). VisionEmbeddings embeds patches with a conv2d whose stride equals patch_size and which uses no padding, then adds the positional embeddings. Each VisionEncoderLayer applies pre-norm self-attention followed by a pre-norm MLP with GELU activation; VisionAttention uses separate Q/K/V projections and scaled dot-product attention with no causal mask, so every patch attends to every other patch. The output has shape [B, num_patches, hidden_size] and is passed to the text model through a multimodal projector.
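The non-causal scaled dot-product attention described above can be sketched on plain slices. This is an illustration only, not the code in vision.go: it handles a single head, uses float64 slices instead of mlx arrays, and assumes the conventional 1/sqrt(head_dim) scale. The function names `softmax` and `attention` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// softmax returns a numerically stable softmax of xs.
func softmax(xs []float64) []float64 {
	maxv := math.Inf(-1)
	for _, x := range xs {
		if x > maxv {
			maxv = x
		}
	}
	out := make([]float64, len(xs))
	var sum float64
	for i, x := range xs {
		out[i] = math.Exp(x - maxv)
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

// attention computes softmax(Q K^T / sqrt(d)) V for one head.
// q, k, v are [numPatches][headDim]. No causal mask is applied,
// so every patch attends to every other patch.
// Assumption: the scale is 1/sqrt(headDim), the usual convention.
func attention(q, k, v [][]float64) [][]float64 {
	d := len(q[0])
	scale := 1.0 / math.Sqrt(float64(d))
	out := make([][]float64, len(q))
	for i := range q {
		// Score patch i against every patch j.
		scores := make([]float64, len(k))
		for j := range k {
			var dot float64
			for t := 0; t < d; t++ {
				dot += q[i][t] * k[j][t]
			}
			scores[j] = dot * scale
		}
		w := softmax(scores)
		// Weighted sum of value rows.
		out[i] = make([]float64, len(v[0]))
		for j := range v {
			for t := range v[j] {
				out[i][t] += w[j] * v[j][t]
			}
		}
	}
	return out
}

func main() {
	q := [][]float64{{1, 0}, {0, 1}}
	k := [][]float64{{1, 0}, {0, 1}}
	v := [][]float64{{1, 2}, {3, 4}}
	fmt.Println(attention(q, k, v))
}
```

Because there is no mask, the score matrix is dense: the cost grows with the square of the patch count, which is why the patch size matters for vision towers of this kind.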
Usage
Used by Gemma 3 multimodal models when processing image inputs alongside text in the MLX engine.
Code Reference
Source Location
- Repository: Ollama
- File: x/imagegen/models/gemma3/vision.go
- Lines: 1-138
Signature
type VisionConfig struct {
	HiddenSize        int32 `json:"hidden_size"`
	ImageSize         int32 `json:"image_size"`
	IntermediateSize  int32 `json:"intermediate_size"`
	NumAttentionHeads int32 `json:"num_attention_heads"`
	NumHiddenLayers   int32 `json:"num_hidden_layers"`
	PatchSize         int32 `json:"patch_size"`
}
type VisionTower struct {
	Embeddings    *VisionEmbeddings
	Encoder       []*VisionEncoderLayer
	PostLayerNorm *nn.LayerNorm
	Config        *VisionConfig
}
func (v *VisionTower) Forward(x *mlx.Array) *mlx.Array
Import
import "github.com/ollama/ollama/x/imagegen/models/gemma3"
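The JSON tags on VisionConfig suggest the struct is filled from a model configuration file. A minimal sketch of that decoding step follows, assuming the standard library's encoding/json is used. The struct is a local copy so the example compiles on its own, `parseVisionConfig` is a hypothetical helper, and the numeric values are illustrative rather than read from the source file.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VisionConfig is a local copy of the struct shown in the signature,
// repeated here only so the sketch is self-contained.
type VisionConfig struct {
	HiddenSize        int32 `json:"hidden_size"`
	ImageSize         int32 `json:"image_size"`
	IntermediateSize  int32 `json:"intermediate_size"`
	NumAttentionHeads int32 `json:"num_attention_heads"`
	NumHiddenLayers   int32 `json:"num_hidden_layers"`
	PatchSize         int32 `json:"patch_size"`
}

// parseVisionConfig is a hypothetical helper that decodes a JSON
// document into a VisionConfig. Keys missing from the input leave
// the corresponding fields at zero.
func parseVisionConfig(data []byte) (*VisionConfig, error) {
	var cfg VisionConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	// Illustrative values only.
	raw := []byte(`{
		"hidden_size": 1152,
		"image_size": 896,
		"intermediate_size": 4304,
		"num_attention_heads": 16,
		"num_hidden_layers": 27,
		"patch_size": 14
	}`)
	cfg, err := parseVisionConfig(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%+v\n", *cfg)
}
```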
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| x | *mlx.Array | Yes | Preprocessed image tensor [B, H, W, C] in NHWC format |
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | *mlx.Array | Patch embeddings [B, num_patches, hidden_size] |
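The num_patches dimension follows from the patch embedding: with a stride equal to patch_size and no padding, a square image yields (image_size / patch_size) patches per side. A small sketch of that arithmetic, where `numPatches` is a hypothetical helper and the image is assumed square and evenly divisible by the patch size:

```go
package main

import "fmt"

// numPatches returns the patch count for a square image, assuming the
// patch convolution uses stride == patchSize and no padding.
// Integer division drops any remainder pixels, matching an unpadded conv.
func numPatches(imageSize, patchSize int32) int32 {
	side := imageSize / patchSize
	return side * side
}

func main() {
	fmt.Println(numPatches(224, 16)) // 14 * 14 patches
	fmt.Println(numPatches(896, 14)) // 64 * 64 patches
}
```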
Usage Examples
visionTower := &gemma3.VisionTower{Config: visionCfg}
// ... load weights ...
// Process image, e.g. [B, 224, 224, 3] -> [B, 196, 1152] when patch_size is 16 and hidden_size is 1152
imageEmbeddings := visionTower.Forward(normalizedImage)