Implementation:Ollama Ollama Imagegen Gemma3 Vision

Knowledge Sources	Ollama
Domains	Image Generation, Vision
Last Updated	2025-02-15 00:00 GMT

Overview

Implements the SigLIP vision tower for Gemma 3 multimodal models, encoding images into patch embeddings via a transformer encoder.

Description

The vision.go file implements the SigLIP vision encoder that Gemma 3 uses for multimodal inference. The VisionTower struct holds the patch and position embeddings (VisionEmbeddings), a stack of VisionEncoderLayer transformer blocks, and a final layer normalization (PostLayerNorm). VisionEmbeddings embeds patches with a conv2d whose stride equals patch_size and which uses no padding, then adds the positional embeddings. Each VisionEncoderLayer applies pre-norm self-attention (VisionAttention: Q/K/V projections followed by scaled dot-product attention with no causal mask, so every patch can attend to every other patch) and a pre-norm MLP with GELU activation. Forward returns a tensor of shape [B, num_patches, hidden_size], which is passed to the text model through a multimodal projector.

Usage

Used by Gemma 3 multimodal models when processing image inputs alongside text in the MLX engine.

Code Reference

Source Location

Repository: Ollama
File: x/imagegen/models/gemma3/vision.go
Lines: 1-138

Signature

type VisionConfig struct {
	HiddenSize        int32 `json:"hidden_size"`
	ImageSize         int32 `json:"image_size"`
	IntermediateSize  int32 `json:"intermediate_size"`
	NumAttentionHeads int32 `json:"num_attention_heads"`
	NumHiddenLayers   int32 `json:"num_hidden_layers"`
	PatchSize         int32 `json:"patch_size"`
}

type VisionTower struct {
	Embeddings    *VisionEmbeddings
	Encoder       []*VisionEncoderLayer
	PostLayerNorm *nn.LayerNorm
	Config        *VisionConfig
}

func (v *VisionTower) Forward(x *mlx.Array) *mlx.Array
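
The json tags on VisionConfig suggest it is decoded from a model configuration file. A minimal sketch of that decoding with the standard library follows; the helper parseVisionConfig and the sample values are assumptions for illustration and are not taken from the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VisionConfig mirrors the struct shown in the signature above.
type VisionConfig struct {
	HiddenSize        int32 `json:"hidden_size"`
	ImageSize         int32 `json:"image_size"`
	IntermediateSize  int32 `json:"intermediate_size"`
	NumAttentionHeads int32 `json:"num_attention_heads"`
	NumHiddenLayers   int32 `json:"num_hidden_layers"`
	PatchSize         int32 `json:"patch_size"`
}

// parseVisionConfig is a hypothetical helper that decodes a JSON blob
// into a VisionConfig. Fields missing from the JSON stay at zero.
func parseVisionConfig(data []byte) (*VisionConfig, error) {
	var cfg VisionConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("decode vision config: %w", err)
	}
	return &cfg, nil
}

func main() {
	// Illustrative values only; check the model's own config file.
	raw := []byte(`{
		"hidden_size": 1152,
		"image_size": 896,
		"intermediate_size": 4304,
		"num_attention_heads": 16,
		"num_hidden_layers": 27,
		"patch_size": 14
	}`)
	cfg, err := parseVisionConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *cfg)
}
```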

Import

import "github.com/ollama/ollama/x/imagegen/models/gemma3"

I/O Contract

Inputs

Name	Type	Required	Description
x	*mlx.Array	Yes	Preprocessed image tensor [B, H, W, C] in NHWC format

Outputs

Name	Type	Description
(return value)	*mlx.Array	Patch embeddings [B, num_patches, hidden_size]
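
The output shape follows from the patch embedding: with stride equal to patch_size and no padding, each image side yields image_size / patch_size patches (integer division). The sketch below works through that arithmetic. It assumes square inputs, and both the helper outputShape and the trimmed-down config struct are hypothetical, introduced only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// VisionConfig here keeps only the fields needed for the shape arithmetic.
type VisionConfig struct {
	HiddenSize int32
	ImageSize  int32
	PatchSize  int32
}

// outputShape returns [B, num_patches, hidden_size] for a square image,
// assuming a conv2d patch embedding with stride == patch size and no padding.
func outputShape(batch int32, cfg VisionConfig) ([3]int32, error) {
	if cfg.PatchSize <= 0 || cfg.ImageSize < cfg.PatchSize {
		return [3]int32{}, errors.New("patch size must be positive and no larger than image size")
	}
	side := cfg.ImageSize / cfg.PatchSize // patches per side (floor)
	return [3]int32{batch, side * side, cfg.HiddenSize}, nil
}

func main() {
	// Illustrative values: 896 / 14 = 64 patches per side, 64 * 64 = 4096.
	cfg := VisionConfig{HiddenSize: 1152, ImageSize: 896, PatchSize: 14}
	shape, err := outputShape(1, cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(shape) // prints [1 4096 1152]
}
```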

Usage Examples

visionTower := &gemma3.VisionTower{Config: visionCfg}
// ... load weights ...

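// Process image: [B, H, W, 3] -> [B, (H/patch_size)*(W/patch_size), hidden_size]
// e.g. with image_size=896, patch_size=14, hidden_size=1152: [B, 896, 896, 3] -> [B, 4096, 1152]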
imageEmbeddings := visionTower.Forward(normalizedImage)

Related Pages

Principle:Ollama_Ollama_ImageGeneration
