Principle:Ollama Ollama MultimodalPipeline
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Vision-Language |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
The Multimodal Pipeline lets Ollama process inputs that combine text with other modalities, such as images and audio. Non-text data is routed through specialized encoders, and the resulting features are projected into the language model's embedding space so that a single model can generate from the combined input.
Core Concepts
Vision Encoding
Image inputs are processed through a vision encoder, typically a CLIP (Contrastive Language-Image Pretraining) or SigLIP model. The encoder converts the preprocessed pixel data into a sequence of feature vectors that capture the semantic content of the image. These feature vectors are analogous to token embeddings in the text domain, so the language model can attend to visual information alongside textual tokens.
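To illustrate how an image becomes a sequence, the sketch below assumes a ViT-style encoder that cuts the image into fixed-size square patches, one feature vector per patch. The function name patchify and the toy 4x4 single-channel image are hypothetical; a real encoder would follow this step with a learned patch embedding and transformer layers, which are omitted here.

```python
from typing import List

Grid = List[List[int]]  # toy single-channel image: rows of pixel values


def patchify(img: Grid, patch: int) -> List[List[int]]:
    """Split an image into non-overlapping patch x patch tiles.

    Each tile is flattened row by row, and tiles are emitted in row-major
    order, so the result is a sequence with one entry per patch.
    """
    h, w = len(img), len(img[0])
    if h % patch or w % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append(
                [img[y][x]
                 for y in range(top, top + patch)
                 for x in range(left, left + patch)]
            )
    return patches


# A 4x4 image with patch size 2 yields (4 / 2) * (4 / 2) = 4 patches.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = patchify(toy_image, 2)
```

The sequence length grows with the square of the resolution, which is one reason the number of vision tokens matters for context usage.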
Image Preprocessing
Before vision encoding, images undergo preprocessing steps including resizing, center cropping, normalization, and optional tiling for high-resolution inputs. The preprocessing pipeline must match the exact configuration used during the vision encoder's training to produce valid feature representations. Architecture-specific preprocessors handle different input requirements (e.g., Qwen 2.5 VL uses dynamic resolution with aspect-ratio-preserving tiling).
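The resize, center-crop, and normalize steps can be sketched in plain Python as follows. This is a minimal illustration, not Ollama's implementation: the function names, the nearest-neighbor resize, and the MEAN and STD constants are all assumptions, and a real preprocessor must use the interpolation method and statistics that the encoder was trained with.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of (R, G, B) values in 0..255

# Hypothetical normalization constants (assumption, not taken from any model).
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)


def resize_nearest(img: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbor resize; real pipelines usually use bilinear or bicubic."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def center_crop(img: Image, size: int) -> Image:
    """Cut a size x size square out of the middle of the image."""
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def normalize(img: Image) -> List[List[Tuple[float, ...]]]:
    """Scale each channel to [0, 1], then subtract the mean and divide by the std."""
    return [
        [tuple((c / 255.0 - m) / s for c, m, s in zip(px, MEAN, STD)) for px in row]
        for row in img
    ]


def preprocess(img: Image, size: int) -> List[List[Tuple[float, ...]]]:
    """Resize the shorter side to `size`, center-crop to a square, normalize."""
    h, w = len(img), len(img[0])
    scale = size / min(h, w)
    resized = resize_nearest(
        img, max(size, round(h * scale)), max(size, round(w * scale))
    )
    return normalize(center_crop(resized, size))


# A 4x8 all-white image becomes a 2x2 grid of normalized pixels.
white = [[(255, 255, 255)] * 8 for _ in range(4)]
out = preprocess(white, 2)
```

Tiling for high-resolution inputs is left out of this sketch; it would apply the same steps to each tile separately.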
Projection Layer
The vision encoder's output dimensionality typically differs from the language model's hidden dimension. A projection layer (often a linear transformation or small MLP) maps vision features to the language model's embedding space. This projection is trained alongside the rest of the model to align visual and textual representations, allowing the language model to process vision tokens as if they were text embeddings.
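As a rough illustration of the simplest case, a single linear projection computes y = W x + b for every vision feature vector. The names and the toy dimensions below (vision dimension 2, language-model dimension 3) are hypothetical, and the weights are hand-picked rather than trained.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # shape: (out_dim, in_dim)


def project(features: List[Vector], weight: Matrix, bias: Vector) -> List[Vector]:
    """Apply y = W x + b to each feature vector independently."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]


# Two vision tokens of dimension 2, mapped to a hidden dimension of 3.
vision_features = [[1.0, 2.0], [0.0, -1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
projected = project(vision_features, weight, bias)
```

The number of tokens is unchanged by the projection; only the width of each vector changes to match the language model's hidden dimension. An MLP projector would stack two such layers with a nonlinearity in between.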
Multimodal Input Interleaving
The pipeline supports interleaving image and text tokens within a single sequence. Special image placeholder tokens in the text input mark where vision features should be inserted. During prompt construction, these placeholders are replaced with the projected vision features, creating a unified sequence that the language model processes in a single forward pass.
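The placeholder substitution can be sketched as below. The marker string "<image>", the interleave function, and the dictionary-based text embedding lookup are hypothetical stand-ins; each model family defines its own placeholder tokens and embedding tables.

```python
from typing import Dict, List

IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker (assumption)

Embedding = List[float]


def interleave(
    tokens: List[str],
    text_embed: Dict[str, Embedding],
    image_features: List[List[Embedding]],
) -> List[Embedding]:
    """Build one embedding sequence from text tokens and projected images.

    Each placeholder is replaced, in order, by all feature vectors of the
    next image; every other token is looked up in the text embedding table.
    """
    images = iter(image_features)
    sequence: List[Embedding] = []
    for tok in tokens:
        if tok == IMAGE_PLACEHOLDER:
            try:
                sequence.extend(next(images))
            except StopIteration:
                raise ValueError("more placeholders than images") from None
        else:
            sequence.append(text_embed[tok])
    if next(images, None) is not None:
        raise ValueError("more images than placeholders")
    return sequence


tokens = ["describe", IMAGE_PLACEHOLDER, "briefly"]
text_embed = {"describe": [0.1], "briefly": [0.2]}
image_features = [[[1.0], [2.0]]]  # one image, two projected vision tokens
unified = interleave(tokens, text_embed, image_features)
```

A single placeholder expands into as many positions as the image has vision tokens, so the unified sequence is usually much longer than the original token list.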
Audio Processing
For models that support audio input (e.g., Qwen-Audio), a parallel audio encoding pipeline extracts mel-spectrogram features from the waveform and passes them through an audio encoder. The resulting features are projected and interleaved with text tokens using the same mechanism as image features.
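Two quantities determine the shape of a mel-spectrogram: the number of analysis frames and the mel-scale frequency mapping. The sketch below shows both, using the common HTK-style mel formula. The parameter values (16 kHz audio, 400-sample window, 160-sample hop) are assumptions chosen because they are common in speech models, not values taken from Ollama, and the function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def num_frames(n_samples: int, frame_len: int, hop: int) -> int:
    """Number of full analysis windows that fit without padding."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop


# One second of 16 kHz audio with a 25 ms window and a 10 ms hop (assumed values).
frames_per_second = num_frames(16000, 400, 160)
```

Each frame becomes one column of the spectrogram, and the mel mapping decides how the frequency bins in that column are grouped into bands before the audio encoder sees them.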
Implementation Notes
Image preprocessing logic resides in model/imageproc/, which contains architecture-specific processing functions. The CLIP and SigLIP encoder implementations are integrated into each vision-language model under model/models/ (e.g., model/models/mllama/ for Meta's Llama vision models and model/models/gemma3/ for Gemma 3's vision capabilities). Input construction and multimodal token interleaving are handled in model/input/. Each model family also has its own multimodal rendering logic, which determines how images are embedded into the token stream.