Principle: ggml Vision Encoder Execution (ggml-org)
Summary
Processing images through a vision encoder network to produce dense feature representations. A vision encoder takes raw pixel input and transforms it into compact, semantically rich embeddings that downstream tasks (segmentation, detection, classification) can consume.
Theory
Vision Transformer (ViT)
The Vision Transformer (ViT) architecture splits an image into fixed-size patches, projects each patch to an embedding vector, and processes the resulting sequence through transformer blocks with self-attention.
ViT Architecture Pipeline:
- Patch Embedding -- A conv2d projection splits the image into non-overlapping patches (e.g., 16x16 pixels) and linearly projects each patch into an embedding dimension.
- Positional Encoding -- Learned or sinusoidal positional encodings are added to preserve spatial information.
- N Transformer Blocks -- Each block follows the structure:
- LayerNorm -> Multi-Head Self-Attention (MHSA) -> Residual Connection
- LayerNorm -> MLP (Feed-Forward Network) -> Residual Connection
- Neck -- Convolutional projections (e.g., 1x1 conv followed by 3x3 conv) reduce dimensionality and refine features for downstream use.
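The pipeline above can be sketched end to end in plain NumPy. This is a minimal illustration, not a real implementation: all sizes are toy values, the weights are random, attention is single-head, and the neck is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# toy sizes (hypothetical): 64x64 RGB image, 16x16 patches, embed dim 32
img, patch, dim = 64, 16, 32
n = (img // patch) ** 2               # 16 patches
x = rng.standard_normal((img, img, 3))

# patch embedding: split into non-overlapping patches, flatten, project
# (a strided conv2d with kernel = stride = patch size is equivalent)
patches = x.reshape(img // patch, patch, img // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n, patch * patch * 3)
W_embed = rng.standard_normal((patch * patch * 3, dim)) * 0.02
tokens = patches @ W_embed            # (n, dim)

# learned positional encoding (randomly initialised here)
tokens = tokens + rng.standard_normal((n, dim)) * 0.02

# one transformer block:
#   LayerNorm -> self-attention -> residual
#   LayerNorm -> MLP            -> residual
Wq, Wk, Wv, Wo = (rng.standard_normal((dim, dim)) * 0.02 for _ in range(4))
W1 = rng.standard_normal((dim, 4 * dim)) * 0.02
W2 = rng.standard_normal((4 * dim, dim)) * 0.02

h = layer_norm(tokens)
q, k, v = h @ Wq, h @ Wk, h @ Wv
attn = softmax(q @ k.T / np.sqrt(dim)) @ v
tokens = tokens + attn @ Wo                     # residual 1

h = layer_norm(tokens)
tokens = tokens + np.maximum(h @ W1, 0) @ W2    # ReLU MLP, residual 2

print(tokens.shape)  # (16, 32): one embedding per patch
```

A real encoder stacks N such blocks and uses multi-head attention, but the data flow is the same.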
Alternative: CNN Backbone
Convolutional Neural Network (CNN) backbones follow a different paradigm:
- conv2d -> batch normalization -> activation -> pooling, repeated across multiple stages
- Produces hierarchical feature maps at decreasing spatial resolutions
- Each stage captures progressively more abstract features
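The staged conv -> norm -> activation -> pool pattern can be sketched the same way. Again a toy NumPy illustration with random weights; the batch-norm here just normalises per channel from the current input rather than using running statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    # naive 3x3 "same" convolution; x: (H, W, Cin), w: (3, 3, Cin, Cout)
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(xp[i:i + 3, j:j + 3], w, axes=3)
    return out

def batch_norm(x, eps=1e-5):
    # per-channel normalisation (simplified, stats from this input only)
    mu = x.mean((0, 1), keepdims=True)
    var = x.var((0, 1), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def max_pool(x):
    # 2x2 max pooling halves the spatial resolution
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max((1, 3))

def stage(x, c_out):
    w = rng.standard_normal((3, 3, x.shape[-1], c_out)) * 0.1
    return max_pool(np.maximum(batch_norm(conv2d(x, w)), 0))

x = rng.standard_normal((32, 32, 3))   # toy 32x32 RGB input
feats = []
for c in (8, 16, 32):                  # three stages, widening channels
    x = stage(x, c)
    feats.append(x)

print([f.shape for f in feats])  # [(16, 16, 8), (8, 8, 16), (4, 4, 32)]
```

Each stage halves the spatial resolution while widening the channel dimension, which is exactly the hierarchical feature-map pyramid described above.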
Output
The vision encoder produces a dense feature map / embeddings that encode both spatial and semantic information about the input image.
Examples of encoder outputs:
- SAM (Segment Anything Model) -- Produces 64x64x256 image embeddings from a 1024x1024 input image
- YOLO (You Only Look Once) -- Produces multi-scale feature maps at resolutions such as 13x13 and 26x26 for object detection at different scales
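The output shapes in both examples follow directly from the input resolution and the encoder's downsampling factor, which the arithmetic below makes explicit (the 416 YOLO input size is one common configuration, used here for illustration):

```python
# SAM: 1024x1024 input, 16x16 patches, 256-dim embeddings after the neck
img, patch, dim = 1024, 16, 256
sam_shape = (img // patch, img // patch, dim)
print(sam_shape)  # (64, 64, 256)

# YOLO: multi-scale heads at strides 32 and 16 for a 416x416 input
yolo_in = 416
yolo_grids = [yolo_in // stride for stride in (32, 16)]
print(yolo_grids)  # [13, 26]
```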