Principle: ggml Vision Encoder Execution (ggml-org)
Summary
Processing images through a vision encoder network to produce dense feature representations. A vision encoder takes raw pixel input and transforms it into compact, semantically rich embeddings that downstream tasks (segmentation, detection, classification) can consume.
Theory
Vision Transformer (ViT)
The Vision Transformer (ViT) architecture splits an image into fixed-size patches, projects each patch to an embedding vector, and processes the resulting sequence through transformer blocks with self-attention.
ViT Architecture Pipeline:
- Patch Embedding -- A conv2d projection splits the image into non-overlapping patches (e.g., 16x16 pixels) and linearly projects each patch into an embedding dimension.
- Positional Encoding -- Learned or sinusoidal positional encodings are added to preserve spatial information.
- N Transformer Blocks -- Each block follows the structure:
- LayerNorm -> Multi-Head Self-Attention (MHSA) -> Residual Connection
- LayerNorm -> MLP (Feed-Forward Network) -> Residual Connection
- Neck -- Convolutional projections (e.g., 1x1 conv followed by 3x3 conv) reduce dimensionality and refine features for downstream use.
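The pipeline above can be sketched end to end in plain NumPy. This is a minimal illustration, not a real implementation: all sizes are toy values, the weights are random, attention is single-head, and the neck is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# toy sizes (hypothetical): 64x64 RGB image, 16x16 patches, embed dim 32
img, patch, dim = 64, 16, 32
n = (img // patch) ** 2               # 16 patches
x = rng.standard_normal((img, img, 3))

# patch embedding: split into non-overlapping patches, flatten, project
# (a strided conv2d with kernel = stride = patch size is equivalent)
patches = x.reshape(img // patch, patch, img // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n, patch * patch * 3)
W_embed = rng.standard_normal((patch * patch * 3, dim)) * 0.02
tokens = patches @ W_embed            # (n, dim)

# learned positional encoding (randomly initialised here)
tokens = tokens + rng.standard_normal((n, dim)) * 0.02

# one transformer block:
#   LayerNorm -> self-attention -> residual
#   LayerNorm -> MLP            -> residual
Wq, Wk, Wv, Wo = (rng.standard_normal((dim, dim)) * 0.02 for _ in range(4))
W1 = rng.standard_normal((dim, 4 * dim)) * 0.02
W2 = rng.standard_normal((4 * dim, dim)) * 0.02

h = layer_norm(tokens)
q, k, v = h @ Wq, h @ Wk, h @ Wv
attn = softmax(q @ k.T / np.sqrt(dim)) @ v
tokens = tokens + attn @ Wo                     # residual 1

h = layer_norm(tokens)
tokens = tokens + np.maximum(h @ W1, 0) @ W2    # ReLU MLP, residual 2

print(tokens.shape)  # (16, 32): one embedding per patch
```

A real encoder stacks N such blocks and uses multi-head attention, but the data flow is the same.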
Alternative: CNN Backbone
Convolutional Neural Network (CNN) backbones follow a different paradigm:
- conv2d -> batch normalization -> activation -> pooling, repeated across multiple stages
- Produces hierarchical feature maps at decreasing spatial resolutions
- Each stage captures progressively more abstract features
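The staged conv -> norm -> activation -> pool pattern can be sketched the same way. Again a toy NumPy illustration with random weights; the batch-norm here just normalises per channel from the current input rather than using running statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    # naive 3x3 "same" convolution; x: (H, W, Cin), w: (3, 3, Cin, Cout)
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(xp[i:i + 3, j:j + 3], w, axes=3)
    return out

def batch_norm(x, eps=1e-5):
    # per-channel normalisation (simplified, stats from this input only)
    mu = x.mean((0, 1), keepdims=True)
    var = x.var((0, 1), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def max_pool(x):
    # 2x2 max pooling halves the spatial resolution
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max((1, 3))

def stage(x, c_out):
    w = rng.standard_normal((3, 3, x.shape[-1], c_out)) * 0.1
    return max_pool(np.maximum(batch_norm(conv2d(x, w)), 0))

x = rng.standard_normal((32, 32, 3))   # toy 32x32 RGB input
feats = []
for c in (8, 16, 32):                  # three stages, widening channels
    x = stage(x, c)
    feats.append(x)

print([f.shape for f in feats])  # [(16, 16, 8), (8, 8, 16), (4, 4, 32)]
```

Each stage halves the spatial resolution while widening the channel dimension, which is exactly the hierarchical feature-map pyramid described above.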
Output
The vision encoder produces a dense feature map / embeddings that encode both spatial and semantic information about the input image.
Examples of encoder outputs:
- SAM (Segment Anything Model) -- Produces 64x64x256 image embeddings from a 1024x1024 input image
- YOLO (You Only Look Once) -- Produces multi-scale feature maps at resolutions such as 13x13 and 26x26 for object detection at different scales
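The output shapes in both examples follow directly from the input resolution and the encoder's downsampling factor, which the arithmetic below makes explicit (the 416 YOLO input size is one common configuration, used here for illustration):

```python
# SAM: 1024x1024 input, 16x16 patches, 256-dim embeddings after the neck
img, patch, dim = 1024, 16, 256
sam_shape = (img // patch, img // patch, dim)
print(sam_shape)  # (64, 64, 256)

# YOLO: multi-scale heads at strides 32 and 16 for a 416x416 input
yolo_in = 416
yolo_grids = [yolo_in // stride for stride in (32, 16)]
print(yolo_grids)  # [13, 26]
```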