Implementation: ggml-org/ggml sam_encode_image

From Leeroopedia



Summary

Implementation of the SAM (Segment Anything Model) vision encoder in the GGML framework. The function builds a computational graph that encodes a preprocessed image into dense embeddings using a Vision Transformer architecture, leveraging GGML tensor operations for efficient inference.

API

struct ggml_cgraph * sam_encode_image(
    const sam_model & model,
    sam_state & state,
    const sam_image_f32 & img
)

Source: examples/sam/sam.cpp:L1190-1409

Repository: https://github.com/ggml-org/ggml

Parameters

  • model -- The loaded SAM model containing all weight tensors and hyperparameters
  • state -- Runtime state holding backend buffers and intermediate computation results
  • img -- Preprocessed 1024x1024 float32 image (resized, padded, and normalized)

Returns

  • ggml_cgraph* -- A pointer to the constructed computational graph for the image encoder

Produces

  • state.embd_img -- Contains the resulting 64x64x256 image embeddings after graph evaluation

Architecture

The encoder follows the SAM ViT architecture, built as a GGML computational graph:

  1. Patch Embedding -- 16x16 conv2d projection converting image patches into embedding vectors
  2. Add Positional Encoding -- Learned positional embeddings added to patch representations
  3. N Transformer Blocks -- Each block consists of:
    • LayerNorm -> Relative-Position Multi-Head Self-Attention (rel-pos MHSA) -> Residual Connection
    • LayerNorm -> MLP with GELU activation -> Residual Connection
  4. Neck -- Feature refinement through:
    • 1x1 conv -> LayerNorm -> 3x3 conv -> LayerNorm

YOLO CNN Approach (Alternative)

The GGML repository also contains a CNN-based vision encoder for YOLO:

  • build_graph at examples/yolo/yolov3-tiny.cpp:L393-453
  • Uses apply_conv2d at L170-183 implementing: conv2d -> batch normalization -> leaky ReLU
  • Demonstrates the CNN backbone alternative to the ViT approach used by SAM

GGML Operations Used

  • ggml_conv_2d -- 2D convolution for patch embedding and neck projections
  • ggml_add -- Residual connections and bias addition
  • ggml_norm -- Layer normalization
  • ggml_mul_mat -- Matrix multiplication for attention Q/K/V projections and MLP layers
  • ggml_permute -- Tensor dimension reordering for attention head manipulation
  • ggml_soft_max -- Softmax for attention weight computation
  • ggml_gelu -- GELU activation in MLP blocks
  • ggml_cont -- Ensuring contiguous memory layout after permutations
