Implementation: sam_encode_image (ggml-org/ggml)
Summary
Implementation of the SAM (Segment Anything Model) vision encoder in the GGML framework. The function builds a computational graph that encodes a preprocessed image into dense embeddings using a Vision Transformer architecture, leveraging GGML tensor operations for efficient inference.
API
struct ggml_cgraph * sam_encode_image(
const sam_model & model,
sam_state & state,
const sam_image_f32 & img
)
Source: examples/sam/sam.cpp:L1190-1409
Repository: https://github.com/ggml-org/ggml
Parameters
- model -- The loaded SAM model containing all weight tensors and hyperparameters
- state -- Runtime state holding backend buffers and intermediate computation results
- img -- Preprocessed 1024x1024 float32 image (normalized, resized, padded)
Returns
- ggml_cgraph* -- A pointer to the constructed computational graph for the image encoder
Produces
- state.embd_img -- Contains the resulting 64x64x256 image embeddings after graph evaluation
Architecture
The encoder follows the SAM ViT architecture, built as a GGML computational graph:
- Patch Embedding -- 16x16 conv2d projection converting image patches into embedding vectors
- Add Positional Encoding -- Learned positional embeddings added to patch representations
- N Transformer Blocks -- Each block consists of:
- LayerNorm -> Relative-Position Multi-Head Self-Attention (rel-pos MHSA) -> Residual Connection
- LayerNorm -> MLP with GELU activation -> Residual Connection
- Neck -- Feature refinement through:
- 1x1 conv -> LayerNorm -> 3x3 conv -> LayerNorm
YOLO CNN Approach (Alternative)
The GGML repository also contains a CNN-based vision encoder for YOLO:
- build_graph at examples/yolo/yolov3-tiny.cpp:L393-453
- Uses apply_conv2d at L170-183, implementing: conv2d -> batch normalization -> leaky ReLU
- Demonstrates the CNN backbone alternative to the ViT approach used by SAM
GGML Operations Used
- ggml_conv_2d -- 2D convolution for patch embedding and neck projections
- ggml_add -- Residual connections and bias addition
- ggml_norm -- Layer normalization
- ggml_mul_mat -- Matrix multiplication for attention Q/K/V projections and MLP layers
- ggml_permute -- Tensor dimension reordering for attention head manipulation
- ggml_soft_max -- Softmax for attention weight computation
- ggml_gelu -- GELU activation in MLP blocks
- ggml_cont -- Ensuring contiguous memory layout after permutations
Related
- Principle:Ggml_org_Ggml_Vision_Encoder_Execution
- Environment:Ggml_org_Ggml_C_Cpp_Build_Environment
- Environment:Ggml_org_Ggml_CUDA_GPU_Environment