Implementation: ggml-org/ggml sam_encode_image

From Leeroopedia



Summary

Implementation of the SAM (Segment Anything Model) vision encoder in the GGML framework. The function builds a computational graph that encodes a preprocessed image into dense embeddings using a Vision Transformer architecture, leveraging GGML tensor operations for efficient inference.

API

struct ggml_cgraph * sam_encode_image(
    const sam_model & model,
    sam_state & state,
    const sam_image_f32 & img
)

Source: examples/sam/sam.cpp:L1190-1409

Repository: https://github.com/ggml-org/ggml

Parameters

  • model -- The loaded SAM model containing all weight tensors and hyperparameters
  • state -- Runtime state holding backend buffers and intermediate computation results
  • img -- Preprocessed 1024x1024 float32 image (resized, padded, and normalized)

Returns

  • ggml_cgraph* -- A pointer to the constructed computational graph for the image encoder

Produces

  • state.embd_img -- Contains the resulting 64x64x256 image embeddings after graph evaluation

Architecture

The encoder follows the SAM ViT architecture, built as a GGML computational graph:

  1. Patch Embedding -- 16x16 conv2d projection converting image patches into embedding vectors
  2. Add Positional Encoding -- Learned positional embeddings added to patch representations
  3. N Transformer Blocks -- Each block consists of:
    • LayerNorm -> Relative-Position Multi-Head Self-Attention (rel-pos MHSA) -> Residual Connection
    • LayerNorm -> MLP with GELU activation -> Residual Connection
  4. Neck -- Feature refinement through:
    • 1x1 conv -> LayerNorm -> 3x3 conv -> LayerNorm

YOLO CNN Approach (Alternative)

The GGML repository also contains a CNN-based vision encoder for YOLO:

  • build_graph at examples/yolo/yolov3-tiny.cpp:L393-453
  • Uses apply_conv2d at L170-183 implementing: conv2d -> batch normalization -> leaky ReLU
  • Demonstrates the CNN backbone alternative to the ViT approach used by SAM

GGML Operations Used

  • ggml_conv_2d -- 2D convolution for patch embedding and neck projections
  • ggml_add -- Residual connections and bias addition
  • ggml_norm -- Layer normalization
  • ggml_mul_mat -- Matrix multiplication for attention Q/K/V projections and MLP layers
  • ggml_permute -- Tensor dimension reordering for attention head manipulation
  • ggml_soft_max -- Softmax for attention weight computation
  • ggml_gelu -- GELU activation in MLP blocks
  • ggml_cont -- Ensuring contiguous memory layout after permutations
