Implementation:Ggml org Ggml Sam build fast graph
Summary
sam_build_fast_graph is a C++ function that constructs a GGML computation graph for SAM's prompt-guided mask decoding. It combines prompt encoding and mask decoding into a single graph that, when computed, produces low-resolution segmentation masks and IoU quality predictions from pre-computed image embeddings and user-provided prompts.
API
struct ggml_cgraph * sam_build_fast_graph(
const sam_model & model,
sam_state & state,
int nx,
int ny,
const sam_prompt & prompt,
bool multimask_output
)
Source
- File: examples/sam/sam.cpp, lines 2005-2094
- Repository: https://github.com/ggml-org/ggml
Parameters
| Parameter | Type | Description |
|---|---|---|
| model | const sam_model & | Loaded SAM model containing all weights (image encoder, prompt encoder, mask decoder) |
| state | sam_state & | Runtime state holding pre-computed image embeddings from the image encoder; also receives output tensors |
| nx | int | Original image width in pixels |
| ny | int | Original image height in pixels |
| prompt | const sam_prompt & | User prompt containing point and/or bounding box coordinates for segmentation |
| multimask_output | bool | When true, generates multiple candidate masks with IoU-based ranking; when false, generates a single mask |
Returns
- Type: struct ggml_cgraph *
- Description: A GGML computation graph that, when executed via ggml_backend_sched_graph_compute, populates state.low_res_masks (low-resolution mask logits) and state.iou_predictions (quality scores for each mask)
Implementation
The function combines two major sub-functions into a single computation graph:
1. Prompt Encoding (sam_encode_prompt)
- Source: examples/sam/sam.cpp, lines 1449-1521
- Input: Point/box coordinates from the prompt
- Process:
  - Convert point and box coordinates to normalized positions
  - Apply Gaussian positional encoding (random Fourier features) to produce spatial embeddings
  - Add learned foreground/background type embeddings to point encodings
  - For boxes, encode top-left and bottom-right corners with learned corner-type embeddings
  - Concatenate all point and box embeddings into a sparse embedding tensor
  - Produce dense embeddings from learned parameters (not dependent on the specific prompt)
- Output: Sparse prompt embeddings and dense prompt embeddings
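The Gaussian positional encoding step can be illustrated with a minimal standalone sketch. This is not the code from sam.cpp: the helper name encode_point, the row-major layout of the Gaussian matrix, and the half-pixel shift before normalization are assumptions chosen to show the general random Fourier feature recipe (normalize to [-1, 1], project through a fixed Gaussian matrix, scale by 2π, then concatenate sine and cosine).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of sam.cpp): encodes one pixel coordinate
// with random Fourier features.
// `gaussian` is a 2 x num_feats matrix stored row-major:
// row 0 multiplies the x coordinate, row 1 multiplies the y coordinate.
std::vector<float> encode_point(float x, float y, int width, int height,
                                const std::vector<float> & gaussian, int num_feats) {
    assert(width > 0 && height > 0);
    assert(gaussian.size() == static_cast<std::size_t>(2 * num_feats));

    const float pi = 3.14159265358979323846f;

    // Assumption: shift to the pixel center, normalize to [0, 1], map to [-1, 1].
    const float cx = 2.0f * ((x + 0.5f) / width)  - 1.0f;
    const float cy = 2.0f * ((y + 0.5f) / height) - 1.0f;

    // Output layout: [sin features | cos features].
    std::vector<float> out(static_cast<std::size_t>(2 * num_feats));
    for (int j = 0; j < num_feats; ++j) {
        const float proj = 2.0f * pi * (cx * gaussian[j] + cy * gaussian[num_feats + j]);
        out[j]             = std::sin(proj);
        out[num_feats + j] = std::cos(proj);
    }
    return out;
}
```

In the real model, the learned type embedding (foreground, background, or box corner) would then be added to this vector; that step is omitted here.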
2. Mask Decoding (sam_decode_mask)
- Source: examples/sam/sam.cpp, lines 1604-1844
- Input: Image embeddings from state, sparse and dense prompt embeddings from the encoding step
- Process:
  - Prepend IoU output token and mask output tokens to the sparse prompt embeddings
  - Run 2-layer transformer with bidirectional cross-attention between prompt tokens and image embedding tokens
  - Extract the IoU output token and pass through an MLP head to predict mask quality scores
  - Extract mask output tokens and pass each through a hypernetwork MLP to produce per-mask weight vectors
  - Reshape and upscale image features via two ConvTranspose2d layers (4x spatial upscaling, from 64x64 to 256x256)
  - Compute dot product between hypernetwork weight vectors and upscaled features to produce low-resolution mask logits
- Output: state.low_res_masks and state.iou_predictions
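The final decoding step, the dot product between hypernetwork weight vectors and upscaled features, amounts to a small matrix product. The sketch below shows that arithmetic on flat arrays; the function name mask_logits and the memory layouts are illustrative assumptions, and the real implementation expresses this with GGML tensor operations rather than explicit loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of sam.cpp).
// hyper: n_masks x channels, row-major (one weight vector per mask token)
// feats: channels x n_pixels, row-major (upscaled image features, flattened spatially)
// returns: n_masks x n_pixels mask logits, row-major
std::vector<float> mask_logits(const std::vector<float> & hyper, int n_masks, int channels,
                               const std::vector<float> & feats, int n_pixels) {
    assert(hyper.size() == static_cast<std::size_t>(n_masks) * channels);
    assert(feats.size() == static_cast<std::size_t>(channels) * n_pixels);

    std::vector<float> out(static_cast<std::size_t>(n_masks) * n_pixels, 0.0f);
    for (int m = 0; m < n_masks; ++m) {
        for (int c = 0; c < channels; ++c) {
            const float w = hyper[m * channels + c];
            // Accumulate this channel's contribution to every pixel of mask m.
            for (int p = 0; p < n_pixels; ++p) {
                out[m * n_pixels + p] += w * feats[c * n_pixels + p];
            }
        }
    }
    return out;
}
```

With 256x256 upscaled features, n_pixels would be 65536, and each row of the result is one low-resolution mask.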
Output Processing (sam_write_masks)
- Source: examples/sam/sam.cpp, lines 1846-2002
- Description: Post-processes the decoder output into final segmentation masks
- Pipeline:
  - Bilinear upscale low-resolution masks from 256x256 to 1024x1024 (SAM's internal high resolution)
  - Crop and resize from 1024x1024 to the original image dimensions (nx x ny)
  - Threshold mask logits at 0.0 to produce binary masks
  - Filter candidate masks by IoU prediction score to select the best mask(s)
  - Write final masks to output (e.g., as image files)
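The thresholding and IoU-based selection steps can be sketched in a few lines. The helper names binarize_mask and best_mask_index are hypothetical, and two details are assumptions rather than facts taken from sam.cpp: the comparison is strict (logit > threshold), and foreground pixels are written as 255.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: turns mask logits into a binary mask.
// Assumption: strictly positive logits (relative to the threshold) are foreground (255).
std::vector<std::uint8_t> binarize_mask(const std::vector<float> & logits,
                                        float threshold = 0.0f) {
    std::vector<std::uint8_t> out(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = logits[i] > threshold ? 255 : 0;
    }
    return out;
}

// Hypothetical helper: returns the index of the candidate mask with the
// highest predicted IoU score (first one wins on ties).
std::size_t best_mask_index(const std::vector<float> & iou_predictions) {
    assert(!iou_predictions.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < iou_predictions.size(); ++i) {
        if (iou_predictions[i] > iou_predictions[best]) {
            best = i;
        }
    }
    return best;
}
```

When multimask_output is true, a caller could use best_mask_index to pick one of the candidates, or keep all candidates whose score exceeds some cutoff.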
Dependencies
- GGML core: Tensor operations, computation graph construction (ggml_build_forward_expand, ggml_new_tensor, etc.)
- sam_encode_prompt: Prompt encoding sub-function (lines 1449-1521)
- sam_decode_mask: Mask decoding sub-function (lines 1604-1844)
- sam_write_masks: Output post-processing (lines 1846-2002)
- Pre-computed state: Requires state to contain image embeddings from a prior call to the image encoder