
Workflow: ggml-org/ggml Vision Model Inference

From Leeroopedia


Knowledge Sources
Domains Computer_Vision, Inference, Object_Detection
Last Updated 2026-02-10 08:00 GMT

Overview

End-to-end process for running computer vision model inference using GGML, covering image segmentation with the Segment Anything Model (SAM) and object detection with YOLOv3-tiny.

Description

This workflow demonstrates how to perform vision model inference using GGML's tensor operations and backend infrastructure. It covers two representative vision architectures: SAM (Segment Anything Model) for image segmentation using a Vision Transformer (ViT) encoder with prompt-based mask decoding, and YOLOv3-tiny for real-time object detection using a convolutional darknet backbone. Both models follow the same high-level pattern: convert model weights from their native format to GGML/GGUF, preprocess input images, build and execute the computation graph, and post-process outputs into human-readable results (segmentation masks or bounding boxes with class labels).

Key outputs:

  • SAM: Binary segmentation masks with intersection-over-union (IoU) and stability scores
  • YOLO: Bounding box detections with class labels and confidence scores
  • Annotated output images with visual results

Usage

Execute this workflow when you need to run vision model inference on images using GGML, either for image segmentation (identifying and isolating objects) or object detection (locating and classifying objects). This is appropriate when you want to deploy these models on consumer hardware without requiring GPU-specific frameworks, leveraging GGML's cross-platform backend system for hardware acceleration on CPU, CUDA, Metal, or Vulkan.

Execution Steps

Step 1: Convert Model Weights

Transform vision model weights from their native framework format to GGML-compatible format. For SAM: convert PyTorch checkpoint (.pth) files using the provided conversion script, which extracts ViT encoder weights, prompt encoder parameters, and mask decoder weights into a flat GGML binary with optional f16 quantization. For YOLO: convert Darknet weight files to GGUF format using the provided Python script, which reads the serialized convolutional layer weights and batch normalization parameters.

Key considerations:

  • SAM currently supports only the ViT-B checkpoint variant
  • YOLO conversion handles the Darknet-specific weight serialization format
  • F16 conversion for SAM approximately halves the model file size (185 MB for ViT-B)
  • Pre-converted YOLO models are available on HuggingFace
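The conversion idea can be sketched in Python. The header layout below is illustrative only, not the exact byte format the SAM or YOLO conversion scripts emit; it shows the general pattern of serializing named weight tensors into a flat binary with optional f16 downcasting:

```python
import struct
import numpy as np

def write_tensor(f, name, data, use_f16=False):
    """Append one named tensor to a flat binary file.

    Illustrative layout (NOT the exact ggml/GGUF format):
    n_dims (i32), name length (i32), dtype id (i32: 0=f32, 1=f16),
    then the dims (i32 each), the UTF-8 name bytes, and the raw data.
    Downcasting to f16 halves the data payload, matching the roughly
    2x file-size reduction noted above for the SAM ViT-B checkpoint.
    """
    data = data.astype(np.float16 if use_f16 else np.float32)
    dtype_id = 1 if use_f16 else 0
    name_bytes = name.encode("utf-8")
    f.write(struct.pack("iii", data.ndim, len(name_bytes), dtype_id))
    f.write(struct.pack(f"{data.ndim}i", *data.shape))
    f.write(name_bytes)
    f.write(data.tobytes())
```

A real converter iterates over the checkpoint's state dict (PyTorch for SAM, the Darknet serialization for YOLO) and calls a routine like this once per weight tensor.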

Step 2: Load and Preprocess Input Image

Read the input image file and transform it to the format expected by the model. For SAM: resize the image to 1024x1024 pixels while preserving aspect ratio (padding with zeros), normalize pixel values, and arrange as CHW (channels-height-width) tensor format. For YOLO: resize to 416x416 pixels, normalize to the 0-1 range, and convert from HWC to CHW format. Both models use the stb_image library for image I/O.

Key considerations:

  • Image preprocessing must exactly match the model's training pipeline
  • SAM uses ImageNet-standard mean and standard deviation normalization
  • YOLO letterboxes images to maintain aspect ratio within the fixed input size
  • Input image dimensions affect the coordinate mapping for output results
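A minimal numpy sketch of the YOLO-style letterbox path (resize with preserved aspect ratio, pad, normalize to 0-1, HWC to CHW). Nearest-neighbor indexing stands in for the bilinear resize the real examples use, and the gray padding value is the common Darknet convention; SAM instead resizes toward 1024x1024 and applies ImageNet mean/std normalization:

```python
import numpy as np

def letterbox_chw(img_hwc, size=416):
    """Letterbox an HWC uint8 image into a (3, size, size) float32
    tensor in [0, 1], preserving aspect ratio and padding with 0.5."""
    h, w, _ = img_hwc.shape
    scale = size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    # nearest-neighbor index maps (stand-in for bilinear interpolation)
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img_hwc[ys][:, xs].astype(np.float32) / 255.0
    canvas = np.full((size, size, 3), 0.5, dtype=np.float32)
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas.transpose(2, 0, 1)  # HWC -> CHW
```

The `scale`, `top`, and `left` values computed here are exactly what Step 5 needs later to map detections back to original-image coordinates.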

Step 3: Initialize Backend and Load Model

Set up the GGML backend infrastructure and load the converted model weights into tensor structures. Initialize the backend with available hardware acceleration, create a GGML context with sufficient memory for all model tensors plus intermediate computation, read the model binary file to populate weight tensors, and allocate intermediate computation buffers as needed.

Key considerations:

  • SAM requires approximately 200 MB of context memory for ViT-B
  • Backend selection follows the same pattern as other GGML examples
  • Model loading validates tensor dimensions against expected architecture
  • Memory allocation uses the ggml-alloc system for efficient buffer management
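Two of the considerations above can be sketched concretely. The tensor name and the per-tensor overhead constant below are assumptions for illustration, not the actual ggml figures:

```python
import numpy as np

def validate_shapes(loaded, expected):
    """Compare tensor shapes read from the model file against what the
    architecture expects; mirrors the load-time dimension checks.
    Both arguments map tensor name -> tuple of dims."""
    errors = []
    for name, shape in expected.items():
        if name not in loaded:
            errors.append(f"missing tensor: {name}")
        elif tuple(loaded[name]) != tuple(shape):
            errors.append(f"{name}: got {loaded[name]}, expected {shape}")
    return errors

def context_size_bytes(shapes, dtype_size=4, per_tensor_overhead=512):
    """Rough context-size estimate: raw weight bytes plus a fixed
    per-tensor bookkeeping margin (the overhead value is illustrative)."""
    data = sum(int(np.prod(s)) * dtype_size for s in shapes.values())
    return data + len(shapes) * per_tensor_overhead
```

Summing every weight tensor this way is how one arrives at budget figures like the ~200 MB quoted for SAM ViT-B.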

Step 4: Build and Execute Vision Graph

Construct the model-specific computation graph and execute it. For SAM: build the ViT image encoder graph (patch embedding, transformer blocks with windowed attention, neck projection), then build the lightweight mask decoder graph conditioned on the input point prompt. For YOLO: build the sequential darknet convolutional backbone graph (convolution, batch norm, leaky ReLU, max pooling layers) followed by the two-scale detection head.

Key considerations:

  • SAM's encoder is the most compute-intensive part; mask decoding is lightweight
  • YOLO processes all detection scales in a single forward pass
  • The computation graph is executed via ggml_backend_sched_graph_compute
  • Intermediate tensor shapes can be logged for debugging
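For reference, the basic darknet building block (convolution, batch normalization, leaky ReLU) can be written out in numpy. This is a slow didactic sketch of the math, assuming stride 1 and same padding; GGML executes the equivalent fused operations on its hardware backends:

```python
import numpy as np

def conv_bn_leaky(x, w, gamma, beta, mean, var, eps=1e-5, slope=0.1):
    """One darknet block: 2D convolution (stride 1, same padding),
    batch normalization with stored statistics, then leaky ReLU.
    x: (C_in, H, W); w: (C_out, C_in, kH, kW)."""
    c_out, c_in, kh, kw = w.shape
    pad = kh // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    H, W = x.shape[1:]
    out = np.zeros((c_out, H, W), dtype=np.float32)
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    # batch norm with per-channel statistics, then leaky ReLU
    out = (gamma[:, None, None] * (out - mean[:, None, None])
           / np.sqrt(var[:, None, None] + eps) + beta[:, None, None])
    return np.where(out > 0, out, slope * out)
```

YOLOv3-tiny stacks blocks like this (interleaved with max pooling) into its backbone before the two-scale detection head.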

Step 5: Post-process and Output Results

Extract raw model outputs and transform them into meaningful results. For SAM: threshold the predicted mask logits, filter masks by IoU and stability score thresholds, and write the binary mask to a PNG image file. For YOLO: decode bounding box predictions from anchor offsets, apply confidence thresholds, run non-maximum suppression (NMS) to eliminate duplicate detections, and draw annotated bounding boxes with class labels on the output image.

Key considerations:

  • SAM generates multiple candidate masks ranked by predicted IoU
  • YOLO NMS threshold and confidence threshold control detection sensitivity
  • Output coordinates must be mapped back to the original image dimensions
  • Performance timing helps benchmark different hardware backends
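The NMS stage of the YOLO path can be sketched as a standard greedy suppression over (x1, y1, x2, y2) boxes; the default thresholds below are common YOLO choices, not values mandated by the example:

```python
import numpy as np

def iou(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.45, conf_thresh=0.25):
    """Greedy NMS: drop low-confidence boxes, then repeatedly keep the
    highest-scoring box and suppress boxes that overlap it too much."""
    mask = scores >= conf_thresh
    boxes, scores = boxes[mask], scores[mask]
    order = np.argsort(-scores)
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return boxes[keep], scores[keep]
```

Raising `iou_thresh` or lowering `conf_thresh` keeps more boxes, which is the sensitivity trade-off noted above. The surviving boxes must still be un-letterboxed (inverse of the Step 2 scale and padding) before drawing on the original image.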

GitHub URL

Workflow Repository