Implementation:Ggml org Ggml Vulkan backend
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (API Doc) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, GPU_Computing |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
Main implementation of the Vulkan GPU backend for GGML, providing cross-platform GPU acceleration via the Vulkan compute API on any Vulkan-capable GPU.
Description
ggml-vulkan.cpp is the most portable and largest GPU backend in GGML at approximately 16,000 lines. It provides:
- Vulkan initialization: Uses `vulkan.hpp` (C++ Vulkan bindings) with a dynamic dispatch loader to avoid static linking to the Vulkan runtime. Includes polyfill definitions for `VK_KHR_shader_bfloat16` for compatibility with older SDK versions.
- Pipeline management: The `vk_pipeline_struct` manages shader modules, pipeline layouts, push constant sizes, workgroup configurations, and supports lazy parallel compilation. Pipelines can have 64-bit indexing variants linked in a list.
- Vendor-specific optimizations: Detects GPU vendor via vendor IDs (`VK_VENDOR_ID_AMD = 0x1002`, `VK_VENDOR_ID_APPLE = 0x106b`, `VK_VENDOR_ID_INTEL = 0x8086`, `VK_VENDOR_ID_NVIDIA = 0x10de`) to enable vendor-specific code paths.
- Operation dispatch: Supports a comprehensive set of GGML operations via push constant structs (`vk_mat_mat_push_constants`, `vk_flash_attn_push_constants`, `vk_op_rope_push_constants`, etc.) that parameterize the compute shaders.
- Operation fusion: Supports fusing consecutive add operations (`MAX_FUSED_ADDS` derived from `MAX_PARAMETER_COUNT = 12`).
- Synchronization: Platform-specific yield intrinsics (`_mm_pause` on x86, `__yield` on ARM) for efficient spin-wait synchronization during GPU command submission.
- Memory management: Full buffer lifecycle including device memory, host-pinned memory, staging buffers for CPU-GPU transfers, and memory logging for debugging.
- Shader loading: Pre-compiled SPIR-V shaders are embedded via `ggml-vulkan-shaders.hpp`, generated by the `vulkan-shaders-gen` tool.
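The vendor detection described in the list can be illustrated with a minimal, self-contained sketch. The four ID values are the ones listed above; the helper `vendor_label` is hypothetical and is not part of the backend. It only shows how a 32-bit vendor ID (as reported in Vulkan's `VkPhysicalDeviceProperties::vendorID`) might be mapped to a code path.

```cpp
#include <cstdint>
#include <string>

// Vendor IDs as listed in the description above.
constexpr std::uint32_t VK_VENDOR_ID_AMD    = 0x1002;
constexpr std::uint32_t VK_VENDOR_ID_APPLE  = 0x106b;
constexpr std::uint32_t VK_VENDOR_ID_INTEL  = 0x8086;
constexpr std::uint32_t VK_VENDOR_ID_NVIDIA = 0x10de;

// Hypothetical helper (not a backend function): map a vendor ID to a label
// that a caller could use to select a vendor-specific code path.
inline std::string vendor_label(std::uint32_t vendor_id) {
    switch (vendor_id) {
        case VK_VENDOR_ID_AMD:    return "AMD";
        case VK_VENDOR_ID_APPLE:  return "Apple";
        case VK_VENDOR_ID_INTEL:  return "Intel";
        case VK_VENDOR_ID_NVIDIA: return "NVIDIA";
        default:                  return "other";  // fall back to generic paths
    }
}
```

An unrecognized ID falls through to a generic label, which mirrors the idea that vendor-specific paths are optional optimizations rather than requirements.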
The backend runs on any Vulkan-capable GPU across Windows, Linux, macOS (via MoltenVK), and Android.
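Because the backend targets several CPU architectures, its spin-wait hint has to be selected per platform. The following is a rough sketch of that idea, not the backend's actual code: `cpu_relax` and `spin_until_set` are hypothetical names, the wait is on a plain `std::atomic<bool>` rather than a GPU fence, and on AArch64 an inline `yield` instruction stands in for the `__yield` intrinsic so the sketch builds with GCC and Clang.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>  // declares _mm_pause
#endif

// Hypothetical helper: issue a CPU "relax" hint inside a busy-wait loop.
inline void cpu_relax() {
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
    _mm_pause();                          // x86 spin-loop hint
#elif defined(__aarch64__)
    __asm__ __volatile__("yield");        // assumption: stands in for __yield
#else
    std::this_thread::yield();            // portable fallback
#endif
}

// Hypothetical helper: spin until `flag` is set and return how many
// relax hints were issued while waiting.
inline std::size_t spin_until_set(const std::atomic<bool> & flag) {
    std::size_t spins = 0;
    while (!flag.load(std::memory_order_acquire)) {
        cpu_relax();
        ++spins;
    }
    return spins;
}
```

The relax hint keeps the waiting core from saturating the pipeline or memory bus while polling, which is the usual motivation for such intrinsics in short waits.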
Usage
Users initialize the Vulkan backend by calling ggml_backend_vk_init(dev_num). The backend is typically discovered automatically by ggml_backend_load_all(). Multiple Vulkan devices can be used simultaneously (up to 16).
Code Reference
Source Location
GGML repo, file: src/ggml-vulkan/ggml-vulkan.cpp (16086 lines).
Signatures
ggml_backend_t ggml_backend_vk_init(size_t dev_num);
bool ggml_backend_is_vk(ggml_backend_t backend);
ggml_backend_buffer_type_t ggml_backend_vk_buffer_type(size_t dev_num);
ggml_backend_buffer_type_t ggml_backend_vk_host_buffer_type(void);
ggml_backend_reg_t ggml_backend_vk_reg(void);
Import
#include "ggml-vulkan.h"
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `dev_num` | `size_t` | Yes | Vulkan device index (0-based). Selects which GPU to use when multiple Vulkan-capable devices are present. |
Outputs
| Output | Type | Description |
|---|---|---|
| Backend handle | `ggml_backend_t` | Opaque handle to the initialized Vulkan backend for use with the GGML scheduler. |
| Buffer type | `ggml_backend_buffer_type_t` | Buffer type for Vulkan device memory or host-pinned memory. |
| Registration handle | `ggml_backend_reg_t` | Backend registration for the auto-discovery system. |
Usage Examples
#include "ggml-vulkan.h"
#include "ggml-backend.h"

// Query available devices
int n_devices = ggml_backend_vk_get_device_count();
char desc[256];
ggml_backend_vk_get_device_description(0, desc, sizeof(desc));

// Initialize the first Vulkan device
ggml_backend_t vk_backend = ggml_backend_vk_init(0);
if (vk_backend && ggml_backend_is_vk(vk_backend)) {
    // Query device memory
    size_t free_mem, total_mem;
    ggml_backend_vk_get_device_memory(0, &free_mem, &total_mem);

    // Use with scheduler. `graph` is assumed to be a ggml_cgraph * built elsewhere;
    // the argument list of ggml_backend_sched_new may differ between GGML versions.
    ggml_backend_sched_t sched = ggml_backend_sched_new(
        &vk_backend, NULL, 1, GGML_DEFAULT_GRAPH_SIZE, false);
    ggml_backend_sched_graph_compute(sched, graph);

    // Release the scheduler before the backend it references
    ggml_backend_sched_free(sched);
    ggml_backend_free(vk_backend);
}