Implementation:Ggml org Ggml Vulkan backend

Metadata

Field	Value
Page Type	Implementation (API Doc)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, GPU_Computing
Last Updated	2025-05-15 12:00 GMT

Overview

Main implementation of the Vulkan GPU backend for GGML, providing cross-platform GPU acceleration via the Vulkan compute API on any Vulkan-capable GPU.

Description

ggml-vulkan.cpp implements the most portable GPU backend in GGML and, at approximately 16,000 lines, the largest. It provides:

Vulkan initialization: Uses vulkan.hpp (C++ Vulkan bindings) with a dynamic dispatch loader to avoid static linking to the Vulkan runtime. Includes polyfill definitions for VK_KHR_shader_bfloat16 for compatibility with older SDK versions.
Pipeline management: The vk_pipeline_struct tracks shader modules, pipeline layouts, push constant sizes, and workgroup configurations, and supports lazy parallel compilation. Pipelines can have 64-bit indexing variants, which are chained in a linked list.
Vendor-specific optimizations: Detects GPU vendor via vendor IDs (VK_VENDOR_ID_AMD = 0x1002, VK_VENDOR_ID_APPLE = 0x106b, VK_VENDOR_ID_INTEL = 0x8086, VK_VENDOR_ID_NVIDIA = 0x10de) to enable vendor-specific code paths.
Operation dispatch: Supports a comprehensive set of GGML operations via push constant structs (vk_mat_mat_push_constants, vk_flash_attn_push_constants, vk_op_rope_push_constants, etc.) that parameterize the compute shaders.
Operation fusion: Supports fusing consecutive add operations (MAX_FUSED_ADDS derived from MAX_PARAMETER_COUNT = 12).
Synchronization: Platform-specific yield intrinsics (_mm_pause on x86, __yield on ARM) for efficient spin-wait synchronization during GPU command submission.
Memory management: Full buffer lifecycle including device memory, host-pinned memory, staging buffers for CPU-GPU transfers, and memory logging for debugging.
Shader loading: Pre-compiled SPIR-V shaders are embedded via ggml-vulkan-shaders.hpp, generated by the vulkan-shaders-gen tool.
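The vendor detection mentioned in the list above can be illustrated with a minimal sketch. The four constants use the values quoted in the description; the GpuVendor enum and the classify_vendor helper are hypothetical names introduced only for this example and are not part of GGML. In a real Vulkan program the ID would come from the vendorID field of VkPhysicalDeviceProperties.

```cpp
#include <cstdint>

// Vendor IDs with the values quoted in the description above.
constexpr std::uint32_t VK_VENDOR_ID_AMD    = 0x1002;
constexpr std::uint32_t VK_VENDOR_ID_APPLE  = 0x106b;
constexpr std::uint32_t VK_VENDOR_ID_INTEL  = 0x8086;
constexpr std::uint32_t VK_VENDOR_ID_NVIDIA = 0x10de;

// Hypothetical classification used to pick a vendor-specific code path.
enum class GpuVendor { AMD, Apple, Intel, Nvidia, Other };

// Hypothetical helper (not a GGML function): map a raw vendor ID to a
// GpuVendor value, falling back to Other for IDs not listed above.
constexpr GpuVendor classify_vendor(std::uint32_t vendor_id) {
    switch (vendor_id) {
        case VK_VENDOR_ID_AMD:    return GpuVendor::AMD;
        case VK_VENDOR_ID_APPLE:  return GpuVendor::Apple;
        case VK_VENDOR_ID_INTEL:  return GpuVendor::Intel;
        case VK_VENDOR_ID_NVIDIA: return GpuVendor::Nvidia;
        default:                  return GpuVendor::Other;
    }
}
```

A caller could branch on the returned value to select tuning parameters, keeping the raw hexadecimal IDs in one place.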

The backend runs on any Vulkan-capable GPU across Windows, Linux, macOS (via MoltenVK), and Android.
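The spin-wait pattern named under Synchronization can be sketched as follows. This is an illustration under stated assumptions, not the backend's actual code: the CPU_RELAX macro and the spin_wait function are hypothetical names, the ARM branch emits the yield instruction through inline assembly (the same instruction the __yield intrinsic maps to), and other platforms fall back to a no-op.

```cpp
#include <atomic>
#include <cstddef>

// Pick a CPU "relax" hint for the current platform (hypothetical macro name).
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#  include <immintrin.h>
#  define CPU_RELAX() _mm_pause()
#elif (defined(__aarch64__) || defined(__arm__)) && (defined(__GNUC__) || defined(__clang__))
#  define CPU_RELAX() __asm__ __volatile__("yield")
#else
#  define CPU_RELAX() ((void) 0)
#endif

// Hypothetical helper: spin until `flag` is observed true or `max_spins`
// iterations have elapsed. Returns whether the flag was observed set.
inline bool spin_wait(const std::atomic<bool> & flag, std::size_t max_spins) {
    for (std::size_t i = 0; i < max_spins; ++i) {
        if (flag.load(std::memory_order_acquire)) {
            return true;
        }
        CPU_RELAX();  // hint to the core that this is a busy-wait loop
    }
    // One final check so a zero spin budget still reports the current state.
    return flag.load(std::memory_order_acquire);
}
```

The relax hint lowers power use and contention with a sibling hardware thread while waiting, which is why it is preferred over an empty loop for short waits.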

Usage

Users initialize the Vulkan backend by calling ggml_backend_vk_init(dev_num). The backend is typically discovered automatically by ggml_backend_load_all(). Multiple Vulkan devices can be used simultaneously (up to 16).
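Because dev_num is an index into the enumerated devices, a caller may want to validate it before initializing the backend. The sketch below is a hypothetical helper, not a GGML function; kMaxVkDevices is an illustrative name for the limit of 16 stated above, and device_count stands for the value returned by ggml_backend_vk_get_device_count().

```cpp
#include <cstddef>

// Illustrative name for the 16-device limit mentioned above.
constexpr std::size_t kMaxVkDevices = 16;

// Hypothetical helper: true if `dev_num` is a 0-based index that falls
// inside both the reported device count and the 16-device limit.
constexpr bool is_usable_device_index(std::size_t dev_num, int device_count) {
    return device_count > 0
        && dev_num < static_cast<std::size_t>(device_count)
        && dev_num < kMaxVkDevices;
}
```

A caller would check is_usable_device_index(dev_num, device_count) first and only then pass dev_num to ggml_backend_vk_init.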

Code Reference

Source Location

GGML repo, file: src/ggml-vulkan/ggml-vulkan.cpp (16086 lines).

Signatures

ggml_backend_t ggml_backend_vk_init(size_t dev_num);
bool ggml_backend_is_vk(ggml_backend_t backend);
ggml_backend_buffer_type_t ggml_backend_vk_buffer_type(size_t dev_num);
ggml_backend_buffer_type_t ggml_backend_vk_host_buffer_type(void);
ggml_backend_reg_t ggml_backend_vk_reg(void);

Import

#include "ggml-vulkan.h"

I/O Contract

Inputs

Parameter	Type	Required	Description
`dev_num`	`size_t`	Yes	Vulkan device index (0-based). Selects which GPU to use when multiple Vulkan-capable devices are present.

Outputs

Output	Type	Description
Backend handle	`ggml_backend_t`	Opaque handle to the initialized Vulkan backend for use with the GGML scheduler.
Buffer type	`ggml_backend_buffer_type_t`	Buffer type for Vulkan device memory or host-pinned memory.
Registration handle	`ggml_backend_reg_t`	Backend registration for the auto-discovery system.

Usage Examples

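#include <stdio.h>

#include "ggml-vulkan.h"
#include "ggml-backend.h"

// `graph` is assumed to be a compute graph that the caller has already built.
static void compute_on_vulkan(struct ggml_cgraph * graph) {
    // Query available devices and bail out if none is present
    int n_devices = ggml_backend_vk_get_device_count();
    if (n_devices < 1) {
        return;
    }
    char desc[256];
    ggml_backend_vk_get_device_description(0, desc, sizeof(desc));
    printf("Vulkan device 0 of %d: %s\n", n_devices, desc);

    // Initialize the first Vulkan device
    ggml_backend_t vk_backend = ggml_backend_vk_init(0);

    if (vk_backend && ggml_backend_is_vk(vk_backend)) {
        // Query device memory
        size_t free_mem, total_mem;
        ggml_backend_vk_get_device_memory(0, &free_mem, &total_mem);

        // Use with scheduler. The argument list of ggml_backend_sched_new
        // may differ between GGML versions; check ggml-backend.h for yours.
        ggml_backend_sched_t sched = ggml_backend_sched_new(
            &vk_backend, NULL, 1, GGML_DEFAULT_GRAPH_SIZE, false);

        ggml_backend_sched_graph_compute(sched, graph);

        ggml_backend_sched_free(sched);
        ggml_backend_free(vk_backend);
    }
}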
