Implementation:Ggml org Ggml Cuda backend api
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (API Header) |
| Knowledge Sources | GGML |
| Domains | ML_Infrastructure, Tensor_Computing, GPU_Computing |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Declares the CUDA/ROCm/MUSA GPU backend interface for running tensor operations on NVIDIA, AMD, and Moore Threads GPUs.
Description
ggml-cuda.h (47 lines) provides the public API for the GPU backend that supports three hardware platforms through conditional compilation:
- CUDA (NVIDIA) -- default, uses cuBLAS
- ROCm/HIP (AMD) -- when GGML_USE_HIP is defined, uses hipBLAS
- MUSA (Moore Threads) -- when GGML_USE_MUSA is defined, uses muBLAS
The header defines name macros (GGML_CUDA_NAME, GGML_CUBLAS_NAME) that resolve to the appropriate platform strings, and sets GGML_CUDA_MAX_DEVICES to 16.
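The conditional name selection described above can be sketched as follows. This is a minimal, self-contained illustration, not the header's text: it assumes the macros expand to the platform and BLAS library names listed above, and the DEMO_ prefixed macros and demo_describe_platform helper are hypothetical names chosen so the snippet compiles without GGML.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for GGML_CUDA_NAME / GGML_CUBLAS_NAME. */
#if defined(GGML_USE_HIP)
#define DEMO_GPU_NAME  "ROCm"
#define DEMO_BLAS_NAME "hipBLAS"
#elif defined(GGML_USE_MUSA)
#define DEMO_GPU_NAME  "MUSA"
#define DEMO_BLAS_NAME "muBLAS"
#else
#define DEMO_GPU_NAME  "CUDA"
#define DEMO_BLAS_NAME "cuBLAS"
#endif

/* Writes "<platform> (<BLAS library>)" into buf.
 * Returns the number of characters the full string needs (as snprintf does). */
int demo_describe_platform(char *buf, size_t size) {
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%s (%s)", DEMO_GPU_NAME, DEMO_BLAS_NAME);
}
```

Because the choice is made by the preprocessor, code that prints or logs these names needs no runtime platform checks; building with a different GGML_USE_* definition changes the strings.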
API functions:
- ggml_backend_cuda_init(device) -- initialize a GPU backend for a specific device
- ggml_backend_is_cuda() -- identify CUDA backends
- ggml_backend_cuda_buffer_type(device) -- device-specific buffer type
- ggml_backend_cuda_split_buffer_type(main_device, tensor_split) -- split buffer for multi-GPU tensor parallelism (distributes matrix rows across devices)
- ggml_backend_cuda_host_buffer_type() -- pinned host memory for fast CPU-GPU transfers
- ggml_backend_cuda_get_device_count/description/memory() -- device enumeration
- ggml_backend_cuda_register/unregister_host_buffer() -- pin/unpin host memory for DMA transfers
- ggml_backend_cuda_reg() -- backend registration handle
Usage
Include this header in application code to initialize GPU backends, configure multi-GPU tensor parallelism, and manage GPU memory. This is the primary high-performance backend for NVIDIA GPU inference.
Code Reference
Source Location
GGML repo, file: include/ggml-cuda.h, 47 lines.
Signature
#define GGML_CUDA_MAX_DEVICES 16
GGML_BACKEND_API ggml_backend_t ggml_backend_cuda_init(int device);
GGML_BACKEND_API bool ggml_backend_is_cuda(ggml_backend_t backend);
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_cuda_buffer_type(int device);
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_cuda_split_buffer_type(
    int main_device, const float * tensor_split);
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_cuda_host_buffer_type(void);
GGML_BACKEND_API int ggml_backend_cuda_get_device_count(void);
GGML_BACKEND_API void ggml_backend_cuda_get_device_description(
    int device, char * description, size_t description_size);
GGML_BACKEND_API void ggml_backend_cuda_get_device_memory(
    int device, size_t * free, size_t * total);
GGML_BACKEND_API bool ggml_backend_cuda_register_host_buffer(void * buffer, size_t size);
GGML_BACKEND_API void ggml_backend_cuda_unregister_host_buffer(void * buffer);
GGML_BACKEND_API ggml_backend_reg_t ggml_backend_cuda_reg(void);
Import
#include "ggml-cuda.h"
Dependencies
- ggml.h -- core GGML types
- ggml-backend.h -- backend abstraction types
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| device | int | Yes | GPU device index (0 to GGML_CUDA_MAX_DEVICES - 1). |
| main_device | int | Yes (for split buffer) | Primary device for multi-GPU split. |
| tensor_split | const float * | Yes (for split buffer) | Array of proportions for distributing tensor rows across devices. |
| buffer | void * | Yes (for register/unregister) | Host memory buffer to pin/unpin. |
| size | size_t | Yes (for register) | Size of the host buffer to pin. |
Outputs
| Output | Type | Description |
|---|---|---|
| Backend handle | ggml_backend_t | Initialized CUDA/ROCm/MUSA backend for the specified device. |
| Buffer type | ggml_backend_buffer_type_t | Device, split, or host buffer type interface. |
| Device count | int | Number of available GPU devices. |
| Registration success | bool | Whether host buffer pinning succeeded. |
Usage Examples
Single-GPU Setup
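#include <stdio.h>
#include "ggml-cuda.h"
int main(void) {
    ggml_backend_t gpu = ggml_backend_cuda_init(0);  // backend for device 0
    if (gpu == NULL) return 1;                       // initialization failed
    size_t free_mem, total_mem;                      // avoid shadowing free()
    ggml_backend_cuda_get_device_memory(0, &free_mem, &total_mem);
    printf("GPU 0: %.1f GB free / %.1f GB total\n", free_mem / 1e9, total_mem / 1e9);
    ggml_backend_free(gpu);                          // release the backend
}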
Multi-GPU Tensor Parallelism
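#include "ggml-cuda.h"
// Split tensors 60/40 across two GPUs. The array is sized to
// GGML_CUDA_MAX_DEVICES (unused entries are 0) on the assumption that the
// backend may read one entry per possible device.
float split[GGML_CUDA_MAX_DEVICES] = { 0.6f, 0.4f };
ggml_backend_buffer_type_t split_buft =
    ggml_backend_cuda_split_buffer_type(0, split);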
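To make the row-wise distribution concrete, the sketch below converts a proportions array like the one in the multi-GPU example into per-device row ranges. It is an illustrative simplification and not the backend's actual partitioning logic, which this page does not show and which may round or align boundaries differently. The function demo_split_rows and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: computes row boundaries from split proportions.
 * After the call, device i owns rows [row_start[i], row_start[i + 1]).
 * row_start must have room for n_devices + 1 entries. */
void demo_split_rows(const float *split, int n_devices, long n_rows, long *row_start) {
    assert(split != NULL && row_start != NULL);
    assert(n_devices > 0 && n_rows >= 0);

    /* Normalize so the proportions need not sum to exactly 1. */
    double sum = 0.0;
    for (int i = 0; i < n_devices; ++i) {
        sum += split[i];
    }
    assert(sum > 0.0);

    /* Each boundary is the cumulative share so far, truncated to a whole row. */
    double acc = 0.0;
    for (int i = 0; i < n_devices; ++i) {
        row_start[i] = (long)((double)n_rows * (acc / sum));
        acc += split[i];
    }
    row_start[n_devices] = n_rows;  /* the last device takes the remainder */
}
```

For example, proportions { 3.0f, 1.0f } over 8 rows give boundaries 0, 6, 8: the first device holds rows 0 to 5 and the second holds rows 6 and 7.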