Implementation:Ggml org Ggml Cuda backend api

Metadata

Field	Value
Page Type	Implementation (API Header)
Knowledge Sources	GGML
Domains	ML_Infrastructure, Tensor_Computing, GPU_Computing
Last Updated	2026-02-10 12:00 GMT

Overview

Declares the CUDA/ROCm/MUSA GPU backend interface for running tensor operations on NVIDIA, AMD, and Moore Threads GPUs.

Description

ggml-cuda.h (47 lines) provides the public API for the GPU backend that supports three hardware platforms through conditional compilation:

CUDA (NVIDIA) -- default, uses cuBLAS
ROCm/HIP (AMD) -- when GGML_USE_HIP is defined, uses hipBLAS
MUSA (Moore Threads) -- when GGML_USE_MUSA is defined, uses muBLAS

The header defines name macros (GGML_CUDA_NAME, GGML_CUBLAS_NAME) that resolve to the appropriate platform strings, and GGML_CUDA_MAX_DEVICES = 16.
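
The conditional-compilation pattern described above can be sketched as follows. This is an illustration only, not the header's actual text: the `DEMO_*` macros and `demo_describe_build` are hypothetical names, and the platform strings are assumed from the platform and BLAS names listed above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical DEMO_* macros: pick platform strings at compile time.
 * The string values are assumptions based on the platform list above. */
#if defined(GGML_USE_HIP)
#  define DEMO_GPU_NAME  "ROCm"
#  define DEMO_BLAS_NAME "hipBLAS"
#elif defined(GGML_USE_MUSA)
#  define DEMO_GPU_NAME  "MUSA"
#  define DEMO_BLAS_NAME "muBLAS"
#else
#  define DEMO_GPU_NAME  "CUDA"
#  define DEMO_BLAS_NAME "cuBLAS"
#endif

#define DEMO_MAX_DEVICES 16 /* mirrors GGML_CUDA_MAX_DEVICES */

/* Writes a one-line summary of the selected platform into buf and
 * returns the snprintf result (length the full string requires). */
static int demo_describe_build(char *buf, size_t buf_size) {
    assert(buf != NULL && buf_size > 0);
    return snprintf(buf, buf_size, "%s (%s), up to %d devices",
                    DEMO_GPU_NAME, DEMO_BLAS_NAME, DEMO_MAX_DEVICES);
}
```

Because the choice is made by the preprocessor, a single binary targets exactly one of the three platforms; switching platforms means rebuilding with a different define.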

API functions:

ggml_backend_cuda_init(device) -- initialize a GPU backend for a specific device
ggml_backend_is_cuda(backend) -- check whether a given backend handle is a CUDA backend
ggml_backend_cuda_buffer_type(device) -- device-specific buffer type
ggml_backend_cuda_split_buffer_type(main_device, tensor_split) -- split buffer for multi-GPU tensor parallelism (distributes matrix rows across devices)
ggml_backend_cuda_host_buffer_type() -- pinned host memory for fast CPU-GPU transfers
ggml_backend_cuda_get_device_count/description/memory() -- device enumeration
ggml_backend_cuda_register/unregister_host_buffer() -- pin/unpin host memory for DMA transfers
ggml_backend_cuda_reg() -- backend registration handle

Usage

Include this header in application code to initialize GPU backends, configure multi-GPU tensor parallelism, and manage GPU memory. It is the entry point to GGML's GPU backend on NVIDIA hardware and, when built with GGML_USE_HIP or GGML_USE_MUSA, on AMD and Moore Threads GPUs.

Code Reference

Source Location

GGML repo, file: include/ggml-cuda.h, 47 lines.

Signature

#define GGML_CUDA_MAX_DEVICES 16

GGML_BACKEND_API ggml_backend_t ggml_backend_cuda_init(int device);
GGML_BACKEND_API bool ggml_backend_is_cuda(ggml_backend_t backend);
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_cuda_buffer_type(int device);
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_cuda_split_buffer_type(
    int main_device, const float * tensor_split);
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_cuda_host_buffer_type(void);
GGML_BACKEND_API int  ggml_backend_cuda_get_device_count(void);
GGML_BACKEND_API void ggml_backend_cuda_get_device_description(
    int device, char * description, size_t description_size);
GGML_BACKEND_API void ggml_backend_cuda_get_device_memory(
    int device, size_t * free, size_t * total);
GGML_BACKEND_API bool ggml_backend_cuda_register_host_buffer(void * buffer, size_t size);
GGML_BACKEND_API void ggml_backend_cuda_unregister_host_buffer(void * buffer);
GGML_BACKEND_API ggml_backend_reg_t ggml_backend_cuda_reg(void);

Import

#include "ggml-cuda.h"

Dependencies

ggml.h -- core GGML types
ggml-backend.h -- backend abstraction types

I/O Contract

Inputs

Parameter	Type	Required	Description
`device`	`int`	Yes	GPU device index, from 0 to one less than the value returned by `ggml_backend_cuda_get_device_count()` (never more than `GGML_CUDA_MAX_DEVICES - 1`).
`main_device`	`int`	Yes (for split buffer)	Primary device for multi-GPU split.
`tensor_split`	`const float *`	Yes (for split buffer)	Array of proportions for distributing tensor rows across devices.
`buffer`	`void *`	Yes (for register/unregister)	Host memory buffer to pin/unpin.
`size`	`size_t`	Yes (for register)	Size of the host buffer to pin.

Outputs

Output	Type	Description
Backend handle	`ggml_backend_t`	Initialized CUDA/ROCm/MUSA backend for the specified device.
Buffer type	`ggml_backend_buffer_type_t`	Device, split, or host buffer type interface.
Device count	`int`	Number of available GPU devices.
Registration success	`bool`	Whether host buffer pinning succeeded.
Device description	`char *` (out parameter)	Description string written into the caller's buffer of `description_size` bytes.
Device memory	`size_t *` (out parameters)	Free and total device memory, in bytes, written through `free` and `total`.

Usage Examples

Single-GPU Setup

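#include <stdio.h>
#include "ggml-cuda.h"

int main(void) {
    ggml_backend_t gpu = ggml_backend_cuda_init(0);  // returns NULL on failure
    if (gpu == NULL) return 1;

    size_t free_bytes, total_bytes;  // named to avoid shadowing free()
    ggml_backend_cuda_get_device_memory(0, &free_bytes, &total_bytes);
    printf("GPU 0: %.1f GB free / %.1f GB total\n", free_bytes / 1e9, total_bytes / 1e9);

    ggml_backend_free(gpu);  // release the backend when done
    return 0;
}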
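
When more than one GPU is present, an application may want to choose a device instead of hard-coding index 0. The sketch below shows one possible selection rule, picking the device with the most free memory. `demo_pick_device` is a hypothetical helper that works on plain numbers, so it does not depend on the GGML headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of ggml-cuda.h.
 * Returns the index of the largest entry in free_bytes, or -1 when
 * n_devices <= 0. Ties resolve to the lowest index. */
static int demo_pick_device(const size_t *free_bytes, int n_devices) {
    assert(free_bytes != NULL);
    int best = -1;
    for (int i = 0; i < n_devices; ++i) {
        if (best < 0 || free_bytes[i] > free_bytes[best]) {
            best = i;
        }
    }
    return best;
}
```

In an application, `free_bytes[i]` would be filled by calling `ggml_backend_cuda_get_device_memory(i, ...)` for each index below `ggml_backend_cuda_get_device_count()`, and the returned index passed to `ggml_backend_cuda_init`.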

Multi-GPU Tensor Parallelism

#include "ggml-cuda.h"

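// Split tensors 60/40 across two GPUs. The array is sized to
// GGML_CUDA_MAX_DEVICES so one entry exists per possible device;
// the unused entries are zero-initialized.
float split[GGML_CUDA_MAX_DEVICES] = { 0.6f, 0.4f };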
ggml_backend_buffer_type_t split_buft =
    ggml_backend_cuda_split_buffer_type(0, split);
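
Instead of hard-coding the proportions, they can be derived from what each device has available. The sketch below is one possible heuristic, not something the header prescribes: `demo_fill_tensor_split` is a hypothetical helper that sets each proportion to a device's share of the total free memory and zero-fills the remaining entries.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_DEVICES 16 /* mirrors GGML_CUDA_MAX_DEVICES */

/* Hypothetical helper, not part of ggml-cuda.h.
 * Fills out[0..DEMO_MAX_DEVICES-1]: the first n_devices entries get each
 * device's fraction of the total free memory, the rest are set to zero. */
static void demo_fill_tensor_split(const size_t *free_bytes, int n_devices,
                                   float out[DEMO_MAX_DEVICES]) {
    assert(free_bytes != NULL && out != NULL);
    assert(n_devices > 0 && n_devices <= DEMO_MAX_DEVICES);

    double total = 0.0;
    for (int i = 0; i < n_devices; ++i) {
        total += (double)free_bytes[i];
    }

    for (int i = 0; i < DEMO_MAX_DEVICES; ++i) {
        if (i < n_devices && total > 0.0) {
            out[i] = (float)((double)free_bytes[i] / total);
        } else {
            out[i] = 0.0f; /* unused slot, or no free memory reported */
        }
    }
}
```

The resulting array could then be passed as `tensor_split` to `ggml_backend_cuda_split_buffer_type`. Whether the proportions must sum to 1 is not stated on this page; the sketch normalizes them regardless.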
