Implementation:Ggml org Ggml Cuda backend api

From Leeroopedia
Revision as of 15:01, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ggml_org_Ggml_Cuda_backend_api.md)


Metadata

Field | Value
Page Type | Implementation (API Header)
Knowledge Sources | GGML
Domains | ML_Infrastructure, Tensor_Computing, GPU_Computing
Last Updated | 2026-02-10 12:00 GMT

Overview

Declares the CUDA/ROCm/MUSA GPU backend interface for running tensor operations on NVIDIA, AMD, and Moore Threads GPUs.

Description

ggml-cuda.h (47 lines) provides the public API for the GPU backend that supports three hardware platforms through conditional compilation:

  • CUDA (NVIDIA) -- default, uses cuBLAS
  • ROCm/HIP (AMD) -- when GGML_USE_HIP is defined, uses hipBLAS
  • MUSA (Moore Threads) -- when GGML_USE_MUSA is defined, uses muBLAS

The header defines name macros (GGML_CUDA_NAME, GGML_CUBLAS_NAME) that resolve to the appropriate platform strings, and GGML_CUDA_MAX_DEVICES = 16.
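
The conditional-compilation pattern described above can be sketched as follows. This is an illustrative reconstruction, not a copy of the header: the exact string values are assumptions based on the platform names listed above, and backend_label() is a hypothetical helper added only to show the macros in use.

```c
/* Sketch of the platform-name selection pattern.
 * Compile with -DGGML_USE_HIP or -DGGML_USE_MUSA to select another branch.
 * The string values are assumed from the platform list, not quoted from ggml. */
#if defined(GGML_USE_HIP)
#define GGML_CUDA_NAME   "ROCm"
#define GGML_CUBLAS_NAME "hipBLAS"
#elif defined(GGML_USE_MUSA)
#define GGML_CUDA_NAME   "MUSA"
#define GGML_CUBLAS_NAME "muBLAS"
#else
#define GGML_CUDA_NAME   "CUDA"
#define GGML_CUBLAS_NAME "cuBLAS"
#endif

/* Hypothetical helper (not part of ggml): joins both names at compile time
 * through adjacent string-literal concatenation. */
static const char *backend_label(void) {
    return GGML_CUDA_NAME " (" GGML_CUBLAS_NAME ")";
}
```

Because the names are plain string literals, calling code can embed them in log messages without knowing which platform the library was built for.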

API functions:

  • ggml_backend_cuda_init(device) -- initialize a GPU backend for a specific device
  • ggml_backend_is_cuda(backend) -- check whether a backend handle was created by this backend
  • ggml_backend_cuda_buffer_type(device) -- device-specific buffer type
  • ggml_backend_cuda_split_buffer_type(main_device, tensor_split) -- split buffer for multi-GPU tensor parallelism (distributes matrix rows across devices)
  • ggml_backend_cuda_host_buffer_type() -- pinned host memory for fast CPU-GPU transfers
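  • ggml_backend_cuda_get_device_count() -- number of available GPU devices
  • ggml_backend_cuda_get_device_description(device, description, description_size) -- write a device description string into a caller-supplied buffer
  • ggml_backend_cuda_get_device_memory(device, free, total) -- report free and total memory for a device
  • ggml_backend_cuda_register_host_buffer(buffer, size), ggml_backend_cuda_unregister_host_buffer(buffer) -- pin/unpin host memory for DMA transfers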
  • ggml_backend_cuda_reg() -- backend registration handle

Usage

Include this header in application code to initialize GPU backends, configure multi-GPU tensor splitting, and manage GPU memory. The same functions apply whether the backend was built for CUDA, ROCm/HIP, or MUSA.

Code Reference

Source Location

GGML repo, file: include/ggml-cuda.h, 47 lines.

Signature

#define GGML_CUDA_MAX_DEVICES 16

GGML_BACKEND_API ggml_backend_t ggml_backend_cuda_init(int device);
GGML_BACKEND_API bool ggml_backend_is_cuda(ggml_backend_t backend);
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_cuda_buffer_type(int device);
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_cuda_split_buffer_type(
    int main_device, const float * tensor_split);
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_cuda_host_buffer_type(void);
GGML_BACKEND_API int  ggml_backend_cuda_get_device_count(void);
GGML_BACKEND_API void ggml_backend_cuda_get_device_description(
    int device, char * description, size_t description_size);
GGML_BACKEND_API void ggml_backend_cuda_get_device_memory(
    int device, size_t * free, size_t * total);
GGML_BACKEND_API bool ggml_backend_cuda_register_host_buffer(void * buffer, size_t size);
GGML_BACKEND_API void ggml_backend_cuda_unregister_host_buffer(void * buffer);
GGML_BACKEND_API ggml_backend_reg_t ggml_backend_cuda_reg(void);

Import

#include "ggml-cuda.h"

Dependencies

  • ggml.h -- core GGML types
  • ggml-backend.h -- backend abstraction types

I/O Contract

Inputs

Parameter | Type | Required | Description
device | int | Yes | GPU device index, from 0 to ggml_backend_cuda_get_device_count() - 1 (at most GGML_CUDA_MAX_DEVICES devices).
main_device | int | Yes (for split buffer) | Primary device for multi-GPU split.
tensor_split | const float * | Yes (for split buffer) | Array of proportions, one entry per device, for distributing tensor rows across devices.
buffer | void * | Yes (for register/unregister) | Host memory buffer to pin/unpin.
size | size_t | Yes (for register) | Size in bytes of the host buffer to pin.

Outputs

Output | Type | Description
Backend handle | ggml_backend_t | Initialized CUDA/ROCm/MUSA backend for the specified device.
Buffer type | ggml_backend_buffer_type_t | Device, split, or host buffer type interface.
Device count | int | Number of available GPU devices.
Registration success | bool | Whether host buffer pinning succeeded.

Usage Examples

Single-GPU Setup

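#include <stdio.h>
#include "ggml-cuda.h"

int main(void) {
    ggml_backend_t gpu = ggml_backend_cuda_init(0);
    if (gpu == NULL) return 1;  // assumed: NULL signals a failed init (e.g. bad device index)

    size_t free_mem, total_mem;  // bytes; named to avoid shadowing free()
    ggml_backend_cuda_get_device_memory(0, &free_mem, &total_mem);
    printf("GPU 0: %.1f GB free / %.1f GB total\n", free_mem / 1e9, total_mem / 1e9);

    ggml_backend_free(gpu);  // release the backend (declared in ggml-backend.h)
    return 0;
}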

Multi-GPU Tensor Parallelism

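#include "ggml-cuda.h"

// Split tensors 60/40 across two GPUs. The array is sized GGML_CUDA_MAX_DEVICES
// and zero-filled past the second entry, on the assumption that the backend may
// read one entry per possible device.
float split[GGML_CUDA_MAX_DEVICES] = { 0.6f, 0.4f };
ggml_backend_buffer_type_t split_buft =
    ggml_backend_cuda_split_buffer_type(0, split);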
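
To make the 60/40 proportions concrete, the sketch below shows one way proportions could be turned into per-device row ranges. It is a simplified illustration, not the backend's actual code: split_rows() is a hypothetical helper, and the sketch ignores any alignment or rounding of row boundaries that a real implementation may apply.

```c
#include <stddef.h>

/* Hypothetical helper (not part of ggml): converts per-device proportions
 * into half-open row ranges [row_low[i], row_high[i]) that together cover
 * rows 0..nrows-1. Proportions need not sum to 1; they are normalized. */
static void split_rows(const float *proportions, int n_devices, long nrows,
                       long *row_low, long *row_high) {
    double sum = 0.0;
    for (int i = 0; i < n_devices; ++i) {
        sum += proportions[i];
    }

    double acc = 0.0;  /* running total of the proportions of earlier devices */
    for (int i = 0; i < n_devices; ++i) {
        row_low[i] = (long)((double)nrows * (acc / sum));
        acc += proportions[i];
        /* The last device always ends at nrows so no row is dropped. */
        row_high[i] = (i == n_devices - 1)
                          ? nrows
                          : (long)((double)nrows * (acc / sum));
    }
}
```

With proportions { 0.6f, 0.4f } and a 1000-row matrix, this sketch assigns rows [0, 600) to device 0 and rows [600, 1000) to device 1.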
