# Environment: ggml-org llama.cpp CUDA GPU Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Acceleration |
| Last Updated | 2026-02-14 22:00 GMT |
## Overview

NVIDIA CUDA GPU acceleration environment requiring the CUDA Toolkit, CMake 3.18+, and a GPU with compute capability 5.0+ (Maxwell or newer) for GPU-accelerated inference in llama.cpp.
## Description

This environment enables CUDA-based GPU offloading for llama.cpp inference. Model layers are offloaded to the GPU via the `-ngl` flag, dramatically improving token generation speed. The CUDA backend supports FlashAttention kernels, CUDA graphs, cuBLAS matrix multiplication, and multi-GPU tensor splitting. Compute capability determines which features are available (e.g., FP16 tensor cores on Volta+, INT8 tensor cores on Turing+).
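The `-ngl` (`--n-gpu-layers`) flag described above is set per run; a value larger than the model's layer count simply offloads all layers. A minimal invocation sketch (the model path is a placeholder):

```sh
# Offload all layers to the GPU and generate 64 tokens.
# A -ngl value above the model's layer count offloads everything.
./build/bin/llama-cli -m models/llama-7b-q4_0.gguf \
    -ngl 99 \
    -p "Hello" -n 64
```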
## Usage

Use this environment for any GPU-accelerated inference workflow. Enable it by building with `-DGGML_CUDA=ON`. Required when offloading model layers to NVIDIA GPUs for text generation, chat, embedding extraction, server operation, and speculative decoding.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux or Windows | macOS does not support CUDA |
| CMake | >= 3.18 | Required for CMAKE_CUDA_ARCHITECTURES support |
| GPU | NVIDIA with Compute Capability >= 5.0 | Maxwell or newer (GTX 900+, Tesla M40+) |
| VRAM | Depends on model size | 4GB minimum for 7B Q4_0; 16GB+ for larger models |
| Driver | NVIDIA driver compatible with CUDA version | Check nvidia-smi output |
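The VRAM figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates the size of the weight tensors alone from an approximate effective bits-per-weight for each quantization (the figures include block scales; real usage adds KV cache, activations, and CUDA overhead, which is why the table rounds 7B Q4_0 up to 4 GB):

```python
# Approximate effective bits per weight, including per-block scale data.
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,   # 18 bytes per 32-weight block
    "Q8_0": 8.5,   # 34 bytes per 32-weight block
    "F16": 16.0,
}

def weights_gib(n_params: float, quant: str) -> float:
    """Approximate size of the weight tensors alone, in GiB."""
    total_bits = BITS_PER_WEIGHT[quant] * n_params
    return total_bits / 8 / 1024**3

print(f"7B Q4_0 weights: ~{weights_gib(7e9, 'Q4_0'):.1f} GiB")  # ~3.7 GiB
```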
## Dependencies

### System Packages

- CUDA Toolkit (11.x or 12.x)
- NVIDIA driver (matching the CUDA version)
- CMake >= 3.18
- C++17 compiler (nvcc-compatible)
### CUDA Architecture by GPU Series

| GPU Series | Compute Capability (CMake arch) | Minimum CUDA Version |
|---|---|---|
| Maxwell (GTX 900) | 50 | CUDA 11+ |
| Pascal (GTX 1000) | 61 | CUDA 11+ |
| Volta (V100) | 70 | CUDA 11+ |
| Turing (RTX 2000) | 75 | CUDA 11+ |
| Ampere (RTX 3000) | 86 | CUDA 11.1+ |
| Ada Lovelace (RTX 4000) | 89 | CUDA 11.8+ |
| Blackwell (RTX 5000) | 120 | CUDA 12.8+ |
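The table above reduces to a simple lookup. A small sketch, using the arch numbers as they appear in `CMAKE_CUDA_ARCHITECTURES` (the function name is illustrative, not part of llama.cpp):

```python
def min_cuda_version(arch: int) -> str:
    """Minimum CUDA toolkit version for a CMake arch number, per the table above."""
    if arch >= 120:   # Blackwell (RTX 5000)
        return "12.8"
    if arch >= 89:    # Ada Lovelace (RTX 4000)
        return "11.8"
    if arch >= 86:    # Ampere (RTX 3000)
        return "11.1"
    if arch >= 50:    # Maxwell through Turing
        return "11.0"
    raise ValueError(f"arch {arch} is below the supported minimum (50)")

print(min_cuda_version(89))  # prints 11.8
```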
## Credentials
No credentials are required for CUDA builds.
## Quick Install

```sh
# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

# Build for a specific architecture (e.g., RTX 4090)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89

# Build with FlashAttention and all quant types
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
```
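If you are unsure which architecture number to pass, recent NVIDIA drivers can report it directly (the `compute_cap` query field requires a reasonably new driver; on older drivers, look the GPU up in NVIDIA's compute capability table instead):

```sh
# Print the compute capability of each visible GPU, e.g. "8.9" for an RTX 4090
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Then pass it to CMake without the dot:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
```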
## Code Evidence

CMake minimum version for CUDA, from `ggml/src/ggml-cuda/CMakeLists.txt:1`:

```cmake
cmake_minimum_required(VERSION 3.18)  # for CMAKE_CUDA_ARCHITECTURES
```

CUDA architecture defaults from `ggml/src/ggml-cuda/CMakeLists.txt`:

```cmake
# Lowest CUDA 12 standard + desktop/mobile GPUs
set(CUDA_ARCHITECTURES "50-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-real")
if (CUDA_VERSION VERSION_GREATER_EQUAL "11.8")
    list(APPEND CUDA_ARCHITECTURES "89-real")  # RTX 4000
endif()
if (CUDA_VERSION VERSION_GREATER_EQUAL "12.8")
    list(APPEND CUDA_ARCHITECTURES "120a-real")  # Blackwell
endif()
```

CUDA build options from `ggml/src/ggml-cuda/CMakeLists.txt`:

```cmake
option(GGML_CUDA_FORCE_MMQ    "ggml: use mmq kernels instead of cuBLAS"    OFF)
option(GGML_CUDA_FORCE_CUBLAS "ggml: always use cuBLAS instead of mmq"     OFF)
option(GGML_CUDA_FA           "ggml: compile FlashAttention CUDA kernels"  ON)
option(GGML_CUDA_GRAPHS       "ggml: use CUDA graphs"                      ON)
```
## Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `CMAKE_CUDA_ARCHITECTURES must be non-empty` | No GPU detected during build | Set `-DCMAKE_CUDA_ARCHITECTURES=native` or specify the GPU arch explicitly |
| `CUDA out of memory` | Model too large for GPU VRAM | Reduce `-ngl` (offload fewer layers) or use a smaller quantization |
| `no kernel image is available for execution` | Binary not compiled for this GPU arch | Rebuild with the correct `CMAKE_CUDA_ARCHITECTURES` for your GPU |
| `CUDA driver version is insufficient` | NVIDIA driver too old for the CUDA toolkit | Update the NVIDIA driver to match the CUDA version |
## Compatibility Notes

- **Multi-GPU**: Supported via tensor splitting (`--tensor-split`). Peer-to-peer copies are enabled by default (disable with `GGML_CUDA_NO_PEER_COPY`).
- **Unified Memory**: On Linux, set `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` to use system RAM as overflow when VRAM is insufficient.
- **CUDA Graphs**: Enabled by default (`GGML_CUDA_GRAPHS=ON`). Reduces kernel launch overhead for repeated inference.
- **FlashAttention**: Enabled by default. Only compiles common quant types; use `GGML_CUDA_FA_ALL_QUANTS=ON` for all types.
- **cuBLAS vs MMQ**: By default, the backend auto-selects between cuBLAS and custom MMQ kernels. Force one with `GGML_CUDA_FORCE_CUBLAS` or `GGML_CUDA_FORCE_MMQ`.
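The multi-GPU and unified-memory notes translate into command-line usage like the following sketch (model path and split ratio are illustrative):

```sh
# Split tensors 3:1 across two GPUs (e.g., a 24 GB card paired with an 8 GB card)
./build/bin/llama-cli -m models/model.gguf -ngl 99 --tensor-split 3,1 -p "Hi" -n 32

# Linux only: allow VRAM to overflow into system RAM via unified memory
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-cli -m models/model.gguf -ngl 99 -p "Hi" -n 32
```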