# Environment: ggml-org llama.cpp CUDA GPU Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Acceleration |
| Last Updated | 2026-02-14 22:00 GMT |
## Overview

NVIDIA CUDA GPU acceleration environment requiring the CUDA Toolkit, CMake 3.18+, and a GPU with compute capability 5.0+ (Maxwell or newer) for GPU-accelerated inference in llama.cpp.
## Description

This environment enables CUDA-based GPU offloading for llama.cpp inference. Model layers are offloaded to the GPU via the `-ngl` flag, dramatically improving token generation speed. The CUDA backend supports FlashAttention kernels, CUDA graphs, cuBLAS matrix multiplication, and multi-GPU tensor splitting. Compute capability determines which features are available (e.g., FP16 tensor cores on Volta+, INT8 tensor cores on Turing+).
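The `-ngl` (`--n-gpu-layers`) flag described above is set per run; a value larger than the model's layer count simply offloads all layers. A minimal invocation sketch (the model path is a placeholder):

```sh
# Offload all layers to the GPU and generate 64 tokens.
# A -ngl value above the model's layer count offloads everything.
./build/bin/llama-cli -m models/llama-7b-q4_0.gguf \
    -ngl 99 \
    -p "Hello" -n 64
```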
## Usage

Use this environment for any GPU-accelerated inference workflow. Enable it by building with `-DGGML_CUDA=ON`. Required when offloading model layers to NVIDIA GPUs for text generation, chat, embedding extraction, server operation, and speculative decoding.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux or Windows | macOS does not support CUDA |
| CMake | >= 3.18 | Required for CMAKE_CUDA_ARCHITECTURES support |
| GPU | NVIDIA with Compute Capability >= 5.0 | Maxwell or newer (GTX 900+, Tesla M40+) |
| VRAM | Depends on model size | 4GB minimum for 7B Q4_0; 16GB+ for larger models |
| Driver | NVIDIA driver compatible with CUDA version | Check nvidia-smi output |
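The VRAM figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates the size of the weight tensors alone from an approximate effective bits-per-weight for each quantization (the figures include block scales; real usage adds KV cache, activations, and CUDA overhead, which is why the table rounds 7B Q4_0 up to 4 GB):

```python
# Approximate effective bits per weight, including per-block scale data.
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,   # 18 bytes per 32-weight block
    "Q8_0": 8.5,   # 34 bytes per 32-weight block
    "F16": 16.0,
}

def weights_gib(n_params: float, quant: str) -> float:
    """Approximate size of the weight tensors alone, in GiB."""
    total_bits = BITS_PER_WEIGHT[quant] * n_params
    return total_bits / 8 / 1024**3

print(f"7B Q4_0 weights: ~{weights_gib(7e9, 'Q4_0'):.1f} GiB")  # ~3.7 GiB
```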
## Dependencies

### System Packages

- CUDA Toolkit (11.x or 12.x)
- NVIDIA driver (matching the CUDA version)
- CMake >= 3.18
- C++17 compiler (nvcc-compatible)
### CUDA Architecture by GPU Series

| GPU Series | Compute Capability (CMake arch) | Minimum CUDA Version |
|---|---|---|
| Maxwell (GTX 900) | 50 | CUDA 11+ |
| Pascal (GTX 1000) | 61 | CUDA 11+ |
| Volta (V100) | 70 | CUDA 11+ |
| Turing (RTX 2000) | 75 | CUDA 11+ |
| Ampere (RTX 3000) | 86 | CUDA 11.1+ |
| Ada Lovelace (RTX 4000) | 89 | CUDA 11.8+ |
| Blackwell (RTX 5000) | 120 | CUDA 12.8+ |
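The table above reduces to a simple lookup. A small sketch, using the arch numbers as they appear in `CMAKE_CUDA_ARCHITECTURES` (the function name is illustrative, not part of llama.cpp):

```python
def min_cuda_version(arch: int) -> str:
    """Minimum CUDA toolkit version for a CMake arch number, per the table above."""
    if arch >= 120:   # Blackwell (RTX 5000)
        return "12.8"
    if arch >= 89:    # Ada Lovelace (RTX 4000)
        return "11.8"
    if arch >= 86:    # Ampere (RTX 3000)
        return "11.1"
    if arch >= 50:    # Maxwell through Turing
        return "11.0"
    raise ValueError(f"arch {arch} is below the supported minimum (50)")

print(min_cuda_version(89))  # prints 11.8
```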
## Credentials
No credentials are required for CUDA builds.
## Quick Install

```sh
# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

# Build for a specific architecture (e.g., RTX 4090)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89

# Build with FlashAttention and all quant types
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
```
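If you are unsure which architecture number to pass, recent NVIDIA drivers can report it directly (the `compute_cap` query field requires a reasonably new driver; on older drivers, look the GPU up in NVIDIA's compute capability table instead):

```sh
# Print the compute capability of each visible GPU, e.g. "8.9" for an RTX 4090
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Then pass it to CMake without the dot:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
```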
## Code Evidence

CMake minimum version for CUDA, from `ggml/src/ggml-cuda/CMakeLists.txt:1`:

```cmake
cmake_minimum_required(VERSION 3.18)  # for CMAKE_CUDA_ARCHITECTURES
```

CUDA architecture defaults from `ggml/src/ggml-cuda/CMakeLists.txt`:

```cmake
# Lowest CUDA 12 standard + desktop/mobile GPUs
set(CUDA_ARCHITECTURES "50-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-real")
if (CUDA_VERSION VERSION_GREATER_EQUAL "11.8")
    list(APPEND CUDA_ARCHITECTURES "89-real")  # RTX 4000
endif()
if (CUDA_VERSION VERSION_GREATER_EQUAL "12.8")
    list(APPEND CUDA_ARCHITECTURES "120a-real")  # Blackwell
endif()
```

CUDA build options from `ggml/src/ggml-cuda/CMakeLists.txt`:

```cmake
option(GGML_CUDA_FORCE_MMQ    "ggml: use mmq kernels instead of cuBLAS"    OFF)
option(GGML_CUDA_FORCE_CUBLAS "ggml: always use cuBLAS instead of mmq"     OFF)
option(GGML_CUDA_FA           "ggml: compile FlashAttention CUDA kernels"  ON)
option(GGML_CUDA_GRAPHS       "ggml: use CUDA graphs"                      ON)
```
## Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `CMAKE_CUDA_ARCHITECTURES must be non-empty` | No GPU detected during build | Set `-DCMAKE_CUDA_ARCHITECTURES=native` or specify the GPU arch explicitly |
| `CUDA out of memory` | Model too large for GPU VRAM | Reduce `-ngl` (offload fewer layers) or use a smaller quantization |
| `no kernel image is available for execution` | Binary not compiled for this GPU arch | Rebuild with the correct `CMAKE_CUDA_ARCHITECTURES` for your GPU |
| `CUDA driver version is insufficient` | NVIDIA driver too old for the CUDA toolkit | Update the NVIDIA driver to match the CUDA version |
## Compatibility Notes

- **Multi-GPU**: Supported via tensor splitting (`--tensor-split`). Peer-to-peer copies are enabled by default (disable with `GGML_CUDA_NO_PEER_COPY`).
- **Unified Memory**: On Linux, set `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` to use system RAM as overflow when VRAM is insufficient.
- **CUDA Graphs**: Enabled by default (`GGML_CUDA_GRAPHS=ON`). Reduces kernel launch overhead for repeated inference.
- **FlashAttention**: Enabled by default. Only compiles common quant types; use `GGML_CUDA_FA_ALL_QUANTS=ON` for all types.
- **cuBLAS vs MMQ**: By default, the backend auto-selects between cuBLAS and custom MMQ kernels. Force one with `GGML_CUDA_FORCE_CUBLAS` or `GGML_CUDA_FORCE_MMQ`.
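The multi-GPU and unified-memory notes translate into command-line usage like the following sketch (model path and split ratio are illustrative):

```sh
# Split tensors 3:1 across two GPUs (e.g., a 24 GB card paired with an 8 GB card)
./build/bin/llama-cli -m models/model.gguf -ngl 99 --tensor-split 3,1 -p "Hi" -n 32

# Linux only: allow VRAM to overflow into system RAM via unified memory
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-cli -m models/model.gguf -ngl 99 -p "Hi" -n 32
```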