
Environment: ggml-org / llama.cpp CUDA GPU Environment

From Leeroopedia
Domains: Infrastructure, GPU_Acceleration
Last Updated: 2026-02-14 22:00 GMT

Overview

An NVIDIA CUDA GPU acceleration environment for llama.cpp, requiring the CUDA Toolkit, CMake 3.18+, and a GPU with compute capability 5.0+ (Maxwell or newer) for GPU-accelerated inference.

Description

This environment enables CUDA-based GPU offloading for llama.cpp inference. Model layers are offloaded to the GPU via the -ngl flag, dramatically improving token generation speed. The CUDA backend supports FlashAttention kernels, CUDA graphs, cuBLAS matrix multiplication, and multi-GPU tensor splitting. Compute capability determines which features are available (e.g., FP16 tensor cores on Volta+, INT8 tensor cores on Turing+).
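To illustrate how -ngl interacts with VRAM, here is a rough back-of-the-envelope sketch (not llama.cpp's actual allocator; the function name, the equal-layer-size assumption, and the fixed overhead are all simplifications for illustration):

```python
def layers_that_fit(vram_gib, n_layers, model_gib, overhead_gib=1.0):
    """Estimate how many layers can be offloaded via -ngl.

    Assumes all layers are roughly equal in size and reserves a fixed
    overhead for the KV cache and CUDA context -- a crude heuristic.
    """
    per_layer_gib = model_gib / n_layers
    usable_gib = max(vram_gib - overhead_gib, 0.0)
    return min(n_layers, int(usable_gib / per_layer_gib))

# A hypothetical 32-layer, ~3.9 GiB model on a 3 GiB card:
# only part of the model fits, so -ngl would be set below 32.
print(layers_that_fit(vram_gib=3.0, n_layers=32, model_gib=3.9))
```

In practice, watch nvidia-smi while loading and back off -ngl if allocation fails.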

Usage

Use this environment for any GPU-accelerated inference workflow. Enable by building with -DGGML_CUDA=ON. Required when offloading model layers to NVIDIA GPUs for text generation, chat, embedding extraction, server operation, and speculative decoding.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux or Windows | macOS does not support CUDA |
| CMake | >= 3.18 | Required for CMAKE_CUDA_ARCHITECTURES support |
| GPU | NVIDIA, compute capability >= 5.0 | Maxwell or newer (GTX 900+, Tesla M40+) |
| VRAM | Depends on model size | ~4 GB minimum for 7B Q4_0; 16 GB+ for larger models |
| Driver | NVIDIA driver compatible with CUDA version | Check nvidia-smi output |
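The VRAM row can be sanity-checked from the quantization format's bits per weight. The sketch below uses the effective storage cost of each format (e.g. Q4_0 stores 32 weights in 18 bytes, i.e. 4.5 bits/weight); it covers weights only, not the KV cache or activations:

```python
# Effective bits per weight for common GGUF quantizations.
# Q4_0: 18 bytes per 32-weight block; Q8_0: 34 bytes per block.
BITS_PER_WEIGHT = {"Q4_0": 4.5, "Q8_0": 8.5, "F16": 16.0}

def weights_gib(n_params, quant):
    """GiB needed just for the model weights at a given quantization."""
    total_bytes = n_params * BITS_PER_WEIGHT[quant] / 8
    return total_bytes / 2**30

# A 7B model at Q4_0 -- consistent with the ~4 GB minimum above.
print(round(weights_gib(7e9, "Q4_0"), 2))
```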

Dependencies

System Packages

  • CUDA Toolkit (11.x or 12.x)
  • nvidia-driver (matching CUDA version)
  • cmake >= 3.18
  • C++17 compiler (nvcc-compatible)

CUDA Architecture by GPU Series

| GPU Series | Compute Capability (CMake arch) | Minimum CUDA Version |
|------------|--------------------------------|----------------------|
| Maxwell (GTX 900) | 5.0 (50) | CUDA 11+ |
| Pascal (GTX 1000) | 6.1 (61) | CUDA 11+ |
| Volta (V100) | 7.0 (70) | CUDA 11+ |
| Turing (RTX 2000) | 7.5 (75) | CUDA 11+ |
| Ampere (RTX 3000) | 8.6 (86) | CUDA 11.1+ |
| Ada Lovelace (RTX 4000) | 8.9 (89) | CUDA 11.8+ |
| Blackwell (RTX 5000) | 12.0 (120) | CUDA 12.8+ |
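The table above can be encoded as a small lookup, useful for scripting toolkit checks before a build (a convenience sketch; the function name is ours, and the mapping simply mirrors the table):

```python
# Minimum CUDA toolkit per CMake architecture code, per the table above.
MIN_CUDA = {50: (11, 0), 61: (11, 0), 70: (11, 0), 75: (11, 0),
            86: (11, 1), 89: (11, 8), 120: (12, 8)}

def toolkit_ok(arch, cuda_version):
    """True if (major, minor) cuda_version can target the given arch."""
    if arch not in MIN_CUDA:
        raise ValueError(f"unknown arch code: {arch}")
    return cuda_version >= MIN_CUDA[arch]

print(toolkit_ok(89, (12, 4)))   # Ada on CUDA 12.4
print(toolkit_ok(120, (12, 2)))  # Blackwell needs 12.8+
```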

Credentials

No credentials are required for CUDA builds.

Quick Install

# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)

# Build with specific architecture (e.g., RTX 4090)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89

# Build with FlashAttention and all quant types
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON

Code Evidence

CMake minimum version for CUDA from ggml/src/ggml-cuda/CMakeLists.txt:1:

cmake_minimum_required(VERSION 3.18) # for CMAKE_CUDA_ARCHITECTURES

CUDA architecture defaults from ggml/src/ggml-cuda/CMakeLists.txt:

# Lowest CUDA 12 standard + desktop/mobile GPUs
set(CUDA_ARCHITECTURES "50-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-real")
if (CUDA_VERSION VERSION_GREATER_EQUAL "11.8")
    list(APPEND CUDA_ARCHITECTURES "89-real")  # RTX 4000
endif()
if (CUDA_VERSION VERSION_GREATER_EQUAL "12.8")
    list(APPEND CUDA_ARCHITECTURES "120a-real") # Blackwell
endif()

CUDA build options from ggml/src/ggml-cuda/CMakeLists.txt:

option(GGML_CUDA_FORCE_MMQ     "ggml: use mmq kernels instead of cuBLAS"  OFF)
option(GGML_CUDA_FORCE_CUBLAS  "ggml: always use cuBLAS instead of mmq"   OFF)
option(GGML_CUDA_FA            "ggml: compile FlashAttention CUDA kernels" ON)
option(GGML_CUDA_GRAPHS        "ggml: use CUDA graphs"                    ON)

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| CMAKE_CUDA_ARCHITECTURES must be non-empty | No GPU detected during build | Set -DCMAKE_CUDA_ARCHITECTURES=native or specify the GPU arch explicitly |
| CUDA out of memory | Model too large for GPU VRAM | Reduce -ngl (offload fewer layers) or use a smaller quantization |
| no kernel image is available for execution | Binary not compiled for this GPU arch | Rebuild with the correct CMAKE_CUDA_ARCHITECTURES for your GPU |
| CUDA driver version is insufficient | NVIDIA driver too old for the CUDA toolkit | Update the NVIDIA driver to match the CUDA version |
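For automated runs, the table above lends itself to simple substring triage of build or runtime logs (a sketch; the helper name and the fallback message are ours, and real logs may phrase errors differently):

```python
# Ordered (substring, remedy) pairs drawn from the table above.
REMEDIES = [
    ("CMAKE_CUDA_ARCHITECTURES must be non-empty",
     "pass -DCMAKE_CUDA_ARCHITECTURES=native or the explicit arch code"),
    ("no kernel image is available",
     "rebuild with the correct CMAKE_CUDA_ARCHITECTURES for this GPU"),
    ("out of memory",
     "lower -ngl or switch to a smaller quantization"),
    ("driver version is insufficient",
     "update the NVIDIA driver to match the CUDA toolkit"),
]

def triage(log_text):
    """Return a suggested remedy for the first recognized CUDA error."""
    for needle, fix in REMEDIES:
        if needle in log_text:
            return fix
    return "unrecognized error; inspect the full log"

print(triage("CUDA error: no kernel image is available for execution"))
```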

Compatibility Notes

  • Multi-GPU: Supported via tensor splitting (--tensor-split). Peer-to-peer copies enabled by default (disable with GGML_CUDA_NO_PEER_COPY).
  • Unified Memory: On Linux, set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 to use system RAM as overflow when VRAM is insufficient.
  • CUDA Graphs: Enabled by default (GGML_CUDA_GRAPHS=ON). Reduces kernel launch overhead for repeated inference.
  • FlashAttention: Enabled by default. Only compiles common quant types; use GGML_CUDA_FA_ALL_QUANTS=ON for all types.
  • cuBLAS vs MMQ: By default, the backend auto-selects between cuBLAS and custom MMQ kernels. Force one with GGML_CUDA_FORCE_CUBLAS or GGML_CUDA_FORCE_MMQ.
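For the multi-GPU case, --tensor-split takes per-GPU proportions; a common heuristic is to split in proportion to each card's free VRAM. A minimal sketch (the function name is ours; llama.cpp itself just consumes the comma-separated values):

```python
def tensor_split_arg(vram_gib_per_gpu):
    """Build a --tensor-split value proportional to each GPU's VRAM."""
    total = sum(vram_gib_per_gpu)
    return ",".join(f"{v / total:.2f}" for v in vram_gib_per_gpu)

# e.g. a 24 GiB card paired with an 8 GiB card:
print(tensor_split_arg([24, 8]))  # -> 0.75,0.25
```

The resulting string would be passed as, e.g., `--tensor-split 0.75,0.25` on the llama.cpp command line.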
