
Heuristic: ggml-org/llama.cpp GPU Layer Offloading Verification

From Leeroopedia
Domains: Debugging, GPU_Acceleration
Last Updated: 2026-02-14 22:00 GMT

Overview

Always verify that GPU layer offloading is active by checking llama.cpp's startup diagnostic output before assuming GPU acceleration is working; missing offloading is a common cause of unexpectedly slow inference.

Description

A frequent pitfall when using llama.cpp with GPU support is assuming the GPU is being used when it is not. This can happen because the binary was compiled without GPU support, the -ngl flag was not provided, or the GPU backend failed to initialize. The performance difference is dramatic: CPU-only inference on a large model can be 50-100x slower than GPU-accelerated inference. llama.cpp logs diagnostic information at startup that confirms whether layers are offloaded.

Usage

Use this heuristic when inference speed is unexpectedly slow and you expect GPU acceleration to be active. This is the first debugging step before investigating other performance issues.

The Insight (Rule of Thumb)

  • Action 1: Pass a very large -ngl value (e.g., -ngl 200000) so that every possible layer is offloaded to the GPU.
  • Action 2: Check the startup log for lines of the form "offloading N layers to GPU" and "total VRAM used: X MB".
  • Action 3: If no offloading lines appear, the binary was most likely compiled without GPU support (or the GPU backend failed to initialize). Rebuild with the correct backend flag.
  • Value: -ngl -1 means auto (let the system decide); -ngl -2 or lower means offload all layers.
  • Trade-off: Offloading too many layers can exhaust VRAM, causing remaining layers to fall back to the CPU.
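
The checklist above can be sketched as a small shell script. The binary name ./llama-cli, the model path model.gguf, and the -ngl value are placeholders for your setup; a sample log is written inline here so the check itself runs without a GPU build:

```shell
# Sketch: verify GPU layer offloading from llama.cpp startup diagnostics.
# In practice, capture the log from a real run (binary and model are placeholders):
#   ./llama-cli -m model.gguf -ngl 200000 -p "hi" 2> startup.log
# A sample log matching the documented output is written inline instead.
cat > startup.log <<'EOF'
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB
EOF

if grep -q "offloading .* layers to GPU" startup.log; then
  echo "GPU offloading active"
else
  echo "no offload lines found: binary likely built without GPU support" >&2
fi
```

If the else branch fires, rebuild with the backend flag for your hardware before investigating anything else.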

Reasoning

The performance tips documentation provides a concrete example from docs/development/token_generation_performance_tips.md:

When running llama, before it starts the inference work, it will output
diagnostic information that shows whether cuBLAS is offloading work to
the GPU. Look for these lines:

llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB

If you see these lines, then the GPU is being used.
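
When the lines are present, the layer count and VRAM figure can also be pulled out for scripted checks. A minimal sketch using portable sed patterns against the sample lines above (adjust the patterns if your build's log format differs):

```shell
# Sketch: extract the offloaded layer count and VRAM usage from captured diagnostics.
log='llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB'

# Portable BRE patterns ([0-9][0-9]* instead of GNU-only [0-9]\+).
layers=$(printf '%s\n' "$log" | sed -n 's/.*offloading \([0-9][0-9]*\) layers to GPU.*/\1/p')
vram_mb=$(printf '%s\n' "$log" | sed -n 's/.*total VRAM used: \([0-9][0-9]*\) MB.*/\1/p')

echo "layers=$layers vram_mb=$vram_mb"   # layers=60 vram_mb=17223
```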

Performance impact comparison on a 30B model:

Configuration          Tokens/sec   Analysis
-ngl 2000000 (no -t)   < 0.1        GPU offloaded, but no CPU threads for the remaining work
-t 7 (no GPU)          1.7          CPU-only; painfully slow for a 30B model
-t 7 -ngl 2000000      8.7          Balanced GPU + CPU
-t 4 -ngl 2000000      9.1          Optimal for this hardware
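
A sweep like the one in the table can be scripted. The sketch below only prints the commands rather than running them; ./llama-bench and model.gguf are placeholders, and flag support may vary with the llama.cpp version you built:

```shell
# Sketch: enumerate thread/offload combinations for a benchmark sweep.
cmds=$(for t in 4 7; do
  for ngl in 0 2000000; do
    echo "./llama-bench -m model.gguf -t $t -ngl $ngl"
  done
done)
printf '%s\n' "$cmds"
```

Run each printed command and compare the reported tokens/sec to find the thread count that balances CPU and GPU work on your hardware.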

The n_gpu_layers default from common/common.h:382:

int32_t n_gpu_layers = -1; // number of layers to store in VRAM, -1 is auto, <= -2 is all
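
The convention in that comment can be restated as a tiny shell helper. This is purely an illustration of the documented semantics, not llama.cpp code; actual behavior depends on the backend:

```shell
# Illustration of the documented n_gpu_layers convention:
#   -1 means auto, <= -2 means offload all layers, >= 0 is an explicit count.
interpret_ngl() {
  n=$1
  if [ "$n" -eq -1 ]; then
    echo "auto"
  elif [ "$n" -le -2 ]; then
    echo "all"
  else
    echo "$n layers"
  fi
}

interpret_ngl -1   # auto: let the system decide
interpret_ngl -2   # all: offload every layer
interpret_ngl 60   # 60 layers
```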
