# Heuristic: ggml-org/llama.cpp GPU Layer Offloading Verification
| Knowledge Sources | |
|---|---|
| Domains | Debugging, GPU_Acceleration |
| Last Updated | 2026-02-14 22:00 GMT |
## Overview
Always verify GPU layer offloading is active by checking diagnostic output before assuming GPU acceleration is working; missing offloading is a common cause of slow inference.
## Description
A frequent pitfall when using llama.cpp with GPU support is assuming the GPU is being used when it is not. This can happen because the binary was compiled without GPU support, the `-ngl` flag was not provided, or the GPU backend failed to initialize. The performance difference is dramatic: CPU-only inference on a large model can be 50-100x slower than GPU-accelerated inference. llama.cpp logs diagnostic information at startup that confirms whether layers are offloaded.
## Usage
Use this heuristic when inference speed is unexpectedly slow and you expect GPU acceleration to be active. This is the first debugging step before investigating other performance issues.
## The Insight (Rule of Thumb)
- Action 1: Pass a very large `-ngl` value (e.g., `-ngl 200000`) to offload all possible layers to the GPU.
- Action 2: Check the startup log for lines showing `offloading N layers to GPU` and `total VRAM used: X MB`.
- Action 3: If no offloading lines appear, the binary was not compiled with GPU support. Rebuild with the correct backend flag.
- Value: `-ngl -1` means auto (let the system decide); `-ngl -2` or lower means offload all layers.
- Trade-off: Offloading too many layers can exhaust VRAM, causing fallback to CPU for remaining layers.
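The log check in Action 2 can be automated. A minimal Python sketch (the helper name and regexes are my own invention; the lines it matches are the documented diagnostic lines quoted under Reasoning below):

```python
import re

def gpu_offload_status(log_text: str):
    """Scan llama.cpp startup output for GPU offload diagnostics.

    Returns (layers_offloaded, vram_mb) if an offloading line is present,
    or None if no offloading appears (likely a CPU-only build or missing -ngl).
    """
    layers = re.search(r"offloading (\d+) layers to GPU", log_text)
    if layers is None:
        return None
    vram = re.search(r"total VRAM used:\s*(\d+)", log_text)
    vram_mb = int(vram.group(1)) if vram else None
    return int(layers.group(1)), vram_mb

# Example with the documented startup lines:
log = (
    "llama_model_load_internal: [cublas] offloading 60 layers to GPU\n"
    "llama_model_load_internal: [cublas] total VRAM used: 17223 MB\n"
)
print(gpu_offload_status(log))  # → (60, 17223)
```

A `None` result is the signal for Action 3: rebuild the binary with the correct GPU backend flag.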
## Reasoning
The performance tips documentation (`docs/development/token_generation_performance_tips.md`) provides a concrete example:

> When running llama, before it starts the inference work, it will output diagnostic information that shows whether cuBLAS is offloading work to the GPU. Look for these lines:
>
> ```
> llama_model_load_internal: [cublas] offloading 60 layers to GPU
> llama_model_load_internal: [cublas] offloading output layer to GPU
> llama_model_load_internal: [cublas] total VRAM used: 17223 MB
> ```
>
> If you see these lines, then the GPU is being used.
Performance impact comparison on a 30B model:
| Configuration | Tokens/sec | Analysis |
|---|---|---|
| `-ngl 2000000` (no `-t`) | < 0.1 | GPU offloaded but no CPU threads for remaining work |
| `-t 7` (no GPU) | 1.7 | CPU-only, painfully slow for 30B model |
| `-t 7 -ngl 2000000` | 8.7 | Balanced GPU + CPU |
| `-t 4 -ngl 2000000` | 9.1 | Optimal for this hardware |
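A quick arithmetic check on the table's figures shows the measured speedup from offloading on this particular hardware:

```python
# Measured tokens/sec from the 30B-model table above.
cpu_only = 1.7   # -t 7, CPU only
balanced = 8.7   # -t 7 -ngl 2000000
optimal = 9.1    # -t 4 -ngl 2000000

print(f"balanced speedup: {balanced / cpu_only:.1f}x")  # → 5.1x
print(f"optimal speedup:  {optimal / cpu_only:.1f}x")   # → 5.4x
```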
The `n_gpu_layers` default, from `common/common.h:382`:

```cpp
int32_t n_gpu_layers = -1; // number of layers to store in VRAM, -1 is auto, <= -2 is all
```
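The convention in that comment can be expressed as a toy resolver (a hypothetical helper, not llama.cpp's actual code; `auto_default` stands in for whatever count the backend would choose for auto):

```python
def resolve_gpu_layers(n_gpu_layers: int, model_layers: int, auto_default: int = 0) -> int:
    """Model the documented n_gpu_layers convention from common/common.h:
    -1 = auto (backend decides; represented here by auto_default),
    <= -2 = offload all layers,
    >= 0 = offload exactly that many, capped at the model's layer count."""
    if n_gpu_layers == -1:
        return auto_default
    if n_gpu_layers <= -2:
        return model_layers
    return min(n_gpu_layers, model_layers)

# The "very large -ngl" trick from the rule of thumb: the cap makes
# -ngl 200000 equivalent to "all layers" on a 60-layer model.
print(resolve_gpu_layers(200000, 60))  # → 60
print(resolve_gpu_layers(-2, 60))      # → 60
```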