Heuristic:Ggml org Llama cpp Thread Count Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, CPU_Performance |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
Critical CPU thread tuning heuristic: setting the thread count (-t) too high causes CPU oversaturation and dramatically reduces token generation speed.
Description
CPU thread oversaturation is one of the most common performance bottlenecks in llama.cpp. When the thread count exceeds the number of physical CPU cores, threads compete for the same cores, context-switching overhead starts to dominate compute time, and performance degrades sharply. By default, the thread count comes from cpu_get_num_math(), which returns the number of math-capable cores, but this may still be too high in certain configurations (e.g., shared systems, NUMA nodes, or when GPU offloading handles most of the work).
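To make the physical-versus-logical distinction concrete, here is a minimal sketch that counts physical cores from /proc/cpuinfo-style text. It assumes the x86 Linux layout, where each processor block lists a "physical id" line before its "core id" line; other platforms (ARM, macOS, Windows) expose this differently. The function name count_physical_cores is hypothetical and is not part of llama.cpp.

```cpp
#include <cassert>
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper (not llama.cpp code): counts distinct
// (physical id, core id) pairs in /proc/cpuinfo-style text.
// Hyperthreaded siblings share the same pair, so they are counted once.
int count_physical_cores(std::istream & in) {
    std::set<std::pair<int, int>> cores;
    int physical_id = -1;  // socket of the processor block being read
    std::string line;
    while (std::getline(in, line)) {
        const auto colon = line.find(':');
        if (colon == std::string::npos) {
            continue;  // blank separator line between processor blocks
        }
        std::string key = line.substr(0, colon);
        key.erase(key.find_last_not_of(" \t") + 1);  // trim trailing padding
        if (key == "physical id") {
            physical_id = std::stoi(line.substr(colon + 1));
        } else if (key == "core id") {
            cores.insert({physical_id, std::stoi(line.substr(colon + 1))});
        }
    }
    assert(!in.bad());  // a read error would silently undercount
    return static_cast<int>(cores.size());
}
```

On a Linux machine the function could be fed a std::ifstream opened on /proc/cpuinfo. If it returns 0 (for example because the fields are absent), the count is unknown and should not be passed to -t.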
Usage
Use this heuristic when token generation speed is unexpectedly slow or when configuring thread counts for inference. This is especially important when combining CPU threads with GPU offloading (-ngl), where fewer CPU threads may actually yield better throughput.
The Insight (Rule of Thumb)
- Action: Set -t N to the number of physical CPU cores (not logical/hyperthreaded cores).
- Value: Start with -t 1 to test, then double until performance degrades, then scale back.
- Trade-off: Too few threads underutilizes CPU; too many causes context-switch overhead.
- With GPU: When using -ngl, fewer CPU threads are often optimal since the GPU handles most work.
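The "start at 1, double, then scale back" procedure can be sketched as a small search loop. This is an illustration under stated assumptions, not llama.cpp code: tune_thread_count and TuneResult are hypothetical names, and measure stands in for whatever benchmark you run (it takes a thread count and returns tokens per second).

```cpp
#include <cassert>
#include <functional>

// Hypothetical result type: best thread count found and its throughput.
struct TuneResult {
    int    n_threads;
    double tokens_per_sec;
};

// Doubling search: try 1, 2, 4, ... threads up to max_threads and stop
// at the first step that fails to improve throughput, keeping the last
// thread count that did improve it.
TuneResult tune_thread_count(int max_threads,
                             const std::function<double(int)> & measure) {
    assert(max_threads >= 1);
    TuneResult best{1, measure(1)};
    for (int t = 2; t <= max_threads; t *= 2) {
        const double tps = measure(t);
        if (tps <= best.tokens_per_sec) {
            break;  // performance degraded: scale back to the previous value
        }
        best = {t, tps};
    }
    return best;
}
```

Because only powers of two are tried, the loop never tests a value such as 7 on a 7-core machine. Once it stops, it can be worth measuring a few thread counts between the returned value and the one that degraded.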
Reasoning
The performance documentation demonstrates this empirically on a 7-core CPU with an A6000 GPU:
| Configuration | Tokens/sec |
|---|---|
| -ngl 2000000 only (no -t) | < 0.1 |
| -t 7 only (no GPU) | 1.7 |
| -t 1 -ngl 2000000 | 5.5 |
| -t 7 -ngl 2000000 | 8.7 |
| -t 4 -ngl 2000000 | 9.1 (optimal) |
With 7 physical cores and GPU offloading, -t 4 (fewer than the number of physical cores) gives the best throughput. The CPU handles only the non-offloaded operations and data transfer, so saturating all cores causes contention.
The code also warns when CPU mask bits are insufficient:
From common/common.cpp:282-284:
if (n_set && n_set < cpuparams.n_threads) {
    LOG_WRN("Not enough set bits in CPU mask (%d) to satisfy requested thread count: %d\n",
            n_set, cpuparams.n_threads);
}
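For readers who want to experiment with this check in isolation, the following standalone sketch reproduces its logic. The struct CpuParamsSketch, the constant kMaxThreads, and both function names are hypothetical stand-ins and are not llama.cpp's definitions. The sketch also assumes that the `n_set &&` guard means "no mask was given", since the warning is skipped when no bits are set.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdio>

constexpr std::size_t kMaxThreads = 16;  // hypothetical mask width

// Hypothetical stand-in for the real parameter struct.
struct CpuParamsSketch {
    int n_threads = 0;
    std::array<bool, kMaxThreads> cpumask{};  // true = CPU is allowed
};

// Number of CPUs enabled in the mask.
int count_set_bits(const CpuParamsSketch & p) {
    int n_set = 0;
    for (bool bit : p.cpumask) {
        if (bit) {
            ++n_set;
        }
    }
    return n_set;
}

// Returns false (and warns) when a non-empty mask enables fewer CPUs
// than the requested thread count; an empty mask is treated as
// "no restriction", following the n_set && guard in the original check.
bool mask_covers_threads(const CpuParamsSketch & p) {
    assert(p.n_threads >= 0);  // sketch assumes the count is already resolved
    const int n_set = count_set_bits(p);
    if (n_set && n_set < p.n_threads) {
        std::fprintf(stderr,
                     "Not enough set bits in CPU mask (%d) to satisfy "
                     "requested thread count: %d\n",
                     n_set, p.n_threads);
        return false;
    }
    return true;
}
```

A false result corresponds to the oversubscribed case described earlier: more threads than the CPUs they are allowed to run on.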
The default thread count logic from common/common.cpp:272:
cpuparams.n_threads = cpu_get_num_math();