Heuristic:Ggml org Llama cpp Thread Count Tuning

From Leeroopedia
Knowledge Sources
Domains: Optimization, CPU_Performance
Last Updated: 2026-02-14 22:00 GMT

Overview

Critical CPU thread tuning heuristic for llama.cpp: setting the thread count (-t) too high oversaturates the CPU and sharply reduces token generation speed.

Description

CPU thread oversaturation is one of the most common performance bottlenecks in llama.cpp. When the thread count exceeds the number of physical CPU cores, threads compete for the same cores, context-switching overhead eats into compute time, and performance degrades sharply. By default the thread count comes from cpu_get_num_math(), which returns the number of math-capable cores, but even this value can be too high in some configurations (e.g., shared systems, NUMA nodes, or when GPU offloading handles most of the work).
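As a rough illustration of the physical-versus-logical distinction, the sketch below estimates a starting cap for -t from the logical CPU count. It is not how cpu_get_num_math() works; the function names and the smt_width parameter are hypothetical, and the estimate assumes every physical core exposes the same number of logical CPUs, which does not hold on hybrid (performance/efficiency core) processors.

```cpp
#include <algorithm>
#include <thread>

// Assumption: each physical core exposes `smt_width` logical CPUs
// (commonly 2 with hyper-threading enabled, 1 with SMT disabled).
unsigned estimate_physical_cores(unsigned logical_cpus, unsigned smt_width) {
    if (logical_cpus == 0 || smt_width == 0) {
        return 1;  // count unknown: fall back to a single thread
    }
    return std::max(1u, logical_cpus / smt_width);
}

// Suggested upper bound for -t on the current machine.
unsigned suggested_thread_cap(unsigned smt_width = 2) {
    // hardware_concurrency() reports logical CPUs and may return 0 if unknown.
    return estimate_physical_cores(std::thread::hardware_concurrency(), smt_width);
}
```

Treat the result only as an upper bound to start measuring from; the operating system's own topology tools give a more reliable physical core count.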

Usage

Use this heuristic when token generation speed is unexpectedly slow or when configuring thread counts for inference. This is especially important when combining CPU threads with GPU offloading (-ngl), where fewer CPU threads may actually yield better throughput.

The Insight (Rule of Thumb)

  • Action: Set -t N to at most the number of physical CPU cores (not logical/hyperthreaded cores).
  • Value: To find the best N, start with -t 1, double it until token generation speed stops improving, then scale back to the fastest value measured.
  • Trade-off: Too few threads underutilize the CPU; too many cause context-switch overhead.
  • With GPU: When offloading layers with -ngl, fewer CPU threads are often optimal because the GPU handles most of the work.
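The start-at-one, double, then scale-back procedure can be sketched as a small search routine. This is a minimal sketch, not code from llama.cpp: tune_thread_count and the MeasureFn callback are hypothetical names, and the callback is assumed to run one benchmark at a given thread count and return tokens/sec.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

// Hypothetical callback: runs one benchmark with the given thread count and
// returns tokens/sec (e.g. by launching llama.cpp with -t <n>). It is an
// assumption of this sketch, not part of llama.cpp.
using MeasureFn = std::function<double(int)>;

// Returns the fastest thread count found in [1, max_threads].
int tune_thread_count(int max_threads, const MeasureFn & measure) {
    assert(max_threads >= 1);

    int    best_threads = 1;
    double best_speed   = measure(1);

    // Phase 1: double the thread count until throughput stops improving.
    int n = 2;
    for (; n <= max_threads; n *= 2) {
        const double speed = measure(n);
        if (speed <= best_speed) {
            break;  // performance degraded: stop doubling
        }
        best_threads = n;
        best_speed   = speed;
    }

    // Phase 2: scale back. Probe the counts between the best doubling step
    // and the first slower count (or max_threads if no step was slower).
    const int lower           = best_threads + 1;
    const int upper_exclusive = std::min(n, max_threads + 1);
    for (int k = lower; k < upper_exclusive; ++k) {
        const double speed = measure(k);
        if (speed > best_speed) {
            best_threads = k;
            best_speed   = speed;
        }
    }
    return best_threads;
}
```

Passing the physical core count as max_threads keeps the search within the bound from the Action item. Real measurements are noisy, so each call to the callback would ideally average several runs.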

Reasoning

The performance documentation demonstrates this empirically on a 7-core CPU with A6000 GPU:

| Configuration | Tokens/sec |
|---|---|
| -ngl 2000000 only (no -t) | < 0.1 |
| -t 7 only (no GPU) | 1.7 |
| -t 1 -ngl 2000000 | 5.5 |
| -t 7 -ngl 2000000 | 8.7 |
| -t 4 -ngl 2000000 | 9.1 (optimal) |

With 7 physical cores and GPU offloading, -t 4 gives the best performance (9.1 tokens/sec), ahead of -t 7 (8.7 tokens/sec), even though 4 is below the physical core count. The CPU handles only the non-offloaded operations and data transfer, so running a thread on every core adds contention without adding useful work.
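To make the comparison concrete, the sketch below picks the fastest configuration from the measurements in the table and computes relative speedups. The Run struct and the function names are illustrative only; the numbers are the ones reported above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Run {
    std::string flags;
    double      tokens_per_sec;
};

// Measurements copied from the table above. The "-ngl 2000000 only" row is
// omitted because it is reported only as "< 0.1", not as an exact value.
std::vector<Run> documented_runs() {
    return {
        {"-t 7", 1.7},
        {"-t 1 -ngl 2000000", 5.5},
        {"-t 7 -ngl 2000000", 8.7},
        {"-t 4 -ngl 2000000", 9.1},
    };
}

// Returns the run with the highest throughput; `runs` must not be empty.
Run fastest_run(const std::vector<Run> & runs) {
    assert(!runs.empty());
    return *std::max_element(
        runs.begin(), runs.end(),
        [](const Run & a, const Run & b) { return a.tokens_per_sec < b.tokens_per_sec; });
}

// Ratio of two throughputs, e.g. speedup(9.1, 8.7) for -t 4 versus -t 7.
double speedup(double faster, double slower) {
    assert(slower > 0.0);
    return faster / slower;
}
```

With offloading, -t 4 is roughly 4.6% faster than -t 7 (9.1 / 8.7 ≈ 1.046) and about 5.4 times faster than CPU-only -t 7 (9.1 / 1.7 ≈ 5.35).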

The code also warns when the CPU affinity mask has fewer set bits than the requested thread count:

From common/common.cpp:282-284:

if (n_set && n_set < cpuparams.n_threads) {
    LOG_WRN("Not enough set bits in CPU mask (%d) to satisfy requested thread count: %d\n",
            n_set, cpuparams.n_threads);
}
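The check above compares a count of set mask bits against the thread count. A self-contained sketch of that comparison follows; the mask type, kMaxThreads, and the function names are stand-ins chosen for illustration, not the definitions used in common/common.cpp.

```cpp
#include <array>
#include <cstddef>

// Hypothetical mask size for this sketch; the real limit is defined elsewhere.
constexpr std::size_t kMaxThreads = 16;
using CpuMask = std::array<bool, kMaxThreads>;

// Counts how many CPUs are enabled in the mask.
int count_set_bits(const CpuMask & mask) {
    int n_set = 0;
    for (bool bit : mask) {
        if (bit) {
            ++n_set;
        }
    }
    return n_set;
}

// Same condition as the quoted snippet: a mask with no bits set never triggers
// the warning (read here as "no mask requested", which is an assumption of
// this sketch), and a non-empty mask warns when it has fewer set bits than threads.
bool mask_too_small(const CpuMask & mask, int n_threads) {
    const int n_set = count_set_bits(mask);
    return n_set != 0 && n_set < n_threads;
}
```

For example, a mask that enables two CPUs combined with a request for four threads satisfies the condition, which is the situation the warning reports.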

The default thread count logic from common/common.cpp:272:

cpuparams.n_threads = cpu_get_num_math();
