Heuristic: ggml-org/ggml Thread Count Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-10 07:40 GMT |
Overview
Thread count defaults in GGML: global default is 4 threads, inference examples use min(4, hardware_concurrency), and training uses min(cores, (cores+4)/2) to balance utilization and system responsiveness.
Description
GGML uses several thread count heuristics depending on context. The core library defaults to 4 threads (GGML_DEFAULT_N_THREADS). The GPT-2 inference examples cap the count at min(4, hardware_concurrency) to avoid overwhelming the system. The MNIST training example uses a more adaptive formula, min(logical_cores, (logical_cores + 4) / 2) with integer division, which scales with available cores while reserving some for system tasks. The hard upper limit is 512 threads (GGML_MAX_N_THREADS).
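For reference, the three heuristics can be condensed into a standalone sketch. The constants and formulas mirror the source locations quoted under Code Evidence; the helper function names are illustrative, not part of the GGML API:

```cpp
#include <algorithm>
#include <cstdint>

// Values mirrored from include/ggml.h (see Code Evidence).
constexpr int GGML_DEFAULT_N_THREADS = 4;
constexpr int GGML_MAX_N_THREADS     = 512;

// Inference-example cap: never more than 4 threads, never more than
// the number of logical cores the runtime reports.
int32_t inference_threads(int32_t hw_concurrency) {
    return std::min<int32_t>(4, hw_concurrency);
}

// MNIST training formula: scales with core count but holds back
// roughly half of the cores on large machines (integer division).
int training_threads(int ncores_logical) {
    return std::min(ncores_logical, (ncores_logical + 4) / 2);
}
```

Note that `std::thread::hardware_concurrency()` is permitted by the C++ standard to return 0 when the core count cannot be determined, in which case the inference cap would also evaluate to 0, so callers may want to guard against a zero result.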
Usage
Use this heuristic when configuring thread counts for GGML operations. The defaults are conservative and prioritize system responsiveness over maximum throughput. For dedicated inference servers, consider increasing the thread count to match available physical cores. For training workloads, the (cores+4)/2 formula provides a good starting point.
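This guidance can be sketched as a single selection helper. The `Workload` enum and `pick_threads` function are our own illustration of the policy described above; GGML itself only exposes an `n_threads` parameter, not this selection logic:

```cpp
#include <algorithm>

enum class Workload { Interactive, DedicatedInference, Training };

// Hypothetical thread-count policy following the usage guidance above.
int pick_threads(Workload w, int logical_cores, int physical_cores) {
    switch (w) {
        case Workload::Interactive:
            // Conservative default: leave cores free for the rest of the system.
            return std::min(4, logical_cores);
        case Workload::DedicatedInference:
            // Dedicated server: match physical cores to avoid SMT contention.
            return physical_cores;
        case Workload::Training:
            // Scale with cores but keep headroom for system tasks.
            return std::min(logical_cores, (logical_cores + 4) / 2);
    }
    return 4;
}
```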
The Insight (Rule of Thumb)
- Action: Use the default thread count for interactive applications. Override with `n_threads` parameter for dedicated workloads.
- Value: Default is 4 threads (inference) or roughly 50–75% of logical cores (training; (cores + 4) / 2 yields 75% at 8 cores and approaches 50% as core counts grow). Maximum is 512 threads.
- Trade-off: More threads increase throughput but may cause contention and reduce per-thread cache efficiency. On systems with SMT/hyperthreading, using more threads than physical cores often hurts performance.
- Scaling examples:
- 4-core system: inference=4, training=4
- 8-core system: inference=4, training=6
- 16-core system: inference=4, training=10
Reasoning
The conservative default of 4 threads reflects several considerations:
- Cache thrashing: Too many threads on matrix multiplication operations can cause L1/L2 cache eviction, reducing per-thread efficiency.
- System responsiveness: Inference often runs alongside other workloads. Reserving cores prevents UI/system lag.
- Diminishing returns: For smaller models (like GPT-2 117M), adding more threads beyond 4 provides minimal speedup due to synchronization overhead.
- Training formula: The `(cores + 4) / 2` formula ensures at least 4 threads on 4+ core systems while leaving headroom on larger systems.
Code Evidence
Global default from `include/ggml.h:232-233`:
```c
#define GGML_DEFAULT_N_THREADS 4
#define GGML_DEFAULT_GRAPH_SIZE 2048
```
Inference thread cap from `examples/common.h:20`:
```cpp
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
```
Training thread formula from `examples/mnist/mnist-common.cpp:68-69`:
```cpp
const int ncores_logical = std::thread::hardware_concurrency();
const int nthreads = std::min(ncores_logical, (ncores_logical + 4) / 2);
```
Maximum thread limit from `include/ggml.h:225`:
```c
#define GGML_MAX_N_THREADS 512
```