Heuristic: ggml-org/ggml Thread Count Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-10 07:40 GMT |
Overview
Thread count defaults in GGML: global default is 4 threads, inference examples use min(4, hardware_concurrency), and training uses min(cores, (cores+4)/2) to balance utilization and system responsiveness.
Description
GGML uses several thread count heuristics depending on context. The core library defaults to 4 threads (GGML_DEFAULT_N_THREADS). The GPT-2 inference examples cap the count at min(4, hardware_concurrency) to avoid overwhelming the system. The MNIST training example uses a more adaptive formula, min(logical_cores, (logical_cores + 4) / 2) with integer division, which scales with available cores while reserving some for system tasks. The hard upper limit is 512 threads (GGML_MAX_N_THREADS).
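For reference, the three heuristics can be condensed into a standalone sketch. The constants and formulas mirror the source locations quoted under Code Evidence; the helper function names are illustrative, not part of the GGML API:

```cpp
#include <algorithm>
#include <cstdint>

// Values mirrored from include/ggml.h (see Code Evidence).
constexpr int GGML_DEFAULT_N_THREADS = 4;
constexpr int GGML_MAX_N_THREADS     = 512;

// Inference-example cap: never more than 4 threads, never more than
// the number of logical cores the runtime reports.
int32_t inference_threads(int32_t hw_concurrency) {
    return std::min<int32_t>(4, hw_concurrency);
}

// MNIST training formula: scales with core count but holds back
// roughly half of the cores on large machines (integer division).
int training_threads(int ncores_logical) {
    return std::min(ncores_logical, (ncores_logical + 4) / 2);
}
```

Note that `std::thread::hardware_concurrency()` is permitted by the C++ standard to return 0 when the core count cannot be determined, in which case the inference cap would also evaluate to 0, so callers may want to guard against a zero result.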
Usage
Use this heuristic when configuring thread counts for GGML operations. The defaults are conservative and prioritize system responsiveness over maximum throughput. For dedicated inference servers, consider increasing the thread count to match available physical cores. For training workloads, the (cores+4)/2 formula provides a good starting point.
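This guidance can be sketched as a single selection helper. The `Workload` enum and `pick_threads` function are our own illustration of the policy described above; GGML itself only exposes an `n_threads` parameter, not this selection logic:

```cpp
#include <algorithm>

enum class Workload { Interactive, DedicatedInference, Training };

// Hypothetical thread-count policy following the usage guidance above.
int pick_threads(Workload w, int logical_cores, int physical_cores) {
    switch (w) {
        case Workload::Interactive:
            // Conservative default: leave cores free for the rest of the system.
            return std::min(4, logical_cores);
        case Workload::DedicatedInference:
            // Dedicated server: match physical cores to avoid SMT contention.
            return physical_cores;
        case Workload::Training:
            // Scale with cores but keep headroom for system tasks.
            return std::min(logical_cores, (logical_cores + 4) / 2);
    }
    return 4;
}
```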
The Insight (Rule of Thumb)
- Action: Use the default thread count for interactive applications. Override with `n_threads` parameter for dedicated workloads.
- Value: Default is 4 threads (inference) or roughly 50–75% of logical cores (training; (cores + 4) / 2 yields 75% at 8 cores and approaches 50% as core counts grow). Maximum is 512 threads.
- Trade-off: More threads increase throughput but may cause contention and reduce per-thread cache efficiency. On systems with SMT/hyperthreading, using more threads than physical cores often hurts performance.
- Scaling examples:
- 4-core system: inference=4, training=4
- 8-core system: inference=4, training=6
- 16-core system: inference=4, training=10
Reasoning
The conservative default of 4 threads reflects several considerations:
- Cache thrashing: Too many threads on matrix multiplication operations can cause L1/L2 cache eviction, reducing per-thread efficiency.
- System responsiveness: Inference often runs alongside other workloads. Reserving cores prevents UI/system lag.
- Diminishing returns: For smaller models (like GPT-2 117M), adding more threads beyond 4 provides minimal speedup due to synchronization overhead.
- Training formula: The `(cores + 4) / 2` formula ensures at least 4 threads on 4+ core systems while leaving headroom on larger systems.
Code Evidence
Global default from `include/ggml.h:232-233`:
```c
#define GGML_DEFAULT_N_THREADS 4
#define GGML_DEFAULT_GRAPH_SIZE 2048
```
Inference thread cap from `examples/common.h:20`:
```cpp
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
```
Training thread formula from `examples/mnist/mnist-common.cpp:68-69`:
```cpp
const int ncores_logical = std::thread::hardware_concurrency();
const int nthreads = std::min(ncores_logical, (ncores_logical + 4) / 2);
```
Maximum thread limit from `include/ggml.h:225`:
```c
#define GGML_MAX_N_THREADS 512
```