
Heuristic:Tencent Ncnn Thread Count Tuning

From Leeroopedia



Knowledge Sources
Domains: Optimization, Threading
Last Updated: 2026-02-09 19:00 GMT

Overview

Thread count and CPU affinity tuning guide for ncnn on ARM big.LITTLE and multi-core systems, addressing high CPU usage caused by OpenMP spin-waiting.

Description

ncnn defaults to using only the big (performance) CPU cores, not all cores. This is a deliberate design choice for ARM big.LITTLE architectures where small cores would slow down inference. The default thread count is set by `get_physical_big_cpu_count()`. OpenMP threads busy-wait (spin) for 20ms by default before sleeping, which causes high CPU usage even when idle. Thread affinity binding via `set_cpu_powersave()` controls which cores are used. On x86 desktop systems, thread count should generally be set to half the core count or less, and never exceed 8 threads with Clang's libomp or 4 threads with other OpenMP implementations.
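The x86 rule of thumb above (half the core count, capped by the OpenMP runtime) can be expressed as a small helper. This is an illustrative sketch; `recommended_threads` is a hypothetical name, not part of ncnn's API:

```cpp
#include <algorithm>

// Hypothetical helper applying the x86 desktop rule of thumb:
// start from half the physical core count, then cap at the OpenMP
// implementation's safe ceiling (8 for Clang's libomp, 4 otherwise).
int recommended_threads(int physical_cores, bool clang_libomp)
{
    int cap = clang_libomp ? 8 : 4;
    return std::max(1, std::min(physical_cores / 2, cap));
}
```

For example, a 16-core desktop with Clang's libomp would get 8 threads, while the same machine with libgomp would get 4.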

Usage

Use this heuristic when CPU usage is unexpectedly high during or after inference, when deploying on ARM big.LITTLE mobile devices (e.g., Snapdragon, Exynos, MediaTek), or when battery/power consumption is a concern. Also applicable when running multiple ncnn instances concurrently.

The Insight (Rule of Thumb)

  • Action 1: Bind threads to specific cores via `ncnn::set_cpu_powersave()`. Values: 0=all, 1=little cores only, 2=big cores only.
  • Action 2: Reduce thread count via `net.opt.num_threads`. Do not exceed 8 (Clang libomp) or 4 (other OpenMP).
  • Action 3: Set `net.opt.openmp_blocktime = 0` to disable spin-waiting (Clang libomp only). For libgomp, set `OMP_WAIT_POLICY=PASSIVE`.
  • Action 4: If using Vulkan GPU inference, consider `-DNCNN_OPENMP=OFF` at build time.
  • Trade-off: Fewer threads = lower power consumption but slower inference. Big cores only = 20-30% less throughput than all cores but 2x power saving. Disabling spin-wait = huge power saving with minimal latency impact.
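The actions above can be combined into a minimal ncnn setup. This is a sketch assuming the standard ncnn `net.h`/`cpu.h` headers, not a drop-in recipe; model paths are placeholders:

```cpp
#include "net.h"   // ncnn::Net and its Option
#include "cpu.h"   // ncnn::set_cpu_powersave, ncnn::get_physical_big_cpu_count

int main()
{
    // Action 1: bind worker threads to the big cores (0=all, 1=little, 2=big)
    ncnn::set_cpu_powersave(2);

    ncnn::Net net;

    // Action 2: set the thread count explicitly; ncnn's default is
    // already the physical big-core count
    net.opt.num_threads = ncnn::get_physical_big_cpu_count();

    // Action 3: Clang libomp only -- sleep immediately instead of
    // spin-waiting for 20ms after each parallel region
    net.opt.openmp_blocktime = 0;

    // net.load_param("model.param");  // placeholder model files
    // net.load_model("model.bin");
    return 0;
}
```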

Reasoning

OpenMP thread pools create persistent threads that busy-wait for work. The default spin time (200ms in vanilla OpenMP, reduced to 20ms by ncnn) causes measurable CPU usage even when no inference is running. On ARM big.LITTLE systems, scheduling inference threads on little cores dramatically hurts throughput because little cores may be 3-4x slower than big cores for compute-bound workloads. ncnn's default of big-cores-only is the result of extensive mobile benchmarking. Even with Vulkan GPU acceleration, OpenMP threads are still used for model loading and fp32-to-fp16 conversion.
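For libgomp (GCC's OpenMP runtime), which does not honor `openmp_blocktime`, the spin-wait can instead be disabled from the environment before launching the process; `./your_app` is a placeholder binary name:

```shell
# Tell libgomp worker threads to sleep as soon as they run out of work,
# instead of busy-waiting on the CPU between parallel regions.
export OMP_WAIT_POLICY=PASSIVE
./your_app
```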

Code evidence from `src/option.cpp:17`:

num_threads = get_physical_big_cpu_count();

OpenMP blocktime default from `src/option.cpp:28`:

openmp_blocktime = 20;

Environment variables set by ncnn (from `src/cpu.cpp`):

// ncnn disables OpenMP's built-in affinity to manage it directly
setenv("KMP_AFFINITY", "disabled", 1);
setenv("KMP_DUPLICATE_LIB_OK", "1", 1);
