Heuristic: Microsoft ONNX Runtime Threading Configuration Tips
| Field | Value |
|---|---|
| Sources | docs/FAQ.md (L31-68), docs/NotesOnThreading.md |
| Domains | Inference, Threading, Performance Tuning, CPU Optimization |
| Last Updated | 2026-02-10 |
Overview
Configure ONNX Runtime threading to match your deployment scenario, whether single-threaded for latency-sensitive serving or multi-threaded for throughput-oriented batch processing.
Description
ONNX Runtime uses two distinct threading mechanisms for parallelism:
- Intra-op parallelism -- parallelism within a single operator (e.g., parallelizing a matrix multiplication across cores). This can use either OpenMP or ORT's built-in threadpool, depending on build configuration.
- Inter-op parallelism -- parallelism between independent operators that can execute concurrently. This always uses ORT's own threadpool, never OpenMP.
The choice between OpenMP and ORT threadpool for intra-op parallelism is determined at build time via the --use_openmp build flag. At runtime, the threading behavior is controlled through session options and environment variables. Getting threading configuration wrong can lead to thread over-subscription (too many threads competing for cores), degraded latency in single-request serving scenarios, or underutilized hardware in batch processing.
ORT provides thread abstractions for operator developers: TryParallelFor, TrySimpleParallelFor, TryBatchParallelFor, ShouldParallelize, and DegreeOfParallelism. These static methods abstract over the different implementation choices (ORT threadpool, OpenMP, or sequential execution) and should always be used instead of direct OpenMP pragmas.
Usage
Use this heuristic when:
- Deploying ONNX Runtime for single-threaded inference (e.g., serverless, edge devices, or latency-sensitive endpoints).
- Configuring multi-threaded batch inference on multi-core servers.
- Diagnosing thread contention or unexpectedly slow inference.
- Developing new operators and needing to add parallelism.
The Insight (Rule of Thumb)
For single-threaded execution:
- If ORT was built with OpenMP: set `OMP_NUM_THREADS=1` as an environment variable. The default `inter_op_num_threads` is already 1.
- If ORT was built without OpenMP: set `intra_op_num_threads=1` in session options. Do not change the default `inter_op_num_threads` (which is 1).
- For strictly sequential execution, additionally set `execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL`.
Recommendation: Build ONNX Runtime without OpenMP if you only need single-threaded execution. This avoids the overhead of OpenMP thread management entirely.
For multi-threaded execution:
- Set `intra_op_num_threads` to the number of physical cores (not logical cores) available.
- Increase `inter_op_num_threads` only if the model has many independent operator subgraphs that can execute concurrently.
- Use `ThreadPool::ParallelSection` to amortize loop entry/exit costs when an operator needs multiple parallel loops.
For operator developers:
- Do NOT write
#ifdef pragma ompin operator code. Use the threading abstractions (TryParallelFor,TrySimpleParallelFor,TryBatchParallelFor,ShouldParallelize) fromthreadpool.handthread_utils.h. - These abstractions automatically select the correct backend (OpenMP, ORT threadpool, or sequential) based on build configuration and runtime state.
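The dispatch pattern behind these abstractions can be illustrated with a conceptual Python analogue (this is not ORT's actual C++ API; the helper name and the cost threshold are illustrative). The caller never branches on the backend: the helper falls back to a plain sequential loop when no threadpool is available or the work is too small to be worth parallelizing.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative cost model: below this many items, threading overhead
# outweighs any speedup, so run sequentially.
MIN_ITEMS_FOR_PARALLEL = 1000

def try_simple_parallel_for(pool, total, fn):
    """Run fn(i) for i in range(total), in parallel only when it pays off."""
    if pool is None or total < MIN_ITEMS_FOR_PARALLEL:
        for i in range(total):  # sequential fallback
            fn(i)
    else:
        list(pool.map(fn, range(total)))  # drain the lazy map

# Usage: each index writes a distinct slot, so either backend yields
# the same result.
out = [0] * 2000
with ThreadPoolExecutor(max_workers=4) as pool:
    try_simple_parallel_for(pool, len(out), lambda i: out.__setitem__(i, i * i))
```

The real C++ abstractions make the analogous choice at compile time (OpenMP vs. ORT threadpool) and at runtime (threadpool present and work large enough vs. sequential).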
Example (Python, single-threaded):
```python
import os
os.environ["OMP_NUM_THREADS"] = "1"  # only affects OpenMP builds; set before importing onnxruntime

import onnxruntime as ort

opts = ort.SessionOptions()
opts.inter_op_num_threads = 1
opts.intra_op_num_threads = 1
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
session = ort.InferenceSession("model.onnx", sess_options=opts)
```
Example (C++, single-threaded):
```cpp
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");
Ort::SessionOptions session_options;
session_options.SetInterOpNumThreads(1);
session_options.SetIntraOpNumThreads(1);
Ort::Session session(env, model_path, session_options);
```
Reasoning
Thread over-subscription is one of the most common performance pitfalls when deploying ONNX Runtime. When multiple threads compete for a limited number of cores, context-switching overhead dominates and inference latency increases. This is particularly problematic in serving scenarios where a web server already uses multiple threads or processes, each of which spawns its own ORT session with the default thread count. By explicitly setting thread counts, users avoid this contention.

The separation of intra-op and inter-op threading gives users fine-grained control: intra-op parallelism helps individual heavy operators (like large matrix multiplications) complete faster, while inter-op parallelism allows independent branches of the computation graph to execute simultaneously.

For most serving scenarios, keeping inter-op threads at 1 (sequential execution between ops) and tuning intra-op threads to match available cores provides the best latency. For throughput-oriented batch processing, increasing inter-op threads can improve utilization of independent subgraphs.