Heuristic:Microsoft Onnxruntime Threading Configuration Tips

From Leeroopedia




Field Value
Sources docs/FAQ.md (L31-68), docs/NotesOnThreading.md
Domains Inference, Threading, Performance Tuning, CPU Optimization
Last Updated 2026-02-10

Overview

Configure ONNX Runtime threading to match your deployment scenario, whether single-threaded for latency-sensitive serving or multi-threaded for throughput-oriented batch processing.

Description

ONNX Runtime uses two distinct threading mechanisms for parallelism:

  • Intra-op parallelism -- parallelism within a single operator (e.g., parallelizing a matrix multiplication across cores). This can use either OpenMP or ORT's built-in threadpool, depending on build configuration.
  • Inter-op parallelism -- parallelism between independent operators that can execute concurrently. This always uses ORT's own threadpool, never OpenMP.

The choice between OpenMP and ORT threadpool for intra-op parallelism is determined at build time via the --use_openmp build flag. At runtime, the threading behavior is controlled through session options and environment variables. Getting threading configuration wrong can lead to thread over-subscription (too many threads competing for cores), degraded latency in single-request serving scenarios, or underutilized hardware in batch processing.
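To make the over-subscription risk concrete, here is a back-of-the-envelope check (plain Python arithmetic, not an ONNX Runtime API; the worker and core counts are illustrative): if W worker processes each open a session that uses T intra-op threads, the total demand is W × T compute threads.

```python
def is_oversubscribed(workers: int, intra_op_threads: int, physical_cores: int) -> bool:
    """Return True if combined thread demand exceeds the core budget.

    Each worker process that opens an ORT session contributes
    intra_op_threads compute threads; once the total exceeds the
    physical core count, threads compete for cores and
    context-switch overhead grows.
    """
    return workers * intra_op_threads > physical_cores

# 8 web-server workers, each defaulting to 16 intra-op threads on a
# 16-core machine, demand 128 threads for 16 cores:
print(is_oversubscribed(workers=8, intra_op_threads=16, physical_cores=16))  # True
```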

ORT provides thread abstractions for operator developers: TryParallelFor, TrySimpleParallelFor, TryBatchParallelFor, ShouldParallelize, and DegreeOfParallelism. These static methods abstract over the different implementation choices (ORT threadpool, OpenMP, or sequential execution) and should always be used instead of direct OpenMP pragmas.

Usage

Use this heuristic when:

  • Deploying ONNX Runtime for single-threaded inference (e.g., serverless, edge devices, or latency-sensitive endpoints).
  • Configuring multi-threaded batch inference on multi-core servers.
  • Diagnosing thread contention or unexpectedly slow inference.
  • Developing new operators and needing to add parallelism.

The Insight (Rule of Thumb)

For single-threaded execution:

  • If ORT was built with OpenMP: set OMP_NUM_THREADS=1 as an environment variable. The default inter_op_num_threads is already 1.
  • If ORT was built without OpenMP: set intra_op_num_threads=1 in session options. Do not change the default inter_op_num_threads (which is 1).
  • For strictly sequential execution, additionally set execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL.

Recommendation: Build ONNX Runtime without OpenMP if you only need single-threaded execution. This avoids the overhead of OpenMP thread management entirely.

For multi-threaded execution:

  • Set intra_op_num_threads to the number of physical cores (not logical cores) available.
  • Increase inter_op_num_threads only if the model has many independent operator subgraphs that can execute concurrently.
  • Use ThreadPool::ParallelSection to amortize loop entry/exit costs when an operator needs multiple parallel loops.
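The first bullet's sizing rule can be sketched in pure Python (the 2-way SMT factor is an assumption about typical hyper-threaded hardware, not something ORT reports; a library such as psutil can query the physical count directly):

```python
import os

def estimate_physical_cores(smt_factor: int = 2) -> int:
    """Approximate the physical core count for intra_op_num_threads.

    os.cpu_count() reports *logical* cores; dividing by the SMT factor
    (2 on typical hyper-threaded x86 -- an assumption, verify for your
    hardware) approximates the physical core count.
    """
    logical = os.cpu_count() or 1
    return max(1, logical // smt_factor)

# The result would then be assigned to SessionOptions.intra_op_num_threads.
```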

For operator developers:

  • Do NOT write raw OpenMP directives (#pragma omp, or #ifdef _OPENMP blocks) in operator code. Use the threading abstractions (TryParallelFor, TrySimpleParallelFor, TryBatchParallelFor, ShouldParallelize) from threadpool.h and thread_utils.h.
  • These abstractions automatically select the correct backend (OpenMP, ORT threadpool, or sequential) based on build configuration and runtime state.

Example (Python, single-threaded):

import os
# Must be set before onnxruntime is imported, otherwise the OpenMP
# runtime has already sized its thread pool (OpenMP builds only).
os.environ["OMP_NUM_THREADS"] = "1"
import onnxruntime as ort

opts = ort.SessionOptions()
opts.inter_op_num_threads = 1  # no parallelism between independent operators
opts.intra_op_num_threads = 1  # no parallelism within an operator (non-OpenMP builds)
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL  # run operators strictly in order

session = ort.InferenceSession("model.onnx", sess_options=opts)

Example (C++, single-threaded):

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");
Ort::SessionOptions session_options;
session_options.SetInterOpNumThreads(1);  // no parallelism between independent operators
session_options.SetIntraOpNumThreads(1);  // no parallelism within an operator
Ort::Session session(env, model_path, session_options);  // model_path: path to the .onnx file

Reasoning

Thread over-subscription is one of the most common performance pitfalls when deploying ONNX Runtime. When multiple threads compete for a limited number of cores, context-switching overhead dominates and inference latency increases. This is particularly problematic in serving scenarios where a web server already uses multiple threads or processes, each of which spawns its own ORT session with the default thread count. By explicitly setting thread counts, users avoid this contention.

The separation of intra-op and inter-op threading gives users fine-grained control: intra-op parallelism helps individual heavy operators (like large matrix multiplications) complete faster, while inter-op parallelism allows independent branches of the computation graph to execute simultaneously.

For most serving scenarios, keeping inter-op threads at 1 (sequential execution between ops) and tuning intra-op threads to match available cores provides the best latency. For throughput-oriented batch processing, increasing inter-op threads can improve utilization of independent subgraphs.
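The multi-worker serving case above suggests a simple per-worker core split (a sketch in plain Python, not an ONNX Runtime API; the counts are illustrative):

```python
def per_worker_intra_op_threads(physical_cores: int, num_workers: int) -> int:
    """Split the physical cores evenly across serving workers.

    Setting each worker's session to this value keeps the combined
    intra-op thread count within the machine's core budget, avoiding
    the contention described above.
    """
    return max(1, physical_cores // num_workers)

# 4 workers on a 16-core machine: give each session 4 intra-op threads.
print(per_worker_intra_op_threads(16, 4))  # 4
```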
