Principle: Microsoft ONNX Runtime Inference Session Creation
Metadata
| Field | Value |
|---|---|
| Principle Name | Inference_Session_Creation |
| Repository | Microsoft_Onnxruntime |
| Source Repository | https://github.com/microsoft/onnxruntime |
| Domain | ML_Inference, Model_Optimization |
| Last Updated | 2026-02-10 |
| Workflow | Python_Inference_Pipeline |
| Pair | 2 of 6 |
Overview
Creation of a runtime session that loads an ONNX model and prepares it for inference with specified execution providers.
Description
InferenceSession is the primary entry point for running inference with ONNX Runtime. It loads an ONNX model, applies graph optimizations, selects execution providers, and prepares the model for efficient execution. The session manages the lifecycle of the model from loading through graph optimization, memory planning, and operator kernel selection.
The InferenceSession constructor accepts:
- model_path (str or bytes) -- Path to an ONNX model file, or a serialized ONNX model as bytes.
- sess_options (SessionOptions, optional) -- A configured SessionOptions object controlling optimization and profiling behavior.
- providers (list[str], optional) -- An ordered list of execution providers to use. ONNX Runtime will attempt to place each operator on the first provider that supports it, falling back to subsequent providers in list order.
The InferenceSession class is implemented in onnxruntime.capi.onnxruntime_inference_collection and re-exported from the top-level package at onnxruntime/__init__.py:L82.
Theoretical Basis
The session encapsulates several critical phases of model preparation:
- Model Loading -- The ONNX protobuf is deserialized and the computational graph is constructed in memory.
- Graph Optimization -- Based on the configured optimization level, transformation passes are applied to the graph (constant folding, operator fusion, layout optimization).
- Execution Provider Selection -- Each node in the graph is assigned to an execution provider (CPU, CUDA, TensorRT, etc.) based on operator support and provider priority.
- Memory Planning -- Memory allocation patterns are computed to minimize peak memory usage and enable buffer reuse.
- Kernel Selection -- For each node, the appropriate operator kernel implementation is selected based on the assigned execution provider and data types.
The execution provider list follows a priority-based fallback model. Providers listed first have highest priority. If a provider cannot handle a particular operator, the next provider in the list is tried. CPUExecutionProvider is typically listed last as a universal fallback.
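The fallback model can be sketched in plain Python. This is an illustrative simplification, not ONNX Runtime's actual graph partitioner, and the per-provider operator tables below are hypothetical:

```python
# Hypothetical capability tables: which operator types each provider supports.
SUPPORTED_OPS = {
    "CUDAExecutionProvider": {"Conv", "MatMul", "Relu"},
    "CPUExecutionProvider": {"Conv", "MatMul", "Relu", "TopK", "NonZero"},
}

def assign_providers(node_ops, providers):
    """Assign each node's op to the first provider (in priority order)
    that supports it -- the first-match fallback model described above."""
    assignment = {}
    for op in node_ops:
        for provider in providers:
            if op in SUPPORTED_OPS.get(provider, set()):
                assignment[op] = provider
                break
    return assignment

result = assign_providers(
    ["Conv", "NonZero"],
    ["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(result)
# → {'Conv': 'CUDAExecutionProvider', 'NonZero': 'CPUExecutionProvider'}
```

Conv lands on the higher-priority CUDA provider, while NonZero, unsupported there in this hypothetical table, falls back to the CPU provider.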
Usage
InferenceSession is created after configuring SessionOptions (or with default options):
```python
import onnxruntime as rt

# Basic usage with default options
sess = rt.InferenceSession("model.onnx", providers=rt.get_available_providers())

# With explicit options and provider selection
options = rt.SessionOptions()
sess = rt.InferenceSession(
    "model.onnx",
    options,
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
)
```