Principle: Microsoft Onnxruntime Inference Execution
Metadata
| Field | Value |
|---|---|
| Principle Name | Inference_Execution |
| Repository | Microsoft_Onnxruntime |
| Source Repository | https://github.com/microsoft/onnxruntime |
| Domain | ML_Inference, Model_Optimization |
| Last Updated | 2026-02-10 |
| Workflow | Python_Inference_Pipeline |
| Pair | 5 of 6 |
Overview
Execution of forward pass computation on an ONNX model to produce predictions from input data.
Description
The session.run() method feeds input data through the loaded ONNX model graph, executing operators on the selected execution providers. It returns the requested output tensors as numpy arrays.
The method signature is:
session.run(output_names: list[str] | None, input_feed: dict[str, numpy.ndarray]) -> list[numpy.ndarray]
- output_names -- A list of output tensor names to retrieve, or None to retrieve all outputs. Specifying only the needed outputs can improve performance by skipping unnecessary computation.
- input_feed -- A dictionary mapping input tensor names to numpy arrays containing the input data.
- Returns -- A list of numpy arrays, one per requested output, in the same order as the output_names list.
The usage is demonstrated at docs/python/examples/plot_load_and_predict.py:L54.
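Getting input_feed right is mostly a matter of matching the names, dtypes, and shapes the model declares. A minimal numpy-only sketch of that preparation (the input name "X" and the shape (3, 4) are illustrative assumptions, not from the source; in a real session the name would come from sess.get_inputs()[0].name):

```python
import numpy as np

# ONNX models commonly declare float32 inputs; feeding float64 arrays
# is rejected by the runtime, so cast explicitly before calling run().
x = np.random.rand(3, 4).astype(np.float32)

# Keys must match the graph's input tensor names exactly ("X" is assumed here).
input_feed = {"X": x}

# The actual call would then be:
# res = sess.run(None, input_feed)
```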
Theoretical Basis
The run() method triggers the execution engine, which walks the topologically sorted graph of operators. Each operator is dispatched to the appropriate execution provider (CPU, CUDA, TensorRT, etc.) based on the assignments made during session creation.
Key aspects of the execution model:
- Topological ordering -- Operators are executed in dependency order, ensuring all inputs to an operator are available before it runs.
- Provider dispatch -- Each operator kernel runs on its assigned execution provider, with automatic data transfer between providers when needed.
- Memory management -- The runtime manages tensor memory allocation and deallocation, using pre-computed memory patterns to minimize allocation overhead.
- Selective output -- When specific output names are provided, the execution engine can potentially skip computation of unused subgraphs (depending on graph structure and optimization).
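The ordering and dispatch behavior above can be sketched in pure Python. This is a toy model of the idea, not onnxruntime's implementation; the node names and the CPU/CUDA split are invented for illustration:

```python
from graphlib import TopologicalSorter

# Toy operator graph: each node maps to the set of nodes it depends on.
# (Invented example: MatMul feeds Add, which feeds Relu.)
graph = {"MatMul": set(), "Add": {"MatMul"}, "Relu": {"Add"}}

# Provider assignment, decided at "session creation" time (illustrative).
provider = {"MatMul": "CUDA", "Add": "CUDA", "Relu": "CPU"}

# Topological order guarantees every operator's inputs are ready first.
order = list(TopologicalSorter(graph).static_order())
trace = [(op, provider[op]) for op in order]
```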
The method is synchronous -- it blocks until all computation is complete and results are available as numpy arrays in CPU memory.
Usage
The run method is invoked with the desired output names and prepared input feed:
res = sess.run([output_name], {input_name: x})
# Or get all outputs:
res = sess.run(None, {input_name: x})
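Because run() returns outputs positionally, a common follow-up is pairing them back with their names. A hedged sketch using a stubbed result list (the names "probabilities" and "label" are hypothetical; real names come from sess.get_outputs()):

```python
import numpy as np

# Stand-in for what sess.run(output_names, input_feed) would return:
# one numpy array per requested output, in the requested order.
output_names = ["probabilities", "label"]          # illustrative names
res = [np.array([[0.9, 0.1]], dtype=np.float32),   # stubbed output values
       np.array([0], dtype=np.int64)]

# Zip the positional results back into a name -> array mapping.
named = dict(zip(output_names, res))
```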