Principle: Microsoft Onnxruntime Inference Execution
Metadata
| Field | Value |
|---|---|
| Principle Name | Inference_Execution |
| Repository | Microsoft_Onnxruntime |
| Source Repository | https://github.com/microsoft/onnxruntime |
| Domain | ML_Inference, Model_Optimization |
| Last Updated | 2026-02-10 |
| Workflow | Python_Inference_Pipeline |
| Pair | 5 of 6 |
Overview
Execution of forward pass computation on an ONNX model to produce predictions from input data.
Description
The session.run() method feeds input data through the loaded ONNX model graph, executing operators on the selected execution providers. It returns the requested output tensors as numpy arrays.
The method signature is:
session.run(output_names: list[str] | None, input_feed: dict[str, numpy.ndarray]) -> list[numpy.ndarray]
- output_names -- A list of output tensor names to retrieve, or None to retrieve all outputs. Specifying only the needed outputs can improve performance by skipping unnecessary computation.
- input_feed -- A dictionary mapping input tensor names to numpy arrays containing the input data.
- Returns -- A list of numpy arrays, one per requested output, in the same order as the output_names list.
The usage is demonstrated at docs/python/examples/plot_load_and_predict.py:L54.
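Getting input_feed right is mostly a matter of matching the names, dtypes, and shapes the model declares. A minimal numpy-only sketch of that preparation (the input name "X" and the shape (3, 4) are illustrative assumptions, not from the source; in a real session the name would come from sess.get_inputs()[0].name):

```python
import numpy as np

# ONNX models commonly declare float32 inputs; feeding float64 arrays
# is rejected by the runtime, so cast explicitly before calling run().
x = np.random.rand(3, 4).astype(np.float32)

# Keys must match the graph's input tensor names exactly ("X" is assumed here).
input_feed = {"X": x}

# The actual call would then be:
# res = sess.run(None, input_feed)
```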
Theoretical Basis
The run() method triggers the execution engine, which walks the topologically sorted graph of operators. Each operator is dispatched to the appropriate execution provider (CPU, CUDA, TensorRT, etc.) based on the assignments made during session creation.
Key aspects of the execution model:
- Topological ordering -- Operators are executed in dependency order, ensuring all inputs to an operator are available before it runs.
- Provider dispatch -- Each operator kernel runs on its assigned execution provider, with automatic data transfer between providers when needed.
- Memory management -- The runtime manages tensor memory allocation and deallocation, using pre-computed memory patterns to minimize allocation overhead.
- Selective output -- When specific output names are provided, the execution engine can potentially skip computation of unused subgraphs (depending on graph structure and optimization).
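The ordering and dispatch behavior above can be sketched in pure Python. This is a toy model of the idea, not onnxruntime's implementation; the node names and the CPU/CUDA split are invented for illustration:

```python
from graphlib import TopologicalSorter

# Toy operator graph: each node maps to the set of nodes it depends on.
# (Invented example: MatMul feeds Add, which feeds Relu.)
graph = {"MatMul": set(), "Add": {"MatMul"}, "Relu": {"Add"}}

# Provider assignment, decided at "session creation" time (illustrative).
provider = {"MatMul": "CUDA", "Add": "CUDA", "Relu": "CPU"}

# Topological order guarantees every operator's inputs are ready first.
order = list(TopologicalSorter(graph).static_order())
trace = [(op, provider[op]) for op in order]
```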
The method is synchronous -- it blocks until all computation is complete and results are available as numpy arrays in CPU memory.
Usage
The run method is invoked with the desired output names and prepared input feed:
res = sess.run([output_name], {input_name: x})
# Or get all outputs:
res = sess.run(None, {input_name: x})
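Because run() returns outputs positionally, a common follow-up is pairing them back with their names. A hedged sketch using a stubbed result list (the names "probabilities" and "label" are hypothetical; real names come from sess.get_outputs()):

```python
import numpy as np

# Stand-in for what sess.run(output_names, input_feed) would return:
# one numpy array per requested output, in the requested order.
output_names = ["probabilities", "label"]          # illustrative names
res = [np.array([[0.9, 0.1]], dtype=np.float32),   # stubbed output values
       np.array([0], dtype=np.int64)]

# Zip the positional results back into a name -> array mapping.
named = dict(zip(output_names, res))
```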