Principle: Microsoft Onnxruntime Inference Model Export
Overview
Conversion of a trained model into an optimized ONNX model suitable for inference deployment.
Metadata
| Field | Value |
|---|---|
| Principle Name | Inference_Model_Export |
| Category | API Doc |
| Domain | On_Device_Training, Model_Optimization |
| Repository | microsoft/onnxruntime |
| Source Reference | orttraining/orttraining/training_api/module.cc:L660-661 (definition), orttraining/orttraining/training_api/module.h:L145-146 (declaration) |
| Last Updated | 2026-02-10 |
Description
After training, the model's eval graph is transformed for inference by embedding trained weights and removing training-specific nodes (gradient computation, optimizer). This produces a standalone ONNX inference model.
The export process performs the following transformations:
- Output Selection -- The `graph_output_names` parameter specifies which model outputs to retain. Nodes that do not contribute to these outputs are pruned from the graph.
- Parameter Embedding -- Trainable and non-trainable parameters, which exist as graph inputs during training, are converted to constant initializers within the inference model. The current parameter values from the `CheckpointState` are embedded directly into the model.
- Graph Pruning -- Nodes related to gradient computation, loss calculation, or optimizer logic are removed. Only the forward inference path is retained.
The resulting model is a self-contained ONNX file that can be loaded by any ONNX Runtime inference session or compatible runtime without requiring a separate checkpoint file.
This operation requires a non-minimal build of ONNX Runtime (it is compiled out under ORT_MINIMAL_BUILD) because it relies on ONNX model manipulation that depends on the full protobuf library.
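Once exported, the model no longer depends on the training API or on a checkpoint. A minimal sketch of deploying it with a standard inference session is shown below; the file name, execution provider, and input shape are illustrative placeholders matching the usage example later on this page.

```python
import numpy as np
import onnxruntime as ort

# Load the exported model with a plain inference session -- no CheckpointState
# or training artifacts are needed at this point.
session = ort.InferenceSession("inference_model.onnx", providers=["CPUExecutionProvider"])

# Placeholder input; the real name and shape depend on the exported graph.
input_name = session.get_inputs()[0].name
inputs = {input_name: np.random.rand(1, 28 * 28).astype(np.float32)}

# Runs only the retained forward path and returns the requested outputs.
outputs = session.run(None, inputs)
```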
Theoretical Basis
Inference models are optimized versions of training graphs with frozen weights, pruned backward passes, and additional graph optimizations for deployment efficiency.
- Weight Freezing -- During training, parameters are mutable graph inputs. For inference, they are embedded as constant initializers, enabling further graph optimizations such as constant folding and operator fusion.
- Graph Simplification -- Removing the backward graph, gradient accumulation nodes, and loss computation nodes reduces the model size and eliminates unnecessary computation during inference.
- Output Specification -- Training models often produce auxiliary outputs (loss values, intermediate activations). The inference model retains only the outputs needed for the deployment use case.
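These effects can be observed directly on the exported file. The sketch below uses the `onnx` Python package to contrast the eval graph with the exported inference graph; the file names are placeholders matching the usage example below.

```python
import onnx

def summarize(path, label):
    # After export, parameters move from graph.input to graph.initializer
    # (weight freezing), the node count drops (graph simplification), and
    # graph.output contains only the requested names (output specification).
    graph = onnx.load(path).graph
    print(f"{label}: {len(graph.input)} graph inputs, "
          f"{len(graph.initializer)} initializers, "
          f"{len(graph.node)} nodes, "
          f"outputs = {[o.name for o in graph.output]}")

summarize("eval_model.onnx", "eval graph")
summarize("inference_model.onnx", "inference graph")
```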
Usage
Export is typically performed after training completes:
```python
from onnxruntime.training.api import CheckpointState, Module

state = CheckpointState.load_checkpoint("checkpoints/final")
module = Module("training_model.onnx", state, "eval_model.onnx", device="cpu")

# Export the trained model for inference
module.export_model_for_inferencing(
    "inference_model.onnx",
    ["output_name_1", "output_name_2"],
)
```
In C++:
```cpp
std::vector<std::string> output_names = {"output_name_1", "output_name_2"};
Status status = module.ExportModelForInferencing("inference_model.onnx", output_names);
```
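In either language, the names passed in the second argument identify which model outputs to retain (for example, the prediction output of the eval graph). If you are unsure which names the eval graph currently exposes as outputs, they can be listed with the `onnx` Python package (a sketch; the eval model file name follows the examples above):

```python
import onnx

# Print the output names of the eval graph to choose which ones to retain.
eval_graph = onnx.load("eval_model.onnx").graph
print([output.name for output in eval_graph.output])
```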
Implemented By
Implementation:Microsoft_Onnxruntime_ExportModelForInferencing
Related Pages
- On-Device Training Loop -- Produces the trained parameters used for export
- Checkpoint Saving -- Alternative way to persist training state
- PyTorch Model Export -- The initial export step at the start of the pipeline