Principle: Microsoft Onnxruntime Inference Model Export
Overview
Conversion of a trained model into an optimized ONNX model suitable for inference deployment.
Metadata
| Field | Value |
|---|---|
| Principle Name | Inference_Model_Export |
| Category | API Doc |
| Domain | On_Device_Training, Model_Optimization |
| Repository | microsoft/onnxruntime |
| Source Reference | orttraining/orttraining/training_api/module.cc:L660-661 (definition), orttraining/orttraining/training_api/module.h:L145-146 (declaration) |
| Last Updated | 2026-02-10 |
Description
After training, the model's eval graph is transformed for inference by embedding trained weights and removing training-specific nodes (gradient computation, optimizer). This produces a standalone ONNX inference model.
The export process performs the following transformations:
- Output Selection -- The `graph_output_names` parameter specifies which model outputs to retain. Nodes that do not contribute to these outputs are pruned from the graph.
- Parameter Embedding -- Trainable and non-trainable parameters, which exist as graph inputs during training, are converted to constant initializers within the inference model. The current parameter values from the `CheckpointState` are embedded directly into the model.
- Graph Pruning -- Nodes related to gradient computation, loss calculation, or optimizer logic are removed. Only the forward inference path is retained.
The resulting model is a self-contained ONNX file that can be loaded by any ONNX Runtime inference session or compatible runtime without requiring a separate checkpoint file.
This operation requires a non-minimal build of ONNX Runtime (it is compiled out under ORT_MINIMAL_BUILD) because it relies on ONNX model manipulation that depends on the full protobuf library.
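Once exported, the model no longer depends on the training API or on a checkpoint. A minimal sketch of deploying it with a standard inference session is shown below; the file name, execution provider, and input shape are illustrative placeholders matching the usage example later on this page.

```python
import numpy as np
import onnxruntime as ort

# Load the exported model with a plain inference session -- no CheckpointState
# or training artifacts are needed at this point.
session = ort.InferenceSession("inference_model.onnx", providers=["CPUExecutionProvider"])

# Placeholder input; the real name and shape depend on the exported graph.
input_name = session.get_inputs()[0].name
inputs = {input_name: np.random.rand(1, 28 * 28).astype(np.float32)}

# Runs only the retained forward path and returns the requested outputs.
outputs = session.run(None, inputs)
```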
Theoretical Basis
Inference models are optimized versions of training graphs with frozen weights, pruned backward passes, and additional graph optimizations for deployment efficiency.
- Weight Freezing -- During training, parameters are mutable graph inputs. For inference, they are embedded as constant initializers, enabling further graph optimizations such as constant folding and operator fusion.
- Graph Simplification -- Removing the backward graph, gradient accumulation nodes, and loss computation nodes reduces the model size and eliminates unnecessary computation during inference.
- Output Specification -- Training models often produce auxiliary outputs (loss values, intermediate activations). The inference model retains only the outputs needed for the deployment use case.
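These effects can be observed directly on the exported file. The sketch below uses the `onnx` Python package to contrast the eval graph with the exported inference graph; the file names are placeholders matching the usage example below.

```python
import onnx

def summarize(path, label):
    # After export, parameters move from graph.input to graph.initializer
    # (weight freezing), the node count drops (graph simplification), and
    # graph.output contains only the requested names (output specification).
    graph = onnx.load(path).graph
    print(f"{label}: {len(graph.input)} graph inputs, "
          f"{len(graph.initializer)} initializers, "
          f"{len(graph.node)} nodes, "
          f"outputs = {[o.name for o in graph.output]}")

summarize("eval_model.onnx", "eval graph")
summarize("inference_model.onnx", "inference graph")
```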
Usage
Export is typically performed after training completes:
```python
from onnxruntime.training.api import CheckpointState, Module

state = CheckpointState.load_checkpoint("checkpoints/final")
module = Module("training_model.onnx", state, "eval_model.onnx", device="cpu")

# Export the trained model for inference
module.export_model_for_inferencing(
    "inference_model.onnx",
    ["output_name_1", "output_name_2"],
)
```
In C++:
```cpp
std::vector<std::string> output_names = {"output_name_1", "output_name_2"};
Status status = module.ExportModelForInferencing("inference_model.onnx", output_names);
```
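In either language, the names passed in the second argument identify which model outputs to retain (for example, the prediction output of the eval graph). If you are unsure which names the eval graph currently exposes as outputs, they can be listed with the `onnx` Python package (a sketch; the eval model file name follows the examples above):

```python
import onnx

# Print the output names of the eval graph to choose which ones to retain.
eval_graph = onnx.load("eval_model.onnx").graph
print([output.name for output in eval_graph.output])
```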
Implemented By
Implementation:Microsoft_Onnxruntime_ExportModelForInferencing
Related Pages
- On-Device Training Loop -- Produces the trained parameters used for export
- Checkpoint Saving -- Alternative way to persist training state
- PyTorch Model Export -- The initial export step at the start of the pipeline