Principle:Microsoft Onnxruntime Inference Model Export

From Leeroopedia


Overview

Conversion of a trained model into an optimized, standalone ONNX model suitable for inference deployment.

Metadata

Field Value
Principle Name Inference_Model_Export
Category API Doc
Domain On_Device_Training, Model_Optimization
Repository microsoft/onnxruntime
Source Reference orttraining/orttraining/training_api/module.cc:L660-661 (definition), orttraining/orttraining/training_api/module.h:L145-146 (declaration)
Last Updated 2026-02-10

Description

After training, the model's eval graph is transformed for inference by embedding trained weights and removing training-specific nodes (gradient computation, optimizer). This produces a standalone ONNX inference model.

The export process performs the following transformations:

  • Output Selection -- The graph_output_names parameter specifies which model outputs to retain. Nodes not contributing to these outputs are pruned from the graph.
  • Parameter Embedding -- Trainable and non-trainable parameters, which exist as graph inputs during training, are converted to constant initializers within the inference model. The current parameter values from the CheckpointState are embedded directly into the model.
  • Graph Pruning -- Any nodes related to gradient computation, loss calculation, or optimizer logic are removed. Only the forward inference path is retained.

The resulting model is a self-contained ONNX file that can be loaded by any ONNX Runtime inference session or compatible runtime without requiring a separate checkpoint file.

This operation requires a non-minimal build of ONNX Runtime (it is excluded from ORT_MINIMAL_BUILD) because it involves ONNX model manipulation operations that depend on the full protobuf library.

Theoretical Basis

Inference models are optimized versions of training graphs with frozen weights, pruned backward passes, and additional graph optimizations for deployment efficiency.

  • Weight Freezing -- During training, parameters are mutable graph inputs. For inference, they are embedded as constant initializers, enabling further graph optimizations such as constant folding and operator fusion.
  • Graph Simplification -- Removing the backward graph, gradient accumulation nodes, and loss computation nodes reduces the model size and eliminates unnecessary computation during inference.
  • Output Specification -- Training models often produce auxiliary outputs (loss values, intermediate activations). The inference model retains only the outputs needed for the deployment use case.

Usage

Export is typically performed after training completes:

from onnxruntime.training.api import CheckpointState, Module

state = CheckpointState.load_checkpoint("checkpoints/final")
module = Module("training_model.onnx", state, "eval_model.onnx", device="cpu")

# Export the trained model for inference
module.export_model_for_inferencing(
    "inference_model.onnx",
    ["output_name_1", "output_name_2"],
)

In C++:

std::vector<std::string> output_names = {"output_name_1", "output_name_2"};
Status status = module.ExportModelForInferencing("inference_model.onnx", output_names);

Implemented By

Implementation:Microsoft_Onnxruntime_ExportModelForInferencing
