Implementation:Microsoft Onnxruntime SaveCheckpoint
Overview
Serializes the current CheckpointState (model parameters, optimizer states, and user properties) to a FlatBuffers-encoded checkpoint file on disk.
Metadata
| Field | Value |
|---|---|
| Implementation Name | SaveCheckpoint |
| Type | API Doc |
| Language | C++ and Python |
| API | Python: CheckpointState.save_checkpoint(state, path); C++: SaveCheckpoint(const CheckpointState& state, const PathString& checkpoint_path, const bool include_optimizer_state) -> Status |
| Import | from onnxruntime.training.api import CheckpointState |
| Domain | On_Device_Training, Training_Infrastructure |
| Repository | microsoft/onnxruntime |
| Source Reference | orttraining/orttraining/training_api/checkpoint.cc:L951 (C++) |
| Last Updated | 2026-02-10 |
Description
The SaveCheckpoint function serializes the complete training state to disk using the FlatBuffers format. The C++ implementation enforces a little-endian check before delegating to the internal save::FromCheckpointState function.
An additional overload (available only in non-minimal builds) accepts raw TensorProto objects for saving ONNX initializers directly. This overload writes tensor data to a separate external data file when its total size exceeds the configurable threshold (default 1800 MiB, i.e. 1800 * 1024 * 1024 bytes, per the signature below).
The include_optimizer_state parameter controls whether the optimizer's per-parameter momentum buffers are included, allowing users to create smaller checkpoints when only model parameters are needed.
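To make the size trade-off concrete, here is a back-of-the-envelope sketch. It assumes an Adam-style optimizer that keeps two fp32 momentum buffers mirroring each parameter tensor (an assumption about the optimizer, not something this page states), and it ignores FlatBuffers framing overhead:

```python
# Rough checkpoint-size estimate. Assumes an Adam-style optimizer with two
# fp32 momentum buffers mirroring each parameter tensor; FlatBuffers framing
# overhead is ignored. This is an illustration, not the library's accounting.

def estimate_checkpoint_bytes(num_params: int,
                              include_optimizer_state: bool,
                              bytes_per_elem: int = 4) -> int:
    total = num_params * bytes_per_elem           # model parameters
    if include_optimizer_state:
        total += 2 * num_params * bytes_per_elem  # first + second moment buffers
    return total

# A 10M-parameter fp32 model:
full = estimate_checkpoint_bytes(10_000_000, include_optimizer_state=True)
slim = estimate_checkpoint_bytes(10_000_000, include_optimizer_state=False)
print(full, slim)  # 120000000 40000000
```

Under these assumptions, dropping the optimizer state cuts the checkpoint to roughly a third of its full size, which is why deployment checkpoints are typically saved with include_optimizer_state set to false.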
API Signature
C++
namespace onnxruntime::training::api {
Status SaveCheckpoint(const CheckpointState& state,
const PathString& checkpoint_path,
const bool include_optimizer_state);
#if !defined(ORT_MINIMAL_BUILD)
Status SaveCheckpoint(gsl::span<const ONNX_NAMESPACE::TensorProto> trainable_tensor_protos,
gsl::span<const ONNX_NAMESPACE::TensorProto> non_trainable_tensor_protos,
const PathString& checkpoint_path,
const bool nominal_checkpoint,
const size_t external_data_threshold = 1800 * 1024 * 1024);
#endif
} // namespace onnxruntime::training::api
Python
CheckpointState.save_checkpoint(state, path_to_checkpoint)
Key Parameters
| Parameter | Type | Description |
|---|---|---|
| state | const CheckpointState& / CheckpointState | The in-memory training state to serialize |
| checkpoint_path / path_to_checkpoint | PathString / str | File system path where the checkpoint will be saved |
| include_optimizer_state (C++) | bool | Whether to include optimizer momentum states in the checkpoint |
| external_data_threshold (C++, TensorProto overload) | size_t | Byte threshold above which tensor data is stored externally (default 1800 MiB) |
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | CheckpointState | In-memory state containing parameters, optimizer states, and properties |
| Output | Checkpoint file | FlatBuffers-encoded file on disk |
| Output (optional) | External data file | Separate file for large tensor data (created when threshold exceeded) |
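The contract above can be sketched as a small planning function. The threshold constant is taken from the C++ default in the signature; the ".data" suffix for the external file is purely illustrative and not the library's actual naming convention:

```python
# Sketch of the output-file decision implied by the I/O contract.
# The ".data" suffix is hypothetical; only the threshold value
# (1800 * 1024 * 1024 bytes) comes from the C++ default above.

EXTERNAL_DATA_THRESHOLD = 1800 * 1024 * 1024  # default external_data_threshold

def planned_outputs(total_tensor_bytes: int, checkpoint_path: str,
                    threshold: int = EXTERNAL_DATA_THRESHOLD) -> list:
    outputs = [checkpoint_path]                    # FlatBuffers checkpoint file
    if total_tensor_bytes > threshold:
        outputs.append(checkpoint_path + ".data")  # separate external data file
    return outputs

print(planned_outputs(1_000_000, "ckpt"))    # ['ckpt']
print(planned_outputs(2 * 1024**3, "ckpt"))  # ['ckpt', 'ckpt.data']
```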
Code Reference
From orttraining/orttraining/training_api/checkpoint.cc:
Status SaveCheckpoint(const CheckpointState& states, const PathString& checkpoint_path,
const bool include_optimizer_state) {
ORT_RETURN_IF_NOT(FLATBUFFERS_LITTLEENDIAN,
"ORT training checkpoint format only supports little-endian machines");
return save::FromCheckpointState(states, checkpoint_path, include_optimizer_state);
}
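The FLATBUFFERS_LITTLEENDIAN guard in the snippet above has a straightforward Python analogue using sys.byteorder. This is only a sketch to show what the guard checks; the real check is the C++ macro, not Python code:

```python
import sys

# Mirrors the C++ guard shown above: the ORT training checkpoint format is
# only written on little-endian hosts. (Illustrative; the actual check is the
# FLATBUFFERS_LITTLEENDIAN macro in the C++ implementation.)

def ensure_little_endian() -> None:
    if sys.byteorder != "little":
        raise RuntimeError(
            "ORT training checkpoint format only supports little-endian machines")

ensure_little_endian()  # no-op on x86/ARM little-endian hosts
```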
Usage Example
Python
from onnxruntime.training.api import CheckpointState
# After training loop completes
CheckpointState.save_checkpoint(state, "checkpoints/epoch_10")
# Set user properties before checkpointing
state.properties["epoch"] = 10
state.properties["best_accuracy"] = 0.95
CheckpointState.save_checkpoint(state, "checkpoints/epoch_10_with_meta")
C++
#include "orttraining/training_api/checkpoint.h"
using namespace onnxruntime::training::api;
// Save with optimizer state for training resumption
Status status = SaveCheckpoint(checkpoint_state,
ORT_TSTR("checkpoints/step_5000"),
/*include_optimizer_state=*/true);
// Save without optimizer state for deployment
status = SaveCheckpoint(checkpoint_state,
ORT_TSTR("checkpoints/deploy"),
/*include_optimizer_state=*/false);
Implements
Principle:Microsoft_Onnxruntime_Checkpoint_Saving
Related Pages
- CheckpointState Load -- The inverse operation for loading checkpoints
- Module TrainStep -- Generates the training state that is checkpointed
- ExportModelForInferencing -- Alternative export path using checkpoint parameters