
Implementation:Microsoft Onnxruntime SaveCheckpoint

From Leeroopedia


Overview

Serializes the current CheckpointState (model parameters, optimizer states, and user properties) to a FlatBuffers-encoded checkpoint file on disk.

Metadata

Field Value
Implementation Name SaveCheckpoint
Type API Doc
Language C++ and Python
API Python: CheckpointState.save_checkpoint(state, path), C++: SaveCheckpoint(const CheckpointState& state, const PathString& checkpoint_path, const bool include_optimizer_state) -> Status
Import from onnxruntime.training.api import CheckpointState
Domain On_Device_Training, Training_Infrastructure
Repository microsoft/onnxruntime
Source Reference orttraining/orttraining/training_api/checkpoint.cc:L951 (C++)
Last Updated 2026-02-10

Description

The SaveCheckpoint function serializes the complete training state to disk using the FlatBuffers format. The C++ implementation enforces a little-endian check before delegating to the internal save::FromCheckpointState function.

An additional overload (available only in non-minimal builds) accepts raw TensorProto objects for saving ONNX initializers directly. This overload writes tensor data to a separate external data file when it exceeds the configurable threshold (default 1800 * 1024 * 1024 bytes, roughly 1.8 GiB).
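The threshold arithmetic can be illustrated with a short sketch. This is pure Python, independent of ONNX Runtime, and the parameter counts are made up for illustration; the real check lives inside the C++ TensorProto overload:

```python
# Default external-data threshold from the C++ signature:
# 1800 * 1024 * 1024 bytes (~1.76 GiB).
EXTERNAL_DATA_THRESHOLD = 1800 * 1024 * 1024

def needs_external_data(num_elements: int, bytes_per_element: int) -> bool:
    """Return True when raw tensor data would exceed the threshold
    and therefore be stored in a separate external data file."""
    return num_elements * bytes_per_element > EXTERNAL_DATA_THRESHOLD

# A 7B-parameter tensor set in float32 far exceeds the threshold...
print(needs_external_data(7_000_000_000, 4))  # True
# ...while a 100M-parameter set in float32 fits inline.
print(needs_external_data(100_000_000, 4))    # False
```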

The include_optimizer_state parameter controls whether the optimizer's per-parameter momentum buffers are included, allowing users to create smaller checkpoints when only model parameters are needed.
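As a rough illustration of the size difference, the sketch below assumes an Adam-style optimizer that keeps two momentum buffers per parameter; this is an assumption for illustration only, since the actual checkpoint layout depends on the configured optimizer and data types:

```python
def approx_checkpoint_bytes(num_params: int,
                            bytes_per_param: int = 4,
                            include_optimizer_state: bool = True,
                            momentum_buffers: int = 2) -> int:
    """Rough serialized size: parameters, plus optional Adam-style
    momentum buffers (assumed: 2 buffers per parameter)."""
    size = num_params * bytes_per_param
    if include_optimizer_state:
        size += momentum_buffers * num_params * bytes_per_param
    return size

n = 125_000_000  # e.g. a ~125M-parameter model in float32
full = approx_checkpoint_bytes(n, include_optimizer_state=True)
lean = approx_checkpoint_bytes(n, include_optimizer_state=False)
print(full // (1024 * 1024), "MiB with optimizer state")  # 1430 MiB
print(lean // (1024 * 1024), "MiB parameters only")       # 476 MiB
```

Under these assumptions, dropping optimizer state shrinks the checkpoint to roughly one third of its full size, which is why it is useful for deployment artifacts.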

API Signature

C++

namespace onnxruntime::training::api {

Status SaveCheckpoint(const CheckpointState& state,
                      const PathString& checkpoint_path,
                      const bool include_optimizer_state);

#if !defined(ORT_MINIMAL_BUILD)
Status SaveCheckpoint(gsl::span<const ONNX_NAMESPACE::TensorProto> trainable_tensor_protos,
                      gsl::span<const ONNX_NAMESPACE::TensorProto> non_trainable_tensor_protos,
                      const PathString& checkpoint_path,
                      const bool nominal_checkpoint,
                      const size_t external_data_threshold = 1800 * 1024 * 1024);
#endif

}  // namespace onnxruntime::training::api

Python

CheckpointState.save_checkpoint(state, path_to_checkpoint)

Key Parameters

Parameter Type Description
state const CheckpointState& / CheckpointState The in-memory training state to serialize
checkpoint_path / path_to_checkpoint PathString / str File system path where the checkpoint will be saved
include_optimizer_state (C++) bool Whether to include optimizer momentum states in the checkpoint
external_data_threshold (C++, TensorProto overload) size_t Byte threshold above which tensor data is stored externally (default 1800 * 1024 * 1024 bytes)

I/O Contract

Direction Type Description
Input CheckpointState In-memory state containing parameters, optimizer states, and properties
Output Checkpoint file FlatBuffers-encoded file on disk
Output (optional) External data file Separate file for large tensor data (created when threshold exceeded)

Code Reference

From orttraining/orttraining/training_api/checkpoint.cc:

Status SaveCheckpoint(const CheckpointState& states, const PathString& checkpoint_path,
                      const bool include_optimizer_state) {
  ORT_RETURN_IF_NOT(FLATBUFFERS_LITTLEENDIAN,
                    "ORT training checkpoint format only supports little-endian machines");
  return save::FromCheckpointState(states, checkpoint_path, include_optimizer_state);
}
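The same little-endian precondition can be checked from Python before attempting a save. This sketch uses only the standard library and mirrors, but does not call, the C++ FLATBUFFERS_LITTLEENDIAN guard:

```python
import sys

def assert_little_endian() -> None:
    """Mirror the C++ guard: the ORT training checkpoint format
    only supports little-endian machines."""
    if sys.byteorder != "little":
        raise RuntimeError(
            "ORT training checkpoint format only supports "
            "little-endian machines")

assert_little_endian()  # no-op on little-endian hosts (x86, common ARM)
```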

Usage Example

Python

from onnxruntime.training.api import CheckpointState

# After training loop completes
CheckpointState.save_checkpoint(state, "checkpoints/epoch_10")

# Save user properties before checkpointing
state.properties["epoch"] = 10
state.properties["best_accuracy"] = 0.95
CheckpointState.save_checkpoint(state, "checkpoints/epoch_10_with_meta")

C++

#include "orttraining/training_api/checkpoint.h"

using namespace onnxruntime::training::api;

// Save with optimizer state for training resumption
Status status = SaveCheckpoint(checkpoint_state,
                               ORT_TSTR("checkpoints/step_5000"),
                               /*include_optimizer_state=*/true);

// Save without optimizer state for deployment
status = SaveCheckpoint(checkpoint_state,
                        ORT_TSTR("checkpoints/deploy"),
                        /*include_optimizer_state=*/false);

Implements

Principle:Microsoft_Onnxruntime_Checkpoint_Saving
