
Implementation:Microsoft Onnxruntime SaveCheckpoint

From Leeroopedia


Overview

Serializes the current CheckpointState (model parameters, optimizer states, and user properties) to a FlatBuffers-encoded checkpoint file on disk.

Metadata

Field Value
Implementation Name SaveCheckpoint
Type API Doc
Language C++ and Python
API Python: CheckpointState.save_checkpoint(state, path), C++: SaveCheckpoint(const CheckpointState& state, const PathString& checkpoint_path, const bool include_optimizer_state) -> Status
Import from onnxruntime.training.api import CheckpointState
Domain On_Device_Training, Training_Infrastructure
Repository microsoft/onnxruntime
Source Reference orttraining/orttraining/training_api/checkpoint.cc:L951 (C++)
Last Updated 2026-02-10

Description

The SaveCheckpoint function serializes the complete training state to disk using the FlatBuffers format. The C++ implementation enforces a little-endian check before delegating to the internal save::FromCheckpointState function.

An additional overload (available only in non-minimal builds) accepts raw TensorProto objects for saving ONNX initializers directly. This overload writes tensor data to a separate external data file when it exceeds the configurable threshold (default 1800 * 1024 * 1024 bytes, roughly 1.8 GiB).
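The threshold arithmetic can be illustrated with a short sketch. This is pure Python, independent of ONNX Runtime, and the parameter counts are made up for illustration; the real check lives inside the C++ TensorProto overload:

```python
# Default external-data threshold from the C++ signature:
# 1800 * 1024 * 1024 bytes (~1.76 GiB).
EXTERNAL_DATA_THRESHOLD = 1800 * 1024 * 1024

def needs_external_data(num_elements: int, bytes_per_element: int) -> bool:
    """Return True when raw tensor data would exceed the threshold
    and therefore be stored in a separate external data file."""
    return num_elements * bytes_per_element > EXTERNAL_DATA_THRESHOLD

# A 7B-parameter tensor set in float32 far exceeds the threshold...
print(needs_external_data(7_000_000_000, 4))  # True
# ...while a 100M-parameter set in float32 fits inline.
print(needs_external_data(100_000_000, 4))    # False
```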

The include_optimizer_state parameter controls whether the optimizer's per-parameter momentum buffers are included, allowing users to create smaller checkpoints when only model parameters are needed.
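As a rough illustration of the size difference, the sketch below assumes an Adam-style optimizer that keeps two momentum buffers per parameter; this is an assumption for illustration only, since the actual checkpoint layout depends on the configured optimizer and data types:

```python
def approx_checkpoint_bytes(num_params: int,
                            bytes_per_param: int = 4,
                            include_optimizer_state: bool = True,
                            momentum_buffers: int = 2) -> int:
    """Rough serialized size: parameters, plus optional Adam-style
    momentum buffers (assumed: 2 buffers per parameter)."""
    size = num_params * bytes_per_param
    if include_optimizer_state:
        size += momentum_buffers * num_params * bytes_per_param
    return size

n = 125_000_000  # e.g. a ~125M-parameter model in float32
full = approx_checkpoint_bytes(n, include_optimizer_state=True)
lean = approx_checkpoint_bytes(n, include_optimizer_state=False)
print(full // (1024 * 1024), "MiB with optimizer state")  # 1430 MiB
print(lean // (1024 * 1024), "MiB parameters only")       # 476 MiB
```

Under these assumptions, dropping optimizer state shrinks the checkpoint to roughly one third of its full size, which is why it is useful for deployment artifacts.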

API Signature

C++

namespace onnxruntime::training::api {

Status SaveCheckpoint(const CheckpointState& state,
                      const PathString& checkpoint_path,
                      const bool include_optimizer_state);

#if !defined(ORT_MINIMAL_BUILD)
Status SaveCheckpoint(gsl::span<const ONNX_NAMESPACE::TensorProto> trainable_tensor_protos,
                      gsl::span<const ONNX_NAMESPACE::TensorProto> non_trainable_tensor_protos,
                      const PathString& checkpoint_path,
                      const bool nominal_checkpoint,
                      const size_t external_data_threshold = 1800 * 1024 * 1024);
#endif

}  // namespace onnxruntime::training::api

Python

CheckpointState.save_checkpoint(state, path_to_checkpoint)

Key Parameters

Parameter Type Description
state const CheckpointState& / CheckpointState The in-memory training state to serialize
checkpoint_path / path_to_checkpoint PathString / str File system path where the checkpoint will be saved
include_optimizer_state (C++) bool Whether to include optimizer momentum states in the checkpoint
external_data_threshold (C++, TensorProto overload) size_t Byte threshold above which tensor data is stored externally (default 1800 * 1024 * 1024 bytes)

I/O Contract

Direction Type Description
Input CheckpointState In-memory state containing parameters, optimizer states, and properties
Output Checkpoint file FlatBuffers-encoded file on disk
Output (optional) External data file Separate file for large tensor data (created when threshold exceeded)

Code Reference

From orttraining/orttraining/training_api/checkpoint.cc:

Status SaveCheckpoint(const CheckpointState& states, const PathString& checkpoint_path,
                      const bool include_optimizer_state) {
  ORT_RETURN_IF_NOT(FLATBUFFERS_LITTLEENDIAN,
                    "ORT training checkpoint format only supports little-endian machines");
  return save::FromCheckpointState(states, checkpoint_path, include_optimizer_state);
}
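The same little-endian precondition can be checked from Python before attempting a save. This sketch uses only the standard library and mirrors, but does not call, the C++ FLATBUFFERS_LITTLEENDIAN guard:

```python
import sys

def assert_little_endian() -> None:
    """Mirror the C++ guard: the ORT training checkpoint format
    only supports little-endian machines."""
    if sys.byteorder != "little":
        raise RuntimeError(
            "ORT training checkpoint format only supports "
            "little-endian machines")

assert_little_endian()  # no-op on little-endian hosts (x86, common ARM)
```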

Usage Example

Python

from onnxruntime.training.api import CheckpointState

# After training loop completes
CheckpointState.save_checkpoint(state, "checkpoints/epoch_10")

# Save user properties before checkpointing
state.properties["epoch"] = 10
state.properties["best_accuracy"] = 0.95
CheckpointState.save_checkpoint(state, "checkpoints/epoch_10_with_meta")

C++

#include "orttraining/training_api/checkpoint.h"

using namespace onnxruntime::training::api;

// Save with optimizer state for training resumption
Status status = SaveCheckpoint(checkpoint_state,
                               ORT_TSTR("checkpoints/step_5000"),
                               /*include_optimizer_state=*/true);

// Save without optimizer state for deployment
status = SaveCheckpoint(checkpoint_state,
                        ORT_TSTR("checkpoints/deploy"),
                        /*include_optimizer_state=*/false);

Implements

Principle:Microsoft_Onnxruntime_Checkpoint_Saving
