Principle:Microsoft Onnxruntime Checkpoint Saving
Overview
Serialization of the current training state to disk for later resumption or deployment.
Metadata
| Field | Value |
|---|---|
| Principle Name | Checkpoint_Saving |
| Category | API Doc |
| Domain | On_Device_Training, Training_Infrastructure |
| Repository | microsoft/onnxruntime |
| Source Reference | orttraining/orttraining/training_api/checkpoint.cc:L951 (C++) |
| Last Updated | 2026-02-10 |
Description
Checkpoint saving serializes the `CheckpointState` (including model parameters, optimizer states, and user properties) to a FlatBuffers-encoded file. An optional external data file is created when the checkpoint exceeds the size threshold described below.
The saving process captures the following state:
- Model Parameters -- All trainable and non-trainable parameter tensors are serialized. Each parameter's name, tensor data, shape, and data type are preserved.
- Optimizer States -- When `include_optimizer_state` is true, the optimizer's per-parameter momentum buffers (e.g., first-order and second-order moments for AdamW) are serialized along with the group-level state (step count, learning rate).
- User Properties -- Any key-value properties stored in the `PropertyBag` (e.g., epoch number, custom metrics) are included in the checkpoint; the sketch after this list shows how such properties are attached before saving.
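In practice, all three kinds of state are visible on the `CheckpointState` object before it is serialized. The following is a minimal sketch, assuming a recent onnxruntime-training release in which `CheckpointState` exposes dict-like `parameters` and `properties` views (the exact accessor has varied across releases); paths and property names here are illustrative only.

```python
from onnxruntime.training.api import CheckpointState

# Load a previously generated checkpoint (path is hypothetical).
state = CheckpointState.load_checkpoint("checkpoints/step_900")

# Attach user properties; int, float, and str values are supported.
# These land in the PropertyBag and are serialized with the checkpoint.
state.properties["epoch"] = 3
state.properties["best_eval_loss"] = 0.412

# Serialize parameters, optimizer state, and properties to disk.
CheckpointState.save_checkpoint(state, "checkpoints/step_1000", include_optimizer_state=True)
```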
The checkpoint format uses FlatBuffers serialization for efficient storage and fast deserialization. When the total tensor data exceeds the external data threshold (default 1.8 GB), tensor data is written to a separate external file. This keeps the FlatBuffers buffer itself under the 2 GB limit imposed by its 32-bit offsets (a single buffer cannot address more than 2^31 bytes), with the 1.8 GB threshold leaving headroom for the remaining metadata.
The `include_optimizer_state` parameter controls whether optimizer states are saved. Setting it to `false` produces a smaller checkpoint suitable for inference deployment or fine-tuning with a fresh optimizer, while `true` preserves the complete training state for exact resumption.
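To make the size difference concrete, the sketch below saves the same state both ways and compares file sizes. It assumes the checkpoint is written as a single file (i.e., small enough that no external data file is produced) and uses hypothetical paths. For AdamW, the two momentum buffers roughly triple the parameter bytes, so the slim checkpoint is often about a third of the full one.

```python
import os
from onnxruntime.training.api import CheckpointState

state = CheckpointState.load_checkpoint("checkpoints/step_1000")

# Full checkpoint: parameters plus optimizer moments (for resumption).
CheckpointState.save_checkpoint(state, "checkpoints/full", include_optimizer_state=True)

# Slim checkpoint: parameters only (for deployment or fresh fine-tuning).
CheckpointState.save_checkpoint(state, "checkpoints/slim", include_optimizer_state=False)

for name in ("full", "slim"):
    size_mb = os.path.getsize(os.path.join("checkpoints", name)) / 2**20
    print(f"{name}: {size_mb:.1f} MiB")
```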
Theoretical Basis
Persistent checkpointing enables fault-tolerant training and transfer learning by preserving the complete training state at any point during training.
- Fault Tolerance -- Regular checkpoint saving ensures that at most one checkpoint interval of training progress is lost in the event of a failure. The cost of checkpointing is a trade-off between I/O overhead and the potential cost of lost training time (see the training-loop sketch after this list).
- Training Resumption -- A saved checkpoint can be loaded to resume training from exactly where it left off, including optimizer momentum states. Without optimizer states, the optimizer must "warm up" again, potentially causing a temporary increase in loss.
- Model Deployment -- Checkpoints serve as the source of trained weights for inference model export. The parameters stored in the checkpoint can be embedded into a standalone inference ONNX model.
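To make the fault-tolerance trade-off concrete, here is a minimal training-loop sketch that checkpoints every `CHECKPOINT_INTERVAL` steps using the on-device training API. The artifact file names, the `batches` iterable, and the interval are assumptions; the artifacts themselves would come from `onnxruntime.training.artifacts.generate_artifacts`.

```python
from onnxruntime.training.api import CheckpointState, Module, Optimizer

# Artifact paths are hypothetical.
state = CheckpointState.load_checkpoint("artifacts/checkpoint")
module = Module("artifacts/training_model.onnx", state, "artifacts/eval_model.onnx")
optimizer = Optimizer("artifacts/optimizer_model.onnx", module)

CHECKPOINT_INTERVAL = 1000  # at most this many steps of progress are lost on a crash

module.train()
for step, (inputs, labels) in enumerate(batches):  # `batches` is assumed to exist
    loss = module(inputs, labels)   # forward + backward in one call
    optimizer.step()                # apply the optimizer update
    module.lazy_reset_grad()        # clear gradients before the next step
    if (step + 1) % CHECKPOINT_INTERVAL == 0:
        # Saving with optimizer state allows exact resumption after a failure.
        CheckpointState.save_checkpoint(
            state, f"checkpoints/step_{step + 1}", include_optimizer_state=True
        )
```

Resumption is the inverse operation: load the latest checkpoint with `CheckpointState.load_checkpoint` and rebuild the `Module` and `Optimizer` around the restored state.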
Usage
```python
from onnxruntime.training.api import CheckpointState

# Save checkpoint with optimizer state (for training resumption)
CheckpointState.save_checkpoint(state, "checkpoints/step_1000", include_optimizer_state=True)

# Save checkpoint without optimizer state (for deployment)
CheckpointState.save_checkpoint(state, "checkpoints/deploy_checkpoint", include_optimizer_state=False)
```
In C++:
```cpp
// Save with optimizer state
Status status = SaveCheckpoint(state, checkpoint_path, /*include_optimizer_state=*/true);

// Save without optimizer state
status = SaveCheckpoint(state, checkpoint_path, /*include_optimizer_state=*/false);
```
Implemented By
Implementation:Microsoft_Onnxruntime_SaveCheckpoint
Related Pages
- Checkpoint Loading -- The inverse operation of checkpoint saving
- On-Device Training Loop -- Generates the state that is checkpointed
- Inference Model Export -- Uses checkpoint parameters for export