Principle:Microsoft Onnxruntime Checkpoint Saving
Overview
Serialization of the current training state to disk for later resumption or deployment.
Metadata
| Field | Value |
|---|---|
| Principle Name | Checkpoint_Saving |
| Category | API Doc |
| Domain | On_Device_Training, Training_Infrastructure |
| Repository | microsoft/onnxruntime |
| Source Reference | orttraining/orttraining/training_api/checkpoint.cc:L951 (C++) |
| Last Updated | 2026-02-10 |
Description
Checkpoint saving serializes the `CheckpointState` (including model parameters, optimizer states, and user properties) to a FlatBuffers-encoded file. An optional external data file is created when the checkpoint exceeds the size threshold described below.
The saving process captures the following state:
- Model Parameters -- All trainable and non-trainable parameter tensors are serialized. Each parameter's name, tensor data, shape, and data type are preserved.
- Optimizer States -- When `include_optimizer_state` is true, the optimizer's per-parameter momentum buffers (e.g., first-order and second-order moments for AdamW) are serialized along with the group-level state (step count, learning rate).
- User Properties -- Any key-value properties stored in the `PropertyBag` (e.g., epoch number, custom metrics) are included in the checkpoint; the sketch after this list shows how such properties are attached before saving.
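In practice, all three kinds of state are visible on the `CheckpointState` object before it is serialized. The following is a minimal sketch, assuming a recent onnxruntime-training release in which `CheckpointState` exposes dict-like `parameters` and `properties` views (the exact accessor has varied across releases); paths and property names here are illustrative only.

```python
from onnxruntime.training.api import CheckpointState

# Load a previously generated checkpoint (path is hypothetical).
state = CheckpointState.load_checkpoint("checkpoints/step_900")

# Attach user properties; int, float, and str values are supported.
# These land in the PropertyBag and are serialized with the checkpoint.
state.properties["epoch"] = 3
state.properties["best_eval_loss"] = 0.412

# Serialize parameters, optimizer state, and properties to disk.
CheckpointState.save_checkpoint(state, "checkpoints/step_1000", include_optimizer_state=True)
```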
The checkpoint format uses FlatBuffers serialization for efficient storage and fast deserialization. When the total tensor data exceeds the external data threshold (default 1.8 GB), tensor data is written to a separate external file. This keeps the FlatBuffers buffer itself under the 2 GB limit imposed by its 32-bit offsets (a single buffer cannot address more than 2^31 bytes), with the 1.8 GB threshold leaving headroom for the remaining metadata.
The `include_optimizer_state` parameter controls whether optimizer states are saved. Setting it to `false` produces a smaller checkpoint suitable for inference deployment or fine-tuning with a fresh optimizer, while `true` preserves the complete training state for exact resumption.
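To make the size difference concrete, the sketch below saves the same state both ways and compares file sizes. It assumes the checkpoint is written as a single file (i.e., small enough that no external data file is produced) and uses hypothetical paths. For AdamW, the two momentum buffers roughly triple the parameter bytes, so the slim checkpoint is often about a third of the full one.

```python
import os
from onnxruntime.training.api import CheckpointState

state = CheckpointState.load_checkpoint("checkpoints/step_1000")

# Full checkpoint: parameters plus optimizer moments (for resumption).
CheckpointState.save_checkpoint(state, "checkpoints/full", include_optimizer_state=True)

# Slim checkpoint: parameters only (for deployment or fresh fine-tuning).
CheckpointState.save_checkpoint(state, "checkpoints/slim", include_optimizer_state=False)

for name in ("full", "slim"):
    size_mb = os.path.getsize(os.path.join("checkpoints", name)) / 2**20
    print(f"{name}: {size_mb:.1f} MiB")
```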
Theoretical Basis
Persistent checkpointing enables fault-tolerant training and transfer learning by preserving the complete training state at any point during training.
- Fault Tolerance -- Regular checkpoint saving ensures that at most one checkpoint interval of training progress is lost in the event of a failure. The cost of checkpointing is a trade-off between I/O overhead and the potential cost of lost training time (see the training-loop sketch after this list).
- Training Resumption -- A saved checkpoint can be loaded to resume training from exactly where it left off, including optimizer momentum states. Without optimizer states, the optimizer must "warm up" again, potentially causing a temporary increase in loss.
- Model Deployment -- Checkpoints serve as the source of trained weights for inference model export. The parameters stored in the checkpoint can be embedded into a standalone inference ONNX model.
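To make the fault-tolerance trade-off concrete, here is a minimal training-loop sketch that checkpoints every `CHECKPOINT_INTERVAL` steps using the on-device training API. The artifact file names, the `batches` iterable, and the interval are assumptions; the artifacts themselves would come from `onnxruntime.training.artifacts.generate_artifacts`.

```python
from onnxruntime.training.api import CheckpointState, Module, Optimizer

# Artifact paths are hypothetical.
state = CheckpointState.load_checkpoint("artifacts/checkpoint")
module = Module("artifacts/training_model.onnx", state, "artifacts/eval_model.onnx")
optimizer = Optimizer("artifacts/optimizer_model.onnx", module)

CHECKPOINT_INTERVAL = 1000  # at most this many steps of progress are lost on a crash

module.train()
for step, (inputs, labels) in enumerate(batches):  # `batches` is assumed to exist
    loss = module(inputs, labels)   # forward + backward in one call
    optimizer.step()                # apply the optimizer update
    module.lazy_reset_grad()        # clear gradients before the next step
    if (step + 1) % CHECKPOINT_INTERVAL == 0:
        # Saving with optimizer state allows exact resumption after a failure.
        CheckpointState.save_checkpoint(
            state, f"checkpoints/step_{step + 1}", include_optimizer_state=True
        )
```

Resumption is the inverse operation: load the latest checkpoint with `CheckpointState.load_checkpoint` and rebuild the `Module` and `Optimizer` around the restored state.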
Usage
```python
from onnxruntime.training.api import CheckpointState

# Save checkpoint with optimizer state (for training resumption)
CheckpointState.save_checkpoint(state, "checkpoints/step_1000", include_optimizer_state=True)

# Save checkpoint without optimizer state (for deployment)
CheckpointState.save_checkpoint(state, "checkpoints/deploy_checkpoint", include_optimizer_state=False)
```
In C++:
```cpp
// Save with optimizer state
Status status = SaveCheckpoint(state, checkpoint_path, /*include_optimizer_state=*/true);

// Save without optimizer state
status = SaveCheckpoint(state, checkpoint_path, /*include_optimizer_state=*/false);
```
Implemented By
Implementation:Microsoft_Onnxruntime_SaveCheckpoint
Related Pages
- Checkpoint Loading -- The inverse operation of checkpoint saving
- On-Device Training Loop -- Generates the state that is checkpointed
- Inference Model Export -- Uses checkpoint parameters for export