
Principle:Microsoft Onnxruntime Checkpoint Saving

From Leeroopedia


Overview

Serialization of the current training state to disk for later resumption or deployment.

Metadata

Field Value
Principle Name Checkpoint_Saving
Category API Doc
Domain On_Device_Training, Training_Infrastructure
Repository microsoft/onnxruntime
Source Reference orttraining/orttraining/training_api/checkpoint.cc:L951 (C++)
Last Updated 2026-02-10

Description

Checkpoint saving serializes the CheckpointState (including model parameters, optimizer states, and user properties) to a flatbuffers-encoded file. An optional external data file is created when the checkpoint exceeds size thresholds.

The saving process captures the following state:

  • Model Parameters -- All trainable and non-trainable parameter tensors are serialized. Each parameter's name, tensor data, shape, and data type are preserved.
  • Optimizer States -- When include_optimizer_state is true, the optimizer's per-parameter momentum buffers (e.g., first-order and second-order moments for AdamW) are serialized along with the group-level state (step count, learning rate).
  • User Properties -- Any key-value properties stored in the PropertyBag (e.g., epoch number, custom metrics) are included in the checkpoint.
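Purely as an illustration of what gets captured (the real serialization is FlatBuffers in C++, and these field names are hypothetical, not the actual schema), the three components above can be sketched as:

```python
from dataclasses import dataclass, field

# Illustrative model of the saved state; field names are hypothetical,
# not the actual FlatBuffers schema used by checkpoint.cc.
@dataclass
class CheckpointSketch:
    # name -> (shape, dtype, raw bytes) for every parameter tensor
    parameters: dict = field(default_factory=dict)
    # per-parameter moment buffers plus group-level step count / learning rate
    optimizer_state: dict = field(default_factory=dict)
    # free-form user key-value pairs (epoch number, custom metrics, ...)
    property_bag: dict = field(default_factory=dict)
```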

The checkpoint format uses FlatBuffers serialization for efficient storage and fast deserialization. When the total data size exceeds the external data threshold (default 1.8 GB), tensor data is written to a separate external file to avoid exceeding the 2 GB limit imposed by 32-bit offsets in the FlatBuffers format.
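The threshold decision can be sketched as follows (a simplified illustration; the actual logic and constant live in checkpoint.cc, and the helper name here is hypothetical):

```python
# ~1.8 GB threshold, kept safely below the 2 GB limit imposed by
# 32-bit offsets in the FlatBuffers format.
EXTERNAL_DATA_THRESHOLD = int(1.8 * 1024**3)

def needs_external_data(tensor_sizes_bytes):
    """Return True when total tensor data should be written to a
    separate external file rather than inlined in the checkpoint."""
    return sum(tensor_sizes_bytes) > EXTERNAL_DATA_THRESHOLD
```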

The include_optimizer_state parameter controls whether optimizer states are saved. Setting it to false produces a smaller checkpoint suitable for inference deployment or fine-tuning with a fresh optimizer, while true preserves the complete training state for exact resumption.
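The size savings are easy to estimate back-of-envelope. Assuming fp32 parameters and AdamW with two fp32 moment buffers per parameter (an assumption, not a measurement of the actual format overhead), dropping optimizer state shrinks the tensor payload roughly threefold:

```python
def checkpoint_bytes(num_params, include_optimizer_state, bytes_per_elem=4):
    """Rough tensor-payload size of a checkpoint, ignoring metadata.

    Assumes AdamW-style optimizers with two moment buffers per parameter."""
    param_bytes = num_params * bytes_per_elem
    moment_bytes = 2 * param_bytes if include_optimizer_state else 0
    return param_bytes + moment_bytes
```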

Theoretical Basis

Persistent checkpointing enables fault-tolerant training and transfer learning by preserving the complete training state at any point during training.

  • Fault Tolerance -- Regular checkpoint saving ensures that at most one checkpoint interval of training progress is lost in the event of a failure. The cost of checkpointing is a trade-off between I/O overhead and the potential cost of lost training time.
  • Training Resumption -- A saved checkpoint can be loaded to resume training from exactly where it left off, including optimizer momentum states. Without optimizer states, the optimizer must "warm up" again, potentially causing a temporary increase in loss.
  • Model Deployment -- Checkpoints serve as the source of trained weights for inference model export. The parameters stored in the checkpoint can be embedded into a standalone inference ONNX model.
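The fault-tolerance trade-off above can be made concrete with a small sketch. Assuming a failure is equally likely at any point within an interval (so on average half an interval of work is lost per failure) and that each save costs a fixed amount of I/O time, the total overhead of a checkpointing schedule is:

```python
def expected_overhead(total_steps, interval, save_cost_s, expected_failures, step_time_s):
    """Expected time overhead of checkpointing every `interval` steps.

    io_cost:   time spent writing checkpoints over the whole run
    lost_work: expected recomputation, half an interval per failure on average
    """
    io_cost = (total_steps // interval) * save_cost_s
    lost_work = expected_failures * (interval / 2) * step_time_s
    return io_cost + lost_work
```

Shorter intervals shrink the lost-work term but inflate the I/O term; the best interval balances the two for a given failure rate and save cost.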

Usage

from onnxruntime.training.api import CheckpointState

# Save checkpoint with optimizer state (for training resumption)
CheckpointState.save_checkpoint(state, "checkpoints/step_1000", include_optimizer_state=True)

# Save checkpoint without optimizer state (for deployment)
CheckpointState.save_checkpoint(state, "checkpoints/deploy_checkpoint", include_optimizer_state=False)

In C++:

// Save with optimizer state
Status status = SaveCheckpoint(state, checkpoint_path, /*include_optimizer_state=*/true);

// Save without optimizer state
Status status = SaveCheckpoint(state, checkpoint_path, /*include_optimizer_state=*/false);

Implemented By

Implementation:Microsoft_Onnxruntime_SaveCheckpoint
