
Principle: Checkpoint Saving (LLMBook-zh.github.io)

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Training
Last Updated 2026-02-08 00:00 GMT

Overview

The practice of persisting trained model weights and training state to disk for recovery, evaluation, or deployment.

Description

Checkpoint Saving guards training progress against loss from hardware failures and makes the trained model available for distribution afterwards. It involves saving the model weights (parameters) and, optionally, the full trainer state (optimizer, learning-rate scheduler, and RNG states) to a directory. A save_only_model option reduces disk usage by skipping the optimizer state, which is appropriate for final checkpoints that will never be resumed.
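The distinction between a full checkpoint and a model-only checkpoint can be sketched with a minimal, library-agnostic helper. The function name save_checkpoint, the file names, and the use of pickle are illustrative assumptions, not any particular framework's API:

```python
import os
import pickle

def save_checkpoint(model_state, trainer_state, path, save_only_model=False):
    """Persist model weights and, unless save_only_model is set,
    the full trainer state (optimizer, scheduler, RNG) alongside them.
    Hypothetical sketch: real frameworks use their own formats."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "model.pkl"), "wb") as f:
        pickle.dump(model_state, f)          # weights: always saved
    if not save_only_model:
        with open(os.path.join(path, "trainer_state.pkl"), "wb") as f:
            pickle.dump(trainer_state, f)    # optimizer etc.: skipped for final export
```

A model-only save writes a single weights file, while a full save adds the trainer state needed to resume training exactly where it stopped.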

Usage

Use this principle at the end of training to save the final model, and optionally at regular intervals during training for fault tolerance. Save full state if you plan to resume training; save only the model for deployment.
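Interval checkpointing plus resume-on-restart can be illustrated with a toy loop; the train function, its state dictionary, and the single "latest" checkpoint file are assumptions made for the sketch, not a specific trainer's behavior:

```python
import os
import pickle

def train(total_steps, save_every, ckpt_path):
    """Run (or resume) a toy training loop, saving state every save_every steps."""
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(ckpt_path):                 # resume from the last checkpoint
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    while state["step"] < total_steps:
        state["step"] += 1                        # stand-in for one optimizer step
        state["loss"] *= 0.9
        if state["step"] % save_every == 0:       # periodic fault-tolerance save
            with open(ckpt_path, "wb") as f:
                pickle.dump(state, f)
    return state
```

Calling train(6, 2, path) and then, after a simulated interruption, train(10, 2, path) continues from step 6 rather than restarting at step 0, because the second call loads the saved state before looping.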

Theoretical Basis

Checkpoint saving involves serializing:

  1. Model weights: The learned parameters (tensors) of the neural network.
  2. Trainer state (optional): Optimizer state, learning rate scheduler state, and random number generator states for exact reproducibility.
  3. Configuration: Model config and tokenizer files for loading.
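The role of the RNG state in item 2 is what makes resumed training bit-for-bit reproducible: restoring the generator state continues the random stream exactly where it left off. A minimal sketch using Python's built-in random module (real trainers would additionally capture framework and GPU RNG states):

```python
import pickle
import random

random.seed(42)
random.random()                          # advance the generator mid-training

saved = pickle.dumps(random.getstate())  # capture RNG state with the checkpoint
a = random.random()                      # next draw after the "checkpoint"

random.setstate(pickle.loads(saved))     # restore on resume
b = random.random()                      # the stream continues identically
assert a == b
```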
