Principle: Checkpoint Saving
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Training |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
The practice of persisting trained model weights and training state to disk for recovery, evaluation, or deployment.
Description
Checkpoint Saving ensures that training progress is not lost to hardware failures and enables model distribution after training. It involves saving the model weights (parameters) and, optionally, the full trainer state (optimizer, scheduler, and RNG states) to a directory. The `save_only_model` option (e.g. in Hugging Face's `TrainingArguments`) reduces disk usage by skipping the optimizer and scheduler state, which is appropriate for final checkpoints that will never be resumed.
Usage
Use this principle at the end of training to save the final model, and optionally at regular intervals during training for fault tolerance. Save the full state if you plan to resume training; save only the model weights for deployment.
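The two save modes can be illustrated with a minimal stdlib-only sketch (the helper names `save_checkpoint` and `load_checkpoint` and the `save_only_model` flag on them are hypothetical; real trainers expose analogous options):

```python
import os
import pickle
import random
import tempfile

def save_checkpoint(path, weights, optimizer_state=None, rng_state=None,
                    save_only_model=False):
    """Persist model weights, and optionally the full trainer state, to disk."""
    checkpoint = {"weights": weights}
    if not save_only_model:
        # Optimizer and RNG state are required to resume training exactly
        # where it stopped; they can be skipped for a final/deployment save.
        checkpoint["optimizer_state"] = optimizer_state
        checkpoint["rng_state"] = rng_state
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

tmp = tempfile.mkdtemp()
weights = {"layer1.weight": [0.1, -0.2], "layer1.bias": [0.0]}

# Full checkpoint, suitable for fault tolerance and resumption:
full_path = os.path.join(tmp, "ckpt_full.pkl")
save_checkpoint(full_path, weights,
                optimizer_state={"step": 500, "momentum": [0.01, 0.02]},
                rng_state=random.getstate())

# Model-only checkpoint, smaller on disk, suitable for deployment:
model_path = os.path.join(tmp, "ckpt_model.pkl")
save_checkpoint(model_path, weights, save_only_model=True)
```

After a crash, reloading `full_path` restores the weights along with the optimizer and RNG state, while `model_path` contains only the weights.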
Theoretical Basis
Checkpoint saving involves serializing:
- Model weights: The learned parameters (tensors) of the neural network.
- Trainer state (optional): Optimizer state, learning rate scheduler state, and random number generator states for exact reproducibility.
- Configuration: Model config and tokenizer files for loading.
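A checkpoint directory typically mirrors this three-part breakdown. Below is a hedged sketch of writing one (file names such as `model_weights.json` and `trainer_state.json` are illustrative conventions, not a specific library's layout; real frameworks store weights in binary formats rather than JSON):

```python
import json
import os
import tempfile

def write_checkpoint_dir(base_dir, step, weights, config, trainer_state=None):
    """Write one checkpoint directory: weights, config, optional trainer state."""
    ckpt_dir = os.path.join(base_dir, f"checkpoint-{step}")
    os.makedirs(ckpt_dir, exist_ok=True)
    # 1) Model weights: the learned parameters.
    with open(os.path.join(ckpt_dir, "model_weights.json"), "w") as f:
        json.dump(weights, f)
    # 2) Configuration: everything needed to rebuild the model at load time.
    with open(os.path.join(ckpt_dir, "config.json"), "w") as f:
        json.dump(config, f)
    # 3) Trainer state (optional): optimizer/scheduler/RNG info for resumption.
    if trainer_state is not None:
        with open(os.path.join(ckpt_dir, "trainer_state.json"), "w") as f:
            json.dump(trainer_state, f)
    return ckpt_dir

base = tempfile.mkdtemp()
ckpt = write_checkpoint_dir(
    base, step=1000,
    weights={"embed.weight": [[0.1, 0.2]]},
    config={"hidden_size": 2, "vocab_size": 1},
    trainer_state={"global_step": 1000, "learning_rate": 3e-5},
)
```

Omitting `trainer_state` here plays the same role as a model-only save: the resulting directory can be loaded for inference but not resumed from exactly.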