Principle: Checkpoint Saving
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Training |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
The practice of persisting trained model weights and training state to disk for recovery, evaluation, or deployment.
Description
Checkpoint Saving ensures that training progress is not lost to hardware failures and enables model distribution after training. It involves saving the model weights (parameters) and, optionally, the full trainer state (optimizer, scheduler, and RNG states) to a directory. The `save_only_model` option (e.g. in Hugging Face's `TrainingArguments`) reduces disk usage by skipping the optimizer and scheduler state, which is appropriate for final checkpoints that will never be resumed.
Usage
Use this principle at the end of training to save the final model, and optionally at regular intervals during training for fault tolerance. Save the full state if you plan to resume training; save only the model weights for deployment.
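The two save modes can be illustrated with a minimal stdlib-only sketch (the helper names `save_checkpoint` and `load_checkpoint` and the `save_only_model` flag on them are hypothetical; real trainers expose analogous options):

```python
import os
import pickle
import random
import tempfile

def save_checkpoint(path, weights, optimizer_state=None, rng_state=None,
                    save_only_model=False):
    """Persist model weights, and optionally the full trainer state, to disk."""
    checkpoint = {"weights": weights}
    if not save_only_model:
        # Optimizer and RNG state are required to resume training exactly
        # where it stopped; they can be skipped for a final/deployment save.
        checkpoint["optimizer_state"] = optimizer_state
        checkpoint["rng_state"] = rng_state
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

tmp = tempfile.mkdtemp()
weights = {"layer1.weight": [0.1, -0.2], "layer1.bias": [0.0]}

# Full checkpoint, suitable for fault tolerance and resumption:
full_path = os.path.join(tmp, "ckpt_full.pkl")
save_checkpoint(full_path, weights,
                optimizer_state={"step": 500, "momentum": [0.01, 0.02]},
                rng_state=random.getstate())

# Model-only checkpoint, smaller on disk, suitable for deployment:
model_path = os.path.join(tmp, "ckpt_model.pkl")
save_checkpoint(model_path, weights, save_only_model=True)
```

After a crash, reloading `full_path` restores the weights along with the optimizer and RNG state, while `model_path` contains only the weights.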
Theoretical Basis
Checkpoint saving involves serializing:
- Model weights: The learned parameters (tensors) of the neural network.
- Trainer state (optional): Optimizer state, learning rate scheduler state, and random number generator states for exact reproducibility.
- Configuration: Model config and tokenizer files for loading.
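A checkpoint directory typically mirrors this three-part breakdown. Below is a hedged sketch of writing one (file names such as `model_weights.json` and `trainer_state.json` are illustrative conventions, not a specific library's layout; real frameworks store weights in binary formats rather than JSON):

```python
import json
import os
import tempfile

def write_checkpoint_dir(base_dir, step, weights, config, trainer_state=None):
    """Write one checkpoint directory: weights, config, optional trainer state."""
    ckpt_dir = os.path.join(base_dir, f"checkpoint-{step}")
    os.makedirs(ckpt_dir, exist_ok=True)
    # 1) Model weights: the learned parameters.
    with open(os.path.join(ckpt_dir, "model_weights.json"), "w") as f:
        json.dump(weights, f)
    # 2) Configuration: everything needed to rebuild the model at load time.
    with open(os.path.join(ckpt_dir, "config.json"), "w") as f:
        json.dump(config, f)
    # 3) Trainer state (optional): optimizer/scheduler/RNG info for resumption.
    if trainer_state is not None:
        with open(os.path.join(ckpt_dir, "trainer_state.json"), "w") as f:
            json.dump(trainer_state, f)
    return ckpt_dir

base = tempfile.mkdtemp()
ckpt = write_checkpoint_dir(
    base, step=1000,
    weights={"embed.weight": [[0.1, 0.2]]},
    config={"hidden_size": 2, "vocab_size": 1},
    trainer_state={"global_step": 1000, "learning_rate": 3e-5},
)
```

Omitting `trainer_state` here plays the same role as a model-only save: the resulting directory can be loaded for inference but not resumed from exactly.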