Heuristic:PeterL1n BackgroundMattingV2 Checkpoint Interval Tuning
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Optimization |
| Last Updated | 2026-02-09 02:00 GMT |
Overview
Use a checkpoint interval of 5000 steps for base training and 2000 steps for refinement training, with the validation interval set to the same value in each case.
Description
The training scripts save model checkpoints and run validation at configurable step intervals. The base training uses longer intervals (5000 steps) because it trains for more epochs on smaller images. The refinement training uses shorter intervals (2000 steps) because it trains for fewer epochs on much larger images with fewer iterations per epoch.
Usage
Use this heuristic when configuring training checkpointing and validation frequency. These defaults balance disk usage, training monitoring resolution, and validation overhead.
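To make the disk-usage side of that balance concrete, the sketch below counts how many checkpoint files a run would write. It is a rough estimate under stated assumptions, not code from the repository: the function name `count_checkpoints` and the run sizes are hypothetical, and it assumes a global step counter that keeps counting across epochs, with end-of-epoch saves written as separate files.

```python
def count_checkpoints(steps_per_epoch, epochs, interval):
    """Upper bound on checkpoint files written over a run.

    Assumes one save every `interval` global steps plus one save at the
    end of each epoch (hypothetical helper, not part of the scripts).
    """
    total_steps = steps_per_epoch * epochs
    interval_saves = total_steps // interval
    epoch_saves = epochs
    return interval_saves + epoch_saves


# Hypothetical run sizes, chosen only to illustrate the arithmetic.
base_files = count_checkpoints(steps_per_epoch=12_000, epochs=10, interval=5000)
refine_files = count_checkpoints(steps_per_epoch=12_000, epochs=1, interval=2000)
print(base_files, refine_files)
```

Multiplying the file count by the size of one checkpoint gives an approximate disk budget for the chosen interval.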
The Insight (Rule of Thumb)
- Base training (`train_base.py`):
  - Checkpoint interval: 5000 steps
  - Validation interval: 5000 steps
  - Train image log interval: 2000 steps
  - Loss log interval: 10 steps
- Refine training (`train_refine.py`):
  - Checkpoint interval: 2000 steps
  - Validation interval: 2000 steps
  - Train image log interval: 1000 steps
  - Loss log interval: 10 steps
- End-of-epoch: Both scripts save an additional checkpoint at the end of each epoch.
- Trade-off: More frequent checkpointing provides finer-grained recovery points but consumes more disk space. The refine model checkpoints are larger due to the refiner module.
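The interval rules in this list can be sketched as a simple modulo check on a global step counter, with an extra save at each epoch boundary. This is a minimal illustration, assuming a 1-based step counter that is not reset between epochs; the helper names `events_at` and `simulate_saves` and the dictionary keys are hypothetical labels, not identifiers from the training scripts.

```python
# Intervals from the list above; keys are illustrative labels, not script flags.
BASE = {"loss_log": 10, "train_images": 2000, "validation": 5000, "checkpoint": 5000}
REFINE = {"loss_log": 10, "train_images": 1000, "validation": 2000, "checkpoint": 2000}


def events_at(step, intervals):
    """Return the periodic actions that are due at a 1-based global step."""
    return [name for name, every in intervals.items() if step % every == 0]


def simulate_saves(steps_per_epoch, epochs, intervals):
    """Collect (step, reason) pairs for every checkpoint written in a run."""
    saves = []
    step = 0
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            step += 1
            if "checkpoint" in events_at(step, intervals):
                saves.append((step, "interval"))
        # Extra save at the epoch boundary, regardless of the interval.
        saves.append((step, "epoch_end"))
    return saves


print(events_at(2000, BASE))
print(events_at(2000, REFINE))
print(simulate_saves(steps_per_epoch=4500, epochs=1, intervals=REFINE))
```

With a hypothetical 4500-step epoch, the refine settings save at steps 2000 and 4000 and once more at the end of the epoch, so an interruption loses at most 2000 steps of work.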
Reasoning
Refine training operates at 4x the linear resolution of base training (2048 vs 512 pixels per side), with a smaller batch size (4 vs 8) and fewer epochs (typically 1). Each sample therefore contains far more pixels, so a given number of steps represents more training work, which justifies the shorter checkpoint interval. The validation set is kept small (50 samples) so that validation interrupts training only briefly. Checkpoints are saved both at fixed step intervals and at epoch boundaries, so at most one interval of progress is lost if a run is interrupted.
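The resolution and batch-size figures quoted above can be turned into a quick per-step comparison. The calculation below is only a back-of-the-envelope sketch: it assumes square inputs and that every pixel of every sample is processed in each step, which may not match how the scripts actually crop or sample patches.

```python
# Figures quoted in the reasoning above.
base_batch, base_side = 8, 512
refine_batch, refine_side = 4, 2048

# Pixels processed per step, assuming square images (an assumption).
base_pixels_per_step = base_batch * base_side ** 2
refine_pixels_per_step = refine_batch * refine_side ** 2

linear_ratio = refine_side / base_side                 # side length ratio
per_sample_ratio = refine_side ** 2 / base_side ** 2   # pixel count per image
per_step_ratio = refine_pixels_per_step / base_pixels_per_step

print(linear_ratio, per_sample_ratio, per_step_ratio)
```

Under these assumptions a refine sample has 16x the pixels of a base sample, and even with half the batch size a refine step covers 8x as many pixels as a base step.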
Code evidence from `train_base.py:62-63`:
    parser.add_argument('--log-valid-interval', type=int, default=5000)
    parser.add_argument('--checkpoint-interval', type=int, default=5000)
Code evidence from `train_refine.py:67-69`:
    parser.add_argument('--log-train-images-interval', type=int, default=1000)
    parser.add_argument('--log-valid-interval', type=int, default=2000)
    parser.add_argument('--checkpoint-interval', type=int, default=2000)
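Because these values are ordinary `argparse` defaults, they can be overridden on the command line without editing the scripts. The sketch below rebuilds only the three refine flags shown above in a standalone parser to demonstrate this; the wrapper function `build_refine_interval_parser` is hypothetical, and the real scripts define further arguments that are omitted here.

```python
import argparse


def build_refine_interval_parser():
    """Standalone parser holding only the interval flags quoted above."""
    parser = argparse.ArgumentParser(description="Interval flags only (sketch)")
    parser.add_argument('--log-train-images-interval', type=int, default=1000)
    parser.add_argument('--log-valid-interval', type=int, default=2000)
    parser.add_argument('--checkpoint-interval', type=int, default=2000)
    return parser


# No flags given: the defaults apply.
defaults = build_refine_interval_parser().parse_args([])

# Tighter intervals, e.g. for a short debugging run (values are illustrative).
custom = build_refine_interval_parser().parse_args(
    ['--checkpoint-interval', '500', '--log-valid-interval', '500']
)

print(defaults.checkpoint_interval, custom.checkpoint_interval)
```

Keeping the validation and checkpoint intervals equal when overriding them preserves the pairing used by the defaults, so each saved checkpoint has a validation result from the same step.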