Heuristic:PeterL1n BackgroundMattingV2 Checkpoint Interval Tuning

From Leeroopedia



Knowledge Sources
Domains Deep_Learning, Optimization
Last Updated 2026-02-09 02:00 GMT

Overview

Use a checkpoint interval of 5000 steps for base training and 2000 steps for refinement training, with validation intervals matched accordingly.

Description

The training scripts save model checkpoints and run validation at configurable step intervals. The base training uses longer intervals (5000 steps) because it trains for more epochs on smaller images. The refinement training uses shorter intervals (2000 steps) because it trains for fewer epochs on much larger images with fewer iterations per epoch.

Usage

Use this heuristic when configuring training checkpointing and validation frequency. These defaults balance disk usage, training monitoring resolution, and validation overhead.

The Insight (Rule of Thumb)

  • Base training (`train_base.py`):
    • Checkpoint interval: 5000 steps
    • Validation interval: 5000 steps
    • Train image log interval: 2000 steps
    • Loss log interval: 10 steps
  • Refine training (`train_refine.py`):
    • Checkpoint interval: 2000 steps
    • Validation interval: 2000 steps
    • Train image log interval: 1000 steps
    • Loss log interval: 10 steps
  • End-of-epoch: Both scripts save an additional checkpoint at the end of each epoch.
  • Trade-off: More frequent checkpointing provides finer-grained recovery points but consumes more disk space. The refine model checkpoints are larger due to the refiner module.
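The dual saving scheme above (fixed intervals plus an extra save at each epoch end) can be sketched in plain Python. This is an illustrative scheduling model with a simple global step counter, not the repository's actual training loop; the step counts in the example are placeholders.

```python
def checkpoint_steps(steps_per_epoch, num_epochs, interval):
    """Global steps at which a checkpoint fires: every `interval` steps,
    plus an end-of-epoch save (skipped if it coincides with an interval save)."""
    saves, step = [], 0
    for _ in range(num_epochs):
        for _ in range(steps_per_epoch):
            step += 1
            if step % interval == 0:
                saves.append(step)
        if not saves or saves[-1] != step:
            saves.append(step)  # end-of-epoch checkpoint
    return saves

# Base-training default interval of 5000 with a hypothetical 7000 steps/epoch:
print(checkpoint_steps(steps_per_epoch=7000, num_epochs=2, interval=5000))
# → [5000, 7000, 10000, 14000]
```

Note the deduplication guard: when an epoch boundary lands exactly on an interval multiple, only one checkpoint is written.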

Reasoning

The refine training operates at 4x higher resolution (2048 vs 512) with a smaller batch size (4 vs 8) and fewer epochs (typically 1). Each step processes significantly more pixels per sample, so fewer steps represent more training progress, justifying shorter checkpoint intervals. The validation set is kept small (50 samples) to minimize interruption. Checkpoints are saved both at fixed intervals and at epoch boundaries so that no progress is lost.
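The disk-space side of the trade-off can be made concrete with back-of-the-envelope arithmetic. The run lengths and per-checkpoint sizes below are illustrative placeholders, not measured values from the repository; only the interval defaults come from the page.

```python
def checkpoint_disk_mb(total_steps, num_epochs, interval, ckpt_mb):
    """Approximate disk use: one save per full interval plus one per epoch end."""
    interval_saves = total_steps // interval
    return (interval_saves + num_epochs) * ckpt_mb

# Hypothetical run lengths and checkpoint sizes (placeholders):
base_mb = checkpoint_disk_mb(total_steps=100_000, num_epochs=20,
                             interval=5000, ckpt_mb=100)
refine_mb = checkpoint_disk_mb(total_steps=10_000, num_epochs=1,
                               interval=2000, ckpt_mb=120)
print(base_mb, refine_mb)  # → 4000 720
```

Halving the interval roughly doubles the interval-save term, which is why the shorter refine interval is affordable only because the refine run is much shorter overall.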

Code evidence from `train_base.py:62-63`:

parser.add_argument('--log-valid-interval', type=int, default=5000)
parser.add_argument('--checkpoint-interval', type=int, default=5000)

Code evidence from `train_refine.py:67-69`:

parser.add_argument('--log-train-images-interval', type=int, default=1000)
parser.add_argument('--log-valid-interval', type=int, default=2000)
parser.add_argument('--checkpoint-interval', type=int, default=2000)
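Because the intervals are exposed as argparse defaults, they can be overridden per run from the command line. Below is a minimal standalone parser mirroring just the three refine-training flags shown above; the real script defines many more arguments.

```python
import argparse

# Minimal parser reproducing only the interval flags from train_refine.py.
parser = argparse.ArgumentParser()
parser.add_argument('--log-train-images-interval', type=int, default=1000)
parser.add_argument('--log-valid-interval', type=int, default=2000)
parser.add_argument('--checkpoint-interval', type=int, default=2000)

# e.g. checkpoint twice as often for a short debugging run:
args = parser.parse_args(['--checkpoint-interval', '1000'])
print(args.checkpoint_interval, args.log_valid_interval)  # → 1000 2000
```

argparse converts the dashed flag names to underscored attributes, so `--checkpoint-interval` is read as `args.checkpoint_interval`.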
