Heuristic:PeterL1n BackgroundMattingV2 Training Batch Size And Resolution

Knowledge Sources	BackgroundMattingV2
Domains	Deep_Learning, Optimization
Last Updated	2026-02-09 02:00 GMT

Overview

Use batch size 8 at 512x512 for base training and batch size 4 at 1024-2048 resolution for refinement training, with dimensions divisible by 4.

Description

The two training stages (base and refine) operate at different resolutions and need different batch sizes to fit within GPU memory. The base model trains at 512x512 with batch size 8. The refine model trains on much larger random crops (1024-2048 per side) with batch size 4, and that batch is split across multiple GPUs. Input dimensions must be divisible by 4 due to the 4x4 patch-based refinement architecture.

Usage

Use this heuristic when configuring training for either MattingBase or MattingRefine. These defaults represent the authors' tested configuration and should be used as starting points. Adjust batch size based on available VRAM.

The Insight (Rule of Thumb)

Base training:
- Resolution: 512x512 (fixed crop from random 256-512 range)
- Batch size: 8 (default)
- Workers: 16 (default)
Refine training:
- Resolution: 1024-2048 (random crop; each side is drawn from 1024-2047 and rounded down to a multiple of 4, so the largest crop side is 2044)
- Batch size: 4 (default, split across GPUs)
- Workers: 16 (split across GPUs)
- Multi-GPU: Required; batch size must be divisible by GPU count
Hard constraint: Input width and height must be divisible by 4.
Trade-off: Larger batch size improves gradient stability but requires more VRAM. The refine stage trains on crops 2x to 4x larger per side than the 512x512 base stage (roughly 4x to 16x as many pixels per sample), so it needs a smaller batch.
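The defaults and the hard constraint above can be collected into a small pre-flight check. This is a minimal sketch, not code from the repository: `StageDefaults`, `BASE`, `REFINE`, and `check_input_size` are hypothetical names introduced here for illustration, and the values are simply the ones listed in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageDefaults:
    """Hypothetical container for the defaults listed above (not part of the repo)."""
    name: str
    batch_size: int
    num_workers: int
    min_side: int  # smallest training side length, in pixels
    max_side: int  # largest training side length, in pixels


# Values copied from the rule of thumb; adjust batch_size to the available VRAM.
BASE = StageDefaults("base", batch_size=8, num_workers=16, min_side=512, max_side=512)
REFINE = StageDefaults("refine", batch_size=4, num_workers=16, min_side=1024, max_side=2048)


def check_input_size(width: int, height: int) -> None:
    """Raise ValueError unless both dimensions are multiples of 4."""
    if width % 4 != 0 or height % 4 != 0:
        raise ValueError(
            f"width and height must be divisible by 4, got {width}x{height}"
        )


# Both stage defaults satisfy the divisibility constraint at their extremes.
for stage in (BASE, REFINE):
    check_input_size(stage.min_side, stage.min_side)
    check_input_size(stage.max_side, stage.max_side)
```

Running a check like this before launching a job surfaces a bad resolution immediately instead of at the model's own assertion during the first forward pass.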

Reasoning

The base model processes 6-channel input (src + bgr concatenated) at 512x512 through a full encoder-decoder, which fits comfortably in GPU memory at batch size 8. The refine model processes full-resolution patches at up to 2048x2048 and must fit both the base forward pass (at backbone_scale) and the refiner's patch operations, requiring reduced batch size. The divisible-by-4 constraint comes from the 4x4 patch grid used by the refinement network for selecting and replacing regions.
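To make the memory argument concrete, the following sketch counts input tensor elements per batch for each stage. It assumes the refine stage also receives a 6-channel src + bgr input, and it counts inputs only; activation memory depends on the architecture and does not scale exactly with this number. The helper name `input_elements` is hypothetical.

```python
def input_elements(batch_size: int, channels: int, height: int, width: int) -> int:
    """Number of scalar values in one input batch of shape (B, C, H, W)."""
    return batch_size * channels * height * width


# Base stage: batch 8, 6 channels (src + bgr), 512x512.
base = input_elements(8, 6, 512, 512)

# Refine stage: batch 4, assumed 6 channels, smallest and largest nominal crops.
refine_min = input_elements(4, 6, 1024, 1024)
refine_max = input_elements(4, 6, 2048, 2048)

print(f"base:       {base:>12,d}")
print(f"refine min: {refine_min:>12,d} ({refine_min // base}x base)")
print(f"refine max: {refine_max:>12,d} ({refine_max // base}x base)")
```

Even with the batch size halved, a refine batch carries 2x to 8x as many input values as a base batch, which is consistent with the refine stage being the memory-bound one.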

Code evidence — Base training defaults from `train_base.py:53,78`:

parser.add_argument('--batch-size', type=int, default=8)
A.PairRandomAffineAndResize((512, 512), ...)

Code evidence — Refine training defaults from `train_refine.py:60,93`:

parser.add_argument('--batch-size', type=int, default=4)
A.PairRandomAffineAndResize((2048, 2048), ...)

Code evidence — Divisible-by-4 constraint from `model/model.py:163-164`:

assert src.size(2) // 4 * 4 == src.size(2) and src.size(3) // 4 * 4 == src.size(3), \
    'src and bgr must have width and height that are divisible by 4'

Code evidence — Random crop enforces divisible-by-4 from `train_refine.py:267-268`:

W_tgt = random.choice(range(1024, 2048)) // 4 * 4
H_tgt = random.choice(range(1024, 2048)) // 4 * 4
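The effect of the `// 4 * 4` expression can be checked in isolation. The sketch below reuses the same sampling expression outside the training script; `sample_crop_side` is a hypothetical wrapper, and the seeded `random.Random` instance is only there to make the demonstration repeatable.

```python
import random


def sample_crop_side(rng: random.Random, low: int = 1024, high: int = 2048) -> int:
    """Draw a side length from [low, high) and round it down to a multiple of 4."""
    return rng.choice(range(low, high)) // 4 * 4


# Enumerate every value the expression can produce.
possible = sorted({side // 4 * 4 for side in range(1024, 2048)})

# Draw a few thousand samples to confirm they all satisfy the constraint.
rng = random.Random(0)
sides = [sample_crop_side(rng) for _ in range(10_000)]

print(possible[0], possible[-1], len(possible))
print(all(side % 4 == 0 for side in sides))
```

Because `range(1024, 2048)` excludes its upper bound and the result is rounded down, the largest crop side this produces is 2044 rather than 2048.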

Code evidence — Multi-GPU batch splitting from `train_refine.py:74-75`:

distributed_num_gpus = torch.cuda.device_count()
assert args.batch_size % distributed_num_gpus == 0
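The divisibility assertion exists so that the global batch can be divided evenly among processes. As a rough sketch, assuming each GPU process receives an equal integer share of both the batch and the data-loader workers, the per-process values can be computed as follows; `per_process_share` is a hypothetical helper, not a function from the repository.

```python
from typing import Tuple


def per_process_share(batch_size: int, num_workers: int, num_gpus: int) -> Tuple[int, int]:
    """Return (batch size, worker count) for each GPU process.

    Mirrors the assertion above: the global batch size must divide evenly
    by the GPU count. Workers are split with integer division (an assumption).
    """
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    if batch_size % num_gpus != 0:
        raise ValueError(
            f"batch size {batch_size} is not divisible by {num_gpus} GPUs"
        )
    return batch_size // num_gpus, num_workers // num_gpus


# Refine defaults (batch size 4, 16 workers) on 2 and 4 GPUs.
print(per_process_share(4, 16, 2))
print(per_process_share(4, 16, 4))
```

With the default batch size of 4, only 1, 2, or 4 GPUs pass the check; on a 3-GPU or 8-GPU machine the batch size has to be changed to a multiple of the GPU count.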
