Workflow: PeterL1n/BackgroundMattingV2 Training Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Image_Matting, Training |
| Last Updated | 2026-02-09 02:30 GMT |
Overview
End-to-end two-stage training process for background matting models, progressing from coarse base network training to high-resolution selective refinement training.
Description
This workflow implements the complete training pipeline for the BackgroundMattingV2 architecture. The training follows a two-stage approach as described in the CVPR 2021 paper:
Stage 1 (Base): Train the coarse matting network (MattingBase) at reduced resolution (512x512) using compositing-based data generation. The base model learns to predict alpha mattes, foreground colors, and error maps from source-background image pairs. Training uses pretrained DeepLabV3 encoder weights for initialization.
Stage 2 (Refine): Initialize from the trained base model weights and train the full MattingRefine network end-to-end at high resolution (2048x2048). The refinement stage adds a selective patch-based upsampling module that focuses computation on error-prone regions. This stage uses multi-GPU DistributedDataParallel training.
Both stages employ mixed-precision training with automatic mixed precision (AMP), extensive online data augmentation (shadow injection, noise, color jitter, affine transforms), and compositing-based training data generation.
Usage
Execute this workflow when you have foreground-alpha matting datasets (e.g., VideoMatte240K, PhotoMatte13K, Adobe Matting, Distinctions-646) plus a collection of background images, and want to train a background matting model from scratch or fine-tune from pretrained DeepLabV3 weights. Requires CUDA-capable GPUs; the refinement stage benefits from multiple GPUs.
Execution Steps
Step 1: Dataset preparation
Configure dataset directory paths in the centralized path configuration file. Each matting dataset requires separate directories for foreground RGB images and alpha matte images, organized into train and validation splits. A separate backgrounds dataset is also required. All datasets follow a nested directory structure where foreground and alpha directories must have matching file structures.
Key considerations:
- Foreground images must be RGB (3 channels)
- Alpha mattes must be single-channel grayscale
- Foreground and alpha directory structures must mirror each other exactly
- Background images are shared across all matting datasets
- Supported datasets include VideoMatte240K, PhotoMatte13K, Distinctions-646, and Adobe Matting
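Because the foreground and alpha trees must mirror each other exactly, a quick sanity check before training can catch mismatches early. A minimal sketch (the function name and error handling are illustrative, not part of the repository):

```python
import os

def check_matting_dataset(fgr_dir, pha_dir):
    """Verify that foreground and alpha directories mirror each other.

    fgr_dir / pha_dir are paths to the foreground RGB and alpha-matte
    image trees for one split (e.g. train or valid).
    """
    def relative_files(base):
        return sorted(
            os.path.relpath(os.path.join(root, f), base)
            for root, _, files in os.walk(base)
            for f in files
        )

    fgr_files = relative_files(fgr_dir)
    pha_files = relative_files(pha_dir)
    # Symmetric difference: files present in one tree but not the other
    mismatched = set(fgr_files) ^ set(pha_files)
    if mismatched:
        raise ValueError(f"Mismatched fgr/pha files: {sorted(mismatched)[:5]}")
    return len(fgr_files)
```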
Step 2: Pretrained weight acquisition
Download pretrained DeepLabV3 encoder weights for the chosen backbone architecture (ResNet50, ResNet101, or MobileNetV2). These weights provide a strong initialization for the encoder-ASPP portion of the network, significantly improving convergence. The weights are converted from DeepLabV3Plus format to match the model's internal structure.
Key considerations:
- Weights are available from the VainF DeepLabV3Plus-Pytorch repository
- The weight conversion handles naming differences between DeepLabV3Plus and the matting model
- MobileNetV2 backbone requires additional structural remapping during weight loading
- This step is optional if resuming from a previous checkpoint
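The key remapping described above can be sketched as a pure dictionary transform. The `backbone.`/`aspp.` prefixes and the `model_state` nesting below are assumptions about the checkpoint layout; the repository's actual conversion logic may differ, particularly the extra remapping needed for MobileNetV2:

```python
def convert_deeplab_state_dict(state):
    """Extract encoder/ASPP weights from a DeepLabV3Plus checkpoint dict.

    Sketch only: assumes the checkpoint may nest weights under
    'model_state' and may carry DataParallel 'module.' prefixes.
    """
    state = state.get('model_state', state)  # unwrap nested checkpoints
    remapped = {}
    for key, value in state.items():
        # Strip DataParallel prefix if present
        if key.startswith('module.'):
            key = key[len('module.'):]
        # Keep only the tensors the matting model's encoder/ASPP can use
        if key.startswith(('backbone.', 'aspp.')):
            remapped[key] = value
    return remapped
```

The result can then be loaded with `model.load_state_dict(converted, strict=False)` so the decoder (and later the refiner) stay randomly initialized.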
Step 3: Base model training
Train the MattingBase network on composited training data at 512x512 resolution. The training loop generates source images on the fly by compositing foreground-alpha pairs onto random backgrounds with extensive augmentation (random shadows, Gaussian noise, color jitter, affine transforms on both foreground and background). The model is optimized with Adam using differential learning rates: 1e-4 for the backbone and 5e-4 for the ASPP and decoder modules.
Loss function components:
- L1 loss on alpha matte prediction
- Sobel edge loss on alpha for boundary sharpness
- Masked L1 loss on foreground prediction (only at non-transparent pixels)
- MSE loss on error map prediction (self-supervised from alpha residuals)
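A hedged sketch of how these four terms might combine. Equal weighting and the exact Sobel formulation are assumptions; the repository's loss may weight or formulate the terms differently:

```python
import torch
import torch.nn.functional as F

def sobel_edges(x):
    """Approximate image gradients of a (B,1,H,W) tensor with Sobel kernels."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    kx = kx.view(1, 1, 3, 3).to(x.device, x.dtype)
    ky = kx.transpose(2, 3)
    return F.conv2d(x, kx, padding=1), F.conv2d(x, ky, padding=1)

def base_loss(pred_pha, pred_fgr, pred_err, true_pha, true_fgr):
    """Sketch of the Stage-1 loss: alpha L1 + Sobel edges + masked fgr L1
    + error-map MSE regressing the alpha residual (self-supervised)."""
    # L1 on the alpha matte
    loss = F.l1_loss(pred_pha, true_pha)
    # Sobel edge loss on alpha for boundary sharpness
    gx_p, gy_p = sobel_edges(pred_pha)
    gx_t, gy_t = sobel_edges(true_pha)
    loss = loss + F.l1_loss(gx_p, gx_t) + F.l1_loss(gy_p, gy_t)
    # Foreground L1 only at non-transparent pixels
    mask = (true_pha > 0).to(pred_fgr.dtype)
    loss = loss + F.l1_loss(pred_fgr * mask, true_fgr * mask)
    # Error map regresses |pred_pha - true_pha| (target detached)
    loss = loss + F.mse_loss(pred_err, (pred_pha - true_pha).abs().detach())
    return loss
```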
What happens:
- Mixed-precision forward pass through encoder-ASPP-decoder
- Online compositing: src = fgr * pha + bgr * (1 - pha)
- Shadow augmentation on 30% of samples
- Gaussian noise on 40% of samples
- Color jitter on 80% of background images
- Periodic validation, TensorBoard logging, and checkpoint saving
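The compositing and stochastic augmentation steps above can be sketched as follows. Only the compositing equation and the probabilities come from the text; the shadow and noise implementations here are simplified stand-ins for the repository's augmentations:

```python
import torch

def composite_with_augmentation(fgr, pha, bgr, shadow_p=0.3, noise_p=0.4):
    """Composite a training source image on the fly.

    fgr: (B,3,H,W) foreground, pha: (B,1,H,W) alpha, bgr: (B,3,H,W) background,
    all in [0, 1]. shadow_p / noise_p follow the probabilities in the text.
    """
    if torch.rand(1).item() < shadow_p:
        # Simplified shadow stand-in: darken the background under a soft mask
        mask = torch.rand_like(pha)
        bgr = bgr * (1 - 0.3 * mask)
    # Alpha compositing: src = fgr * pha + bgr * (1 - pha)
    src = fgr * pha + bgr * (1 - pha)
    if torch.rand(1).item() < noise_p:
        src = (src + 0.02 * torch.randn_like(src)).clamp(0, 1)
    return src
```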
Step 4: Refinement model training
Initialize MattingRefine with the trained base model weights and train the full network end-to-end at high resolution (2048x2048) using multi-GPU DistributedDataParallel. The refiner module learns to selectively upsample patches at error-prone regions identified by the base network's error map. Training uses the "sampling" refinement mode, which selects a fixed number of pixels to refine per forward pass.
Key considerations:
- Uses NCCL backend for distributed communication
- SyncBatchNorm ensures consistent batch statistics across GPUs
- Lower learning rates than base training: backbone/ASPP at 5e-5, decoder at 1e-4, refiner at 3e-4
- Loss operates at both coarse and full resolution with Sobel edge terms at both scales
- Batch size is split evenly across available GPUs
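The differential learning rates and the SyncBatchNorm-before-DDP wrapping order can be sketched as below. The function names are illustrative, and the sketch assumes the model exposes `backbone`/`aspp`/`decoder`/`refiner` submodules as MattingRefine does:

```python
import torch

def make_refine_optimizer(model):
    """Adam with the Stage-2 differential learning rates from the text."""
    return torch.optim.Adam([
        {'params': model.backbone.parameters(), 'lr': 5e-5},
        {'params': model.aspp.parameters(),     'lr': 5e-5},
        {'params': model.decoder.parameters(),  'lr': 1e-4},
        {'params': model.refiner.parameters(),  'lr': 3e-4},
    ])

def wrap_distributed(model, rank):
    """Convert BatchNorm to SyncBatchNorm, then wrap in DDP on this rank.

    Assumes torch.distributed.init_process_group('nccl', ...) has already
    run in this worker process.
    """
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return torch.nn.parallel.DistributedDataParallel(
        model.cuda(rank), device_ids=[rank])
```

With this setup, a global batch size of B yields B // world_size samples per GPU, matching the even split noted above.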
Step 5: Checkpoint evaluation and selection
Review training logs in TensorBoard to monitor convergence of training and validation losses. Select the best checkpoint based on validation loss. The training scripts save checkpoints at regular intervals and at the end of each epoch, producing .pth state dict files suitable for inference or export.
What happens:
- TensorBoard logs contain scalar loss curves and visual predictions (alpha, foreground, composite, error maps)
- Checkpoints are saved to the checkpoint/{model-name}/ directory
- Each checkpoint contains the model's state_dict (weights only, not architecture)
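Because checkpoints contain weights only, inference code must first reconstruct the architecture with the same hyperparameters used during training (backbone, refine mode, and so on) before restoring. A minimal loading sketch (the helper name is illustrative):

```python
import torch

def load_checkpoint_for_inference(model, checkpoint_path, device='cpu'):
    """Restore a saved state_dict into a freshly constructed model.

    model must be built with the same architecture hyperparameters that
    produced the checkpoint, since only weights are stored.
    """
    state = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(state)
    model.to(device).eval()  # disable dropout/BN updates for inference
    return model
```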