Workflow: PeterL1n/BackgroundMattingV2 Training Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Image_Matting, Training |
| Last Updated | 2026-02-09 02:30 GMT |
Overview
End-to-end two-stage training process for background matting models, progressing from coarse base network training to high-resolution selective refinement training.
Description
This workflow implements the complete training pipeline for the BackgroundMattingV2 architecture. The training follows a two-stage approach as described in the CVPR 2021 paper:
Stage 1 (Base): Train the coarse matting network (MattingBase) at reduced resolution (512x512) using compositing-based data generation. The base model learns to predict alpha mattes, foreground colors, and error maps from source-background image pairs. Training uses pretrained DeepLabV3 encoder weights for initialization.
Stage 2 (Refine): Initialize from the trained base model weights and train the full MattingRefine network end-to-end at high resolution (2048x2048). The refinement stage adds a selective patch-based upsampling module that focuses computation on error-prone regions. This stage uses multi-GPU DistributedDataParallel training.
Both stages employ mixed-precision training with automatic mixed precision (AMP), extensive online data augmentation (shadow injection, noise, color jitter, affine transforms), and compositing-based training data generation.
Usage
Execute this workflow when you have foreground-alpha matting datasets (e.g., VideoMatte240K, PhotoMatte13K, Adobe Matting, Distinctions-646) plus a collection of background images, and want to train a background matting model from scratch or fine-tune from pretrained DeepLabV3 weights. Requires CUDA-capable GPUs; the refinement stage benefits from multiple GPUs.
Execution Steps
Step 1: Dataset preparation
Configure dataset directory paths in the centralized path configuration file. Each matting dataset requires separate directories for foreground RGB images and alpha matte images, organized into train and validation splits. A separate backgrounds dataset is also required. All datasets follow a nested directory structure where foreground and alpha directories must have matching file structures.
Key considerations:
- Foreground images must be RGB (3 channels)
- Alpha mattes must be single-channel grayscale
- Foreground and alpha directory structures must mirror each other exactly
- Background images are shared across all matting datasets
- Supported datasets include VideoMatte240K, PhotoMatte13K, Distinctions-646, and Adobe Matting
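Because the foreground and alpha trees must mirror each other exactly, a quick sanity check before training can catch mismatches early. A minimal sketch (the function name and error handling are illustrative, not part of the repository):

```python
import os

def check_matting_dataset(fgr_dir, pha_dir):
    """Verify that foreground and alpha directories mirror each other.

    fgr_dir / pha_dir are paths to the foreground RGB and alpha-matte
    image trees for one split (e.g. train or valid).
    """
    def relative_files(base):
        return sorted(
            os.path.relpath(os.path.join(root, f), base)
            for root, _, files in os.walk(base)
            for f in files
        )

    fgr_files = relative_files(fgr_dir)
    pha_files = relative_files(pha_dir)
    # Symmetric difference: files present in one tree but not the other
    mismatched = set(fgr_files) ^ set(pha_files)
    if mismatched:
        raise ValueError(f"Mismatched fgr/pha files: {sorted(mismatched)[:5]}")
    return len(fgr_files)
```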
Step 2: Pretrained weight acquisition
Download pretrained DeepLabV3 encoder weights for the chosen backbone architecture (ResNet50, ResNet101, or MobileNetV2). These weights provide a strong initialization for the encoder-ASPP portion of the network, significantly improving convergence. The weights are converted from DeepLabV3Plus format to match the model's internal structure.
Key considerations:
- Weights are available from the VainF DeepLabV3Plus-Pytorch repository
- The weight conversion handles naming differences between DeepLabV3Plus and the matting model
- MobileNetV2 backbone requires additional structural remapping during weight loading
- This step is optional if resuming from a previous checkpoint
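The key remapping described above can be sketched as a pure dictionary transform. The `backbone.`/`aspp.` prefixes and the `model_state` nesting below are assumptions about the checkpoint layout; the repository's actual conversion logic may differ, particularly the extra remapping needed for MobileNetV2:

```python
def convert_deeplab_state_dict(state):
    """Extract encoder/ASPP weights from a DeepLabV3Plus checkpoint dict.

    Sketch only: assumes the checkpoint may nest weights under
    'model_state' and may carry DataParallel 'module.' prefixes.
    """
    state = state.get('model_state', state)  # unwrap nested checkpoints
    remapped = {}
    for key, value in state.items():
        # Strip DataParallel prefix if present
        if key.startswith('module.'):
            key = key[len('module.'):]
        # Keep only the tensors the matting model's encoder/ASPP can use
        if key.startswith(('backbone.', 'aspp.')):
            remapped[key] = value
    return remapped
```

The result can then be loaded with `model.load_state_dict(converted, strict=False)` so the decoder (and later the refiner) stay randomly initialized.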
Step 3: Base model training
Train the MattingBase network on composited training data at 512x512 resolution. The training loop generates source images on the fly by compositing foreground-alpha pairs onto random backgrounds with extensive augmentation (random shadows, Gaussian noise, color jitter, affine transforms on both foreground and background). The model is optimized with Adam using differential learning rates: 1e-4 for the backbone and 5e-4 for the ASPP and decoder modules.
Loss function components:
- L1 loss on alpha matte prediction
- Sobel edge loss on alpha for boundary sharpness
- Masked L1 loss on foreground prediction (only at non-transparent pixels)
- MSE loss on error map prediction (self-supervised from alpha residuals)
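A hedged sketch of how these four terms might combine. Equal weighting and the exact Sobel formulation are assumptions; the repository's loss may weight or formulate the terms differently:

```python
import torch
import torch.nn.functional as F

def sobel_edges(x):
    """Approximate image gradients of a (B,1,H,W) tensor with Sobel kernels."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    kx = kx.view(1, 1, 3, 3).to(x.device, x.dtype)
    ky = kx.transpose(2, 3)
    return F.conv2d(x, kx, padding=1), F.conv2d(x, ky, padding=1)

def base_loss(pred_pha, pred_fgr, pred_err, true_pha, true_fgr):
    """Sketch of the Stage-1 loss: alpha L1 + Sobel edges + masked fgr L1
    + error-map MSE regressing the alpha residual (self-supervised)."""
    # L1 on the alpha matte
    loss = F.l1_loss(pred_pha, true_pha)
    # Sobel edge loss on alpha for boundary sharpness
    gx_p, gy_p = sobel_edges(pred_pha)
    gx_t, gy_t = sobel_edges(true_pha)
    loss = loss + F.l1_loss(gx_p, gx_t) + F.l1_loss(gy_p, gy_t)
    # Foreground L1 only at non-transparent pixels
    mask = (true_pha > 0).to(pred_fgr.dtype)
    loss = loss + F.l1_loss(pred_fgr * mask, true_fgr * mask)
    # Error map regresses |pred_pha - true_pha| (target detached)
    loss = loss + F.mse_loss(pred_err, (pred_pha - true_pha).abs().detach())
    return loss
```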
What happens:
- Mixed-precision forward pass through encoder-ASPP-decoder
- Online compositing: src = fgr * pha + bgr * (1 - pha)
- Shadow augmentation on 30% of samples
- Gaussian noise on 40% of samples
- Color jitter on 80% of background images
- Periodic validation, TensorBoard logging, and checkpoint saving
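The compositing and stochastic augmentation steps above can be sketched as follows. Only the compositing equation and the probabilities come from the text; the shadow and noise implementations here are simplified stand-ins for the repository's augmentations:

```python
import torch

def composite_with_augmentation(fgr, pha, bgr, shadow_p=0.3, noise_p=0.4):
    """Composite a training source image on the fly.

    fgr: (B,3,H,W) foreground, pha: (B,1,H,W) alpha, bgr: (B,3,H,W) background,
    all in [0, 1]. shadow_p / noise_p follow the probabilities in the text.
    """
    if torch.rand(1).item() < shadow_p:
        # Simplified shadow stand-in: darken the background under a soft mask
        mask = torch.rand_like(pha)
        bgr = bgr * (1 - 0.3 * mask)
    # Alpha compositing: src = fgr * pha + bgr * (1 - pha)
    src = fgr * pha + bgr * (1 - pha)
    if torch.rand(1).item() < noise_p:
        src = (src + 0.02 * torch.randn_like(src)).clamp(0, 1)
    return src
```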
Step 4: Refinement model training
Initialize MattingRefine with the trained base model weights and train the full network end-to-end at high resolution (2048x2048) using multi-GPU DistributedDataParallel. The refiner module learns to selectively upsample patches at error-prone regions identified by the base network's error map. Training uses the "sampling" refinement mode, which selects a fixed number of pixels to refine per forward pass.
Key considerations:
- Uses NCCL backend for distributed communication
- SyncBatchNorm ensures consistent batch statistics across GPUs
- Lower learning rates than base training: backbone/ASPP at 5e-5, decoder at 1e-4, refiner at 3e-4
- Loss operates at both coarse and full resolution with Sobel edge terms at both scales
- Batch size is split evenly across available GPUs
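The differential learning rates and the SyncBatchNorm-before-DDP wrapping order can be sketched as below. The function names are illustrative, and the sketch assumes the model exposes `backbone`/`aspp`/`decoder`/`refiner` submodules as MattingRefine does:

```python
import torch

def make_refine_optimizer(model):
    """Adam with the Stage-2 differential learning rates from the text."""
    return torch.optim.Adam([
        {'params': model.backbone.parameters(), 'lr': 5e-5},
        {'params': model.aspp.parameters(),     'lr': 5e-5},
        {'params': model.decoder.parameters(),  'lr': 1e-4},
        {'params': model.refiner.parameters(),  'lr': 3e-4},
    ])

def wrap_distributed(model, rank):
    """Convert BatchNorm to SyncBatchNorm, then wrap in DDP on this rank.

    Assumes torch.distributed.init_process_group('nccl', ...) has already
    run in this worker process.
    """
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return torch.nn.parallel.DistributedDataParallel(
        model.cuda(rank), device_ids=[rank])
```

With this setup, a global batch size of B yields B // world_size samples per GPU, matching the even split noted above.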
Step 5: Checkpoint evaluation and selection
Review training logs in TensorBoard to monitor convergence of training and validation losses. Select the best checkpoint based on validation loss. The training scripts save checkpoints at regular intervals and at the end of each epoch, producing .pth state dict files suitable for inference or export.
What happens:
- TensorBoard logs contain scalar loss curves and visual predictions (alpha, foreground, composite, error maps)
- Checkpoints are saved to the checkpoint/{model-name}/ directory
- Each checkpoint contains the model's state_dict (weights only, not architecture)
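Because checkpoints contain weights only, inference code must first reconstruct the architecture with the same hyperparameters used during training (backbone, refine mode, and so on) before restoring. A minimal loading sketch (the helper name is illustrative):

```python
import torch

def load_checkpoint_for_inference(model, checkpoint_path, device='cpu'):
    """Restore a saved state_dict into a freshly constructed model.

    model must be built with the same architecture hyperparameters that
    produced the checkpoint, since only weights are stored.
    """
    state = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(state)
    model.to(device).eval()  # disable dropout/BN updates for inference
    return model
```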