Heuristic:PeterL1n BackgroundMattingV2 Mixed Precision Training

Knowledge Sources	BackgroundMattingV2 PyTorch AMP
Domains	Deep_Learning, Optimization
Last Updated	2026-02-09 02:00 GMT

Overview

Use `torch.cuda.amp.autocast` with `GradScaler` for mixed-precision training to reduce VRAM usage and speed up training, with no quality loss observed for this architecture.

Description

Both training scripts (`train_base.py` and `train_refine.py`) use PyTorch's automatic mixed precision (AMP) via the `autocast` context manager and `GradScaler`. Inside `autocast`, the forward pass runs in float16 where that is numerically safe, while precision-sensitive operations (such as loss reductions) are kept in float32. The model parameters and their gradients also stay in float32. `GradScaler` multiplies the loss by a scale factor before backpropagation so that small float16 gradients do not underflow to zero, and it adjusts that factor dynamically during training.
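
To make the underflow problem concrete, the following is a minimal sketch that uses only Python's standard library rather than PyTorch. The gradient value `1e-8` and the helper `to_float16` are hypothetical illustrations, not taken from the training scripts. The scale `2**16` is chosen for illustration and is, per the PyTorch documentation, the default initial scale of `GradScaler`.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8        # hypothetical small gradient value (assumption)
scale = 2.0 ** 16  # illustrative loss scale (assumption)

# Stored directly in float16, the gradient is below the smallest
# representable magnitude and rounds to zero.
unscaled = to_float16(grad)

# Scaling the loss scales every gradient by the same factor, which
# moves this value into float16's representable range.
scaled = to_float16(grad * scale)

# Dividing by the scale in higher precision recovers the gradient.
recovered = scaled / scale

print(unscaled, scaled, recovered)
```

The recovered value is close to, but not exactly, the original gradient, because float16 keeps only about three significant decimal digits.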

Usage

Use this heuristic when training MattingBase or MattingRefine models. Mixed precision is enabled by default in both training scripts. It is particularly important for refinement training at 2048x2048 resolution where VRAM is a bottleneck.

The Insight (Rule of Thumb)

Action: Wrap the forward pass and loss computation in `autocast()`, and use `GradScaler` for backward/step.
Pattern:
1. Create `scaler = GradScaler()` before the training loop.
2. Wrap forward + loss in `with autocast():`.
3. Call `scaler.scale(loss).backward()` instead of `loss.backward()`.
4. Call `scaler.step(optimizer)` instead of `optimizer.step()`.
5. Call `scaler.update()` after each step.
Trade-off: Reduces VRAM by ~30-40% and speeds up training on Tensor Core GPUs (Volta+). No quality degradation observed in this architecture.
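
The five steps above can be sketched as a self-contained loop. This is a toy illustration under stated assumptions, not the repository's code: the single `Conv2d` layer, the random tensors, the L1 loss, and the `use_amp` flag are hypothetical stand-ins for the matting model, its data, and `compute_loss`. The `enabled` arguments are included only so the sketch falls back to plain float32 on machines without CUDA.

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# AMP via torch.cuda.amp needs a CUDA device; otherwise run in float32.
use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")

# Hypothetical stand-in for the matting network (assumption).
model = nn.Conv2d(6, 1, kernel_size=3, padding=1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

scaler = GradScaler(enabled=use_amp)            # step 1

losses = []
for _ in range(3):
    # Random tensors standing in for source+background input and alpha target.
    src_bgr = torch.rand(2, 6, 32, 32, device=device)
    true_pha = torch.rand(2, 1, 32, 32, device=device)

    with autocast(enabled=use_amp):             # step 2: forward + loss
        pred_pha = model(src_bgr)
        loss = nn.functional.l1_loss(pred_pha, true_pha)

    scaler.scale(loss).backward()               # step 3: scaled backward
    scaler.step(optimizer)                      # step 4: unscale, then step
    scaler.update()                             # step 5: adjust the scale
    optimizer.zero_grad()
    losses.append(loss.item())

print(losses)
```

With `enabled=False`, `scaler.scale(loss)` returns the loss unchanged and `scaler.step(optimizer)` simply calls `optimizer.step()`, so the same loop body works with or without mixed precision.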

Reasoning

The matting network's forward pass involves large feature maps at high resolution. Mixed precision allows these tensors to use float16, halving their memory footprint. The `GradScaler` dynamically adjusts the loss scale to prevent gradient underflow that can occur with float16. This is standard practice for modern deep learning training and is critical for fitting the 2048x2048 training resolution of the refine stage within typical GPU VRAM budgets.
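
The memory argument can be checked with simple arithmetic. The sketch below assumes a single hypothetical activation tensor with batch size 1 and 32 channels at 2048x2048; the channel count and the helper `tensor_mib` are illustrative assumptions, not values read from the model.

```python
def tensor_mib(batch: int, channels: int, height: int, width: int,
               bytes_per_element: int) -> float:
    """Return the size of a dense tensor in MiB."""
    return batch * channels * height * width * bytes_per_element / 2**20


# Hypothetical feature map shape (assumption): 1 x 32 x 2048 x 2048.
fp32_mib = tensor_mib(1, 32, 2048, 2048, bytes_per_element=4)
fp16_mib = tensor_mib(1, 32, 2048, 2048, bytes_per_element=2)

print(f"float32: {fp32_mib:.0f} MiB, float16: {fp16_mib:.0f} MiB")
# prints "float32: 512 MiB, float16: 256 MiB"
```

Only tensors that autocast actually stores in float16 are halved. Parameters, optimizer state, and operations kept in float32 are unaffected, which is one plausible reason the overall saving is smaller than 50%.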

Code evidence from `train_base.py:25-26,133`:

from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()

Training loop pattern from `train_base.py:184-191`:

with autocast():
    pred_pha, pred_fgr, pred_err = model(true_src, true_bgr)[:3]
    loss = compute_loss(pred_pha, pred_fgr, pred_err, true_pha, true_fgr)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

The same pattern is used in `train_refine.py:211-218` for distributed training.
