Environment:PeterL1n BackgroundMattingV2 PyTorch CUDA

Knowledge Sources	BackgroundMattingV2 PyTorch
Domains	Computer_Vision, Deep_Learning
Last Updated	2026-02-09 02:00 GMT

Overview

Linux environment with CUDA-capable GPU, Python 3, PyTorch 1.7+, torchvision 0.8+, and kornia 0.4+ for background matting training and inference.

Description

This environment provides the GPU-accelerated runtime required for all BackgroundMattingV2 operations: training, inference, model export, and speed testing. The stack is built on PyTorch with CUDA support and includes specialized computer vision libraries: kornia for differentiable augmentations and loss computation, and OpenCV for video I/O. Training uses mixed precision via `torch.cuda.amp` (`autocast` plus `GradScaler`) and, optionally, distributed data-parallel training over the NCCL backend. ONNX export and validation additionally require `onnxruntime`, which is otherwise optional.

Usage

Use this environment for all BackgroundMattingV2 workflows: base model training, refinement model training, image/video/webcam inference, TorchScript export, ONNX export, and speed benchmarking. Every implementation in this repository moves models and tensors to a device with `.cuda()` or `.to(device)`, and a CUDA-capable GPU is expected for standard operation. The image and video inference pipelines also accept `--device cpu`, but CPU-only inference is too slow for real-time use.

System Requirements

Category	Requirement	Notes
OS	Linux (Ubuntu recommended)	Webcam plugin only works on Linux; on Windows, pass `--num-workers 0`
Hardware	NVIDIA GPU	RTX 2080 Ti achieves 4K@30fps, HD@60fps per paper
VRAM	8GB minimum	16GB+ recommended for training at 2048x2048 resolution
Disk	10GB+ SSD	For datasets, checkpoints, and TensorBoard logs

Dependencies

System Packages

CUDA toolkit (compatible with PyTorch 1.7+)
`ffmpeg` (for video encoding/decoding)
`v4l2loopback` (optional, for virtual webcam on Linux)

Python Packages

`torch` == 1.7.0
`torchvision` == 0.8.1
`kornia` == 0.4.1
`tensorboard` == 2.3.0
`tqdm` == 4.51.0
`opencv-python` == 4.4.0.44
`onnxruntime` == 1.6.0 (optional, for ONNX export validation)

Credentials

No credentials or API keys are required. All data and model weights are loaded from local filesystem paths configured in `data_path.py`.

The distributed training backend reads the `MASTER_ADDR` and `MASTER_PORT` environment variables, but the training script sets them automatically (`localhost` with a random port in the range 12300-12399), so no manual configuration is needed.
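
The rendezvous setup described above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper name `configure_rendezvous` and its parameters are hypothetical, and only the address (`localhost`) and port range (12300-12399) come from the description above.

```python
import os
import random


def configure_rendezvous(addr="localhost", port_range=(12300, 12399)):
    """Set MASTER_ADDR / MASTER_PORT for a single-node run and return them.

    Hypothetical helper for illustration only.
    """
    # random.randint is inclusive on both ends, so 12399 can be chosen.
    port = str(random.randint(port_range[0], port_range[1]))
    os.environ["MASTER_ADDR"] = addr
    os.environ["MASTER_PORT"] = port  # environment values must be strings
    return addr, port


addr, port = configure_rendezvous()
print(f"rendezvous at {addr}:{port}")
```

Picking the port at random reduces the chance that two training runs started on the same machine collide on the same rendezvous port, although it does not rule it out.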

Quick Install

pip install torch==1.7.0 torchvision==0.8.1 kornia==0.4.1 tensorboard==2.3.0 tqdm==4.51.0 opencv-python==4.4.0.44

# For ONNX export validation (optional)
pip install onnxruntime==1.6.0
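
After installation, the pinned versions can be checked against what is actually installed. The sketch below is an assumption-laden convenience, not part of the repository: `PINNED` and `check_pins` are hypothetical names, and it relies on `importlib.metadata`, which requires Python 3.8 or newer.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Python Packages list above (onnxruntime is optional).
PINNED = {
    "torch": "1.7.0",
    "torchvision": "0.8.1",
    "kornia": "0.4.1",
    "tensorboard": "2.3.0",
    "tqdm": "4.51.0",
    "opencv-python": "4.4.0.44",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for each mismatch or missing package."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        # A local build tag such as "+cu110" does not change the release number.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is present at the expected release.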

Code Evidence

CUDA usage in training from `train_base.py:121`:

model = MattingBase(args.model_backbone).cuda()

Mixed-precision training setup from `train_base.py:25-26,133,184,188-190`:

from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()

with autocast():
    pred_pha, pred_fgr, pred_err = model(true_src, true_bgr)[:3]
    loss = compute_loss(pred_pha, pred_fgr, pred_err, true_pha, true_fgr)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Distributed training with NCCL from `train_refine.py:74-75,83-85`:

distributed_num_gpus = torch.cuda.device_count()
assert args.batch_size % distributed_num_gpus == 0

# addr and port are chosen by the training script (see Credentials above).
os.environ['MASTER_ADDR'] = addr
os.environ['MASTER_PORT'] = port
dist.init_process_group("nccl", rank=rank, world_size=distributed_num_gpus)

SyncBatchNorm and DistributedDataParallel from `train_refine.py:147-148`:

model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
model_distributed = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

Device selection for inference from `inference_video.py:62,115`:

parser.add_argument('--device', type=str, choices=['cpu', 'cuda'], default='cuda')
device = torch.device(args.device)
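
The excerpt above passes the requested device straight to `torch.device`. A caller that wants to degrade gracefully on a machine without a GPU could add a fallback, sketched below. The fallback is an assumption for illustration and is not something the repository's scripts are documented to do; `resolve_device` is a hypothetical helper.

```python
import argparse


def resolve_device(requested, cuda_available):
    """Return 'cpu' when CUDA was requested but is unavailable; else the request."""
    if requested == "cuda" and not cuda_available:
        return "cpu"
    return requested


parser = argparse.ArgumentParser()
parser.add_argument('--device', type=str, choices=['cpu', 'cuda'], default='cuda')
args = parser.parse_args([])  # empty argument list: use the defaults, for illustration

try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    # Lets the sketch run even where PyTorch is not installed.
    cuda_available = False

device_name = resolve_device(args.device, cuda_available)
print(f"running on {device_name}")
```

Keeping the decision in a pure function makes it easy to test without a GPU; the result can then be passed to `torch.device(device_name)`.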

Common Errors

Error Message	Cause	Solution
`CUDA out of memory`	Insufficient VRAM for training resolution	Reduce `--batch-size` or use lower resolution crops
`RuntimeError: Expected all tensors to be on the same device`	Mixed CPU/GPU tensors	Ensure all inputs are moved to the same device with `.cuda()` or `.to(device)`
`AssertionError: batch_size % distributed_num_gpus == 0`	Batch size not divisible by GPU count	Set `--batch-size` to a multiple of the number of available GPUs
`dist.init_process_group` failures	NCCL backend not available	Install PyTorch with CUDA support; ensure NCCL is available
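
The divisibility requirement in the table can be checked before launching a run. The sketch below assumes the total batch is split evenly across GPUs, which is what the assertion implies; both function names are hypothetical.

```python
def per_gpu_batch_size(batch_size, num_gpus):
    """Return the per-GPU share of the batch, or raise if it does not divide evenly."""
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    if batch_size % num_gpus != 0:
        raise ValueError(
            f"batch size {batch_size} is not divisible by {num_gpus} GPUs"
        )
    return batch_size // num_gpus


def nearest_valid_batch_size(batch_size, num_gpus):
    """Round down to a multiple of num_gpus, but never below one sample per GPU."""
    return max(num_gpus, (batch_size // num_gpus) * num_gpus)


print(per_gpu_batch_size(8, 4))
print(nearest_valid_batch_size(10, 4))
```

For example, a requested batch size of 10 on 4 GPUs would be adjusted to 8, giving 2 samples per GPU.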

Compatibility Notes

Windows: The `inference_images.py` script documents that Windows requires `--num-workers 0` (data loading in the main process, with no DataLoader worker processes) due to multiprocessing limitations.
CPU inference: Image and video inference scripts support `--device cpu` but performance is not practical for real-time applications.
Webcam plugin: The virtual camera feature only works on Linux with `v4l2loopback`.
ONNX Runtime: The authors note ONNX inference is "much slower than PyTorch/TorchScript" and recommend PyTorch or TorchScript for production use.
Real-time performance: The paper reports 4K@30fps and HD@60fps on an RTX 2080 Ti GPU. The provided video scripts are not real-time due to software video encoding overhead.
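
To put the reported frame rates in perspective, the per-frame time budget and pixel throughput can be worked out directly. The sketch assumes 4K means 3840x2160 and HD means 1920x1080; the function names are hypothetical.

```python
def frame_budget_ms(fps):
    """Maximum time per frame, in milliseconds, to sustain the given frame rate."""
    return 1000.0 / fps


def pixels_per_second(width, height, fps):
    """Number of pixels that must be processed each second."""
    return width * height * fps


# Assumed resolutions: 4K = 3840x2160, HD = 1920x1080.
for label, (w, h, fps) in {"4K@30fps": (3840, 2160, 30), "HD@60fps": (1920, 1080, 60)}.items():
    print(f"{label}: {frame_budget_ms(fps):.1f} ms/frame, "
          f"{pixels_per_second(w, h, fps):,} pixels/s")
```

Under these assumptions, 4K at 30 fps leaves about 33.3 ms per frame and requires twice the pixel throughput of HD at 60 fps, which leaves about 16.7 ms per frame. Any decoding or encoding time spent outside the model counts against the same budget.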
